Deep learning + robotics hasn't lived up to its promise since 2015, but this time might be different.
Xander Dunn, 1 January 2024
In 2014 I was working at Apple on iOS and the first version of watchOS when I saw DeepMind's Atari results making headlines. At that moment I decided AI was the future and focused my career on it. In 2015 I joined Osaro, a deep RL for robotics startup. At the time I told my friends Osaro or a company like it would be worth as much as Twitter within 5 years. That did not happen! I don't believe any robot intelligence company has rivaled Twitter's value ($10B–$60B over that period) at any point since 2015. Unlike text, audio, and images, AI for robotics has not had a breakout moment with proliferating use cases.
When discussing intelligent robots, many are quick to bring up Boston Dynamics. As far as I can tell, Boston Dynamics robots have no intelligence. Boston Dynamics has built some impressive hardware and uses classical methods for stabilization. Some are quick to respond, "Well, that's the hard part." Clearly it isn't the hard part, or robotics would be solved and useful and we would see these robots everywhere in our daily lives. By definition, the hard part is whatever no one can do yet. Boston Dynamics has failed to find market success and has been passed around like a hot potato: Google bought it in 2013, sold it to SoftBank in 2017, and SoftBank sold it to Hyundai in 2020. Boston Dynamics produces cool-looking hard-coded demos but has achieved very few use cases.
At Osaro we were working with KUKA iiwa industrial robot arms to perform pick-and-place tasks, with the goal of automating industrial manufacturing. We quickly realized that data was a limiting factor and turned to imitation learning. A human with a PlayStation controller would manually drive the robot through tasks, the model would be trained on these trajectories, and it would indeed generalize to new situations. The approach remained data-limited because it required human-in-the-loop data collection, and the samples were specialized trajectories from controlling a robot whose motion is non-intuitive to a human operator. We briefly pursued simulation as another avenue for overcoming data limitations but never focused on it enough to benefit at scale. The sim2real step of transferring what is learned in simulation into physical-world action takes real effort to make the simulated data useful.
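To make the imitation-learning setup concrete, here is a minimal behavior-cloning sketch in the spirit of that approach (not Osaro's actual stack; the network sizes, names, and training loop are purely illustrative): train a policy to reproduce the action the human teleoperator took given the same observation.

```python
# Minimal behavior-cloning sketch (illustrative only): learn a mapping from
# robot observations to the actions a human teleoperator took.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def train_bc(policy, demos, epochs=10, lr=1e-3):
    """demos: iterable of (observation, action) tensor pairs recorded from
    human teleoperation, e.g. with a game controller."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for obs, act in demos:
            pred = policy(obs)
            loss = loss_fn(pred, act)   # match the human's recorded action
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```

The catch, as noted above, is that every (observation, action) pair in `demos` has to come from a human driving the robot, which is exactly the data bottleneck.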
I think the wall we're hitting here is Sutton's Bitter Lesson: scale is the secret ingredient to getting our state-of-the-art learning algorithms to work well. We can't have scale without data, and the physical world is slow and expensive at producing it. ChatGPT has the entire Internet of text to learn from. Midjourney and DALL·E have the entire Internet of images to learn from. Every hour of every day, billions of Internet-connected humans generate new data for these models to learn from. Those massive datasets are what make it possible to train massive models. Every space where AI has achieved breakout usage has been built on the back of a massive dataset. But in robotics, we don't have billions of people generating data on how a robot should move and behave in the world. We don't have massive robotics control datasets.
Across the board, the robot intelligence industry has hit these walls and scaled back research and funding. In 2021, OpenAI abandoned its robotics research entirely. Amazon abandoned its home delivery robot. Amazon axed its re:MARS robotics and AI conference. Alphabet's industrial robotics company Intrinsic is laying people off. New LLM companies just a couple of months old are raking in hundreds of millions of dollars with below-SOTA results and murky business models, while robot intelligence companies are doing layoffs, tightening belts, and struggling to raise funding.
As a result of these challenges, it has become widely believed among deep learning researchers that physical embodiment may be the last thing that AI conquers. A common line of thought is that it may be easier to train an AI that can write code to control a robot than it is to train an AI directly on robot control.
Historically, most robotics + AI companies, including Osaro, have focused solely on the software and left the hardware for someone else to figure out, or vice versa. Right now, there's a wave of companies attempting to tackle both problems. Tesla is particularly advantaged here because of its experience building Autopilot, the world's most abundant physically embodied AI. Boston Dynamics may have some of the best hardware people in the world, but it certainly does not have the best AI researchers in the world. Tesla is the very rare place that has both. Other companies working in this direction include Figure and 1X Technologies, which OpenAI invested in. At Osaro, when we encountered a limitation in the hardware, its cost, or its drivers, we had to accept it as a given. These vertically integrated companies can adapt their hardware directly to the needs of their software, which means the hardware is actually made for the AI that runs it.
It's important not just to collect lots of data, but to collect the right data. Given insufficient hardware sensors, it's possible to collect an enormous amount of data that is ultimately unlearnable. When the data itself conflicts on what action to take given the same inputs, a model can correctly learn high uncertainty and do nothing. This makes the feedback loop between hardware design and AI training critical. Designing the right hardware to make the data learnable is necessary for overcoming the aleatoric noise in physical world interaction.
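A toy example makes that failure mode concrete. Below, two demonstrations present the model with the same observation but opposite actions, say because whatever cue disambiguated them for the human was never captured by the robot's sensors. Under a mean-squared-error loss the optimal prediction is the average of the conflicting labels, which here is zero: do nothing. (The numbers and model are purely illustrative.)

```python
# Toy illustration: two demonstrations give the SAME observation but
# OPPOSITE actions (+1 and -1), because the disambiguating signal was
# never sensed. The MSE-optimal policy averages them and does nothing.
import torch
import torch.nn as nn

obs = torch.tensor([[1.0], [1.0]])        # identical inputs
act = torch.tensor([[+1.0], [-1.0]])      # conflicting labels

model = nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(500):
    loss = ((model(obs) - act) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(model(torch.tensor([[1.0]])))       # ≈ 0: the mean of the conflicting
                                          # labels, i.e. "do nothing"
```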
One of the revelations in deep learning over the past couple of years has been the value of foundation models. If we pre-train a language model on massive amounts of text, we find that this pre-trained model generalizes to new language tasks very readily. With in-context learning or fine-tuning, we get state-of-the-art, useful performance on nearly any language task from just a few examples of the new task we're trying to solve. This has greatly improved the data efficiency of deep learning for many NLP use cases. The same is true for images and audio. Our own experience training LLMs on 8,000 hardware accelerators shows impressive generalization and fine-tuning capabilities, though data quality remains important. Now research is focused on applying the same insights to robotics. These recent papers have shown some success in applying the foundation model approach to robotics:
Foundation models are not without their own challenges, of course:
The holy grail is a pre-trained robot that can generalize to new tasks with either in-context learning or minor fine-tuning on a few human demonstrations. The use of pre-trained vision and language models is a good start. The next step requires a progressively growing dataset from robot interactions.
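As a rough sketch of what that adaptation step could look like, the snippet below freezes a hypothetical pre-trained backbone and fine-tunes only a small action head on a handful of demonstrations of the new task. The module names and interfaces are assumptions for illustration, not any particular paper's method.

```python
# Sketch of few-shot adaptation on top of a hypothetical pre-trained
# backbone; names and interfaces are illustrative assumptions.
import torch
import torch.nn as nn

def finetune_on_demos(pretrained: nn.Module, head: nn.Module, demos,
                      steps=200, lr=1e-4):
    """Freeze the pre-trained backbone and adapt only a small action head
    on a handful of (observation, action) demonstrations of the new task."""
    for p in pretrained.parameters():
        p.requires_grad = False                  # keep the general features
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(steps):
        for obs, act in demos:                   # a few trajectories, not millions
            features = pretrained(obs)
            loss = ((head(features) - act) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```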
I'm not the first to describe robot intelligence in terms of a Bitter Lesson data limitation that might be lifted by foundation models. See also Karol Hausman, a DeepMind coauthor on several of the papers listed above, making this point here.
The form factor of the humanoid, with hands and fingers, could be incredibly important for breaking through the data limitation of the physical world. I expect that (human video demonstration) → (industrial robot control) is solvable, but much more challenging than (human video demonstration) → (humanoid robot control). In the latter case, there's a direct mapping between digits, size, movements, and positions. As an example, Autopilot intervention data is a close 1:1 mapping: if the human pushes the brake, then the AI should push the brake. Unlike an industrial robot arm, a humanoid robot has a close mapping to human action: if the human puts two fingers here and three fingers there, then the robot should do the same. This potentially reduces much of the problem to animation retargeting. There are of course plenty of edge cases and exceptions; a classic Autopilot exception is humans rolling through stop signs rather than coming to a full stop. On the whole, the mapping from human actions to human-like robot actions is much clearer than from human actions to very inhuman robot actions.
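A toy sketch of that retargeting idea: with a humanoid, the mapping can be as simple as a joint-for-joint lookup plus clamping to the robot's joint limits. The joint names and limits below are invented for illustration; real retargeting also has to account for differing limb lengths, masses, and dynamics.

```python
# Toy retargeting sketch, assuming a hypothetical 1:1 joint correspondence
# between a motion-capture skeleton and a humanoid robot.
HUMAN_TO_ROBOT_JOINT = {
    "right_shoulder_pitch": "r_shoulder_pitch",
    "right_elbow": "r_elbow",
    "right_wrist_roll": "r_wrist_roll",
    # ... one entry per tracked joint
}

JOINT_LIMITS = {                       # radians, invented for illustration
    "r_shoulder_pitch": (-1.5, 1.5),
    "r_elbow": (0.0, 2.4),
    "r_wrist_roll": (-3.0, 3.0),
}

def retarget(human_pose: dict) -> dict:
    """Map human joint angles (radians) onto robot joint targets,
    clamping each to the robot's joint limits."""
    robot_pose = {}
    for human_joint, angle in human_pose.items():
        robot_joint = HUMAN_TO_ROBOT_JOINT.get(human_joint)
        if robot_joint is None:
            continue                   # a joint the robot doesn't have
        lo, hi = JOINT_LIMITS[robot_joint]
        robot_pose[robot_joint] = max(lo, min(hi, angle))
    return robot_pose
```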
On our way to the holy grail of real-time in-context learning from human example, we need to pursue paths that produce many orders of magnitude more robotics data than we've seen to date. The ability for a human to naturally demonstrate a task helps with collecting data at the required scale. Controlling a robot with completely different digits is challenging work, but any factory line worker or person doing the dishes could put on the equipment needed to record all their movements in a form that is immediately operable by a robot of the same form factor. Autopilot is a great example of this gradual approach: it was first released with mediocre performance and the human driver did most of the work, and over time the human intervention data has accumulated into a massive dataset that continuously improves Autopilot. We could imagine a similar feedback loop between human intervention and a humanoid robot.
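A sketch of that feedback loop, with hypothetical interfaces (the `env`, `human`, and `policy` objects are stand-ins, not any real API): the robot acts on its own policy, and whenever the human intervenes, the correction is logged as new training data.

```python
# Autopilot-style intervention loop (hypothetical interfaces): the robot
# acts autonomously, and every human correction becomes training data.
def run_episode(policy, env, human, dataset):
    obs = env.reset()
    done = False
    while not done:
        robot_action = policy(obs)
        if human.is_intervening():
            action = human.action()
            dataset.append((obs, action))   # corrections feed the next training run
        else:
            action = robot_action
        obs, done = env.step(action)
    return dataset
```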
If we achieve the holy grail of a pre-trained robot that can learn how to perform new tasks from a video of a human, then we may have a chance of breaking through the wall of the Bitter Lesson. We might then have billions of people generating useful data for robots every hour of every day simply by going about their daily lives in the physical world.
Of course, a humanoid form factor isn't likely to be optimal for performing every task, but ultimately the hardware that unlocks a flood of data will be the one that's most useful.
One of the pitfalls AI must avoid is the allure of replacing a small number of minimum-wage jobs. The economics don't work out. If it takes $20B of R&D to make an intelligent robot product, how do we ever expect to make that back by replacing a small number of $14/hr jobs? Early deep RL was extremely good at playing video games, and many people wanted to sell it to video game companies for QA testing. This is a dreadful business idea because there are millions of people willing to playtest unreleased video games for little more than a donut and some swag. Likewise, I'm uncertain how economical it is to focus on replacing low-wage, low-skill factory line work. I don't know what business ambitions are behind Tesla Bot and other humanoid robots.
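A quick back-of-envelope calculation shows why. Assuming roughly 2,000 working hours per year (an assumption for illustration), a $14/hr job costs about $28,000 a year, so recouping $20B of R&D from labor savings alone takes on the order of 700,000 job-years, before counting hardware, maintenance, or margin.

```python
# Back-of-envelope check on the economics (illustrative numbers only).
r_and_d = 20e9                                 # $20B of R&D
wage = 14                                      # $/hr
hours_per_year = 2000                          # roughly full time
savings_per_job_year = wage * hours_per_year   # ≈ $28,000

# Ignoring hardware, maintenance, and margins:
print(r_and_d / savings_per_job_year)          # ≈ 714,000 job-years to break even
```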
The approach outlined above is by no means a straight shot to success. It remains to be seen whether the wall of the Bitter Lesson can finally be broken through with sufficient amounts of sufficiently high-quality data on real-world actions, and whether in-context learning can successfully solve a new task on a pre-trained robot. It's always very difficult to predict timelines, but I'm optimistic this cycle in robotics will produce some useful and scalable applications.