Covariant Uses Simple Robot and Gigantic Neural Net to Automate Warehouse Picking

Two years ago, we wrote about an AI startup from UC Berkeley and OpenAI called Embodied Intelligence, founded by robot laundry-folding expert Pieter Abbeel. What exactly Embodied was going to do wasn't entirely clear, and honestly, it seemed like Embodied itself didn't really know. They talked about building technology that enables existing robot hardware to handle a much wider range of tasks where existing solutions break down, and gave some examples of how that might be applied (including in manufacturing and logistics), but nothing more concrete.

Since then, a few things have happened. Thing one is that Embodied is now Covariant. Thing two is that Covariant spent almost a year talking with literally hundreds of different companies about how smarter robots could potentially make a difference for them. These companies represent sectors that include electronics manufacturing, car manufacturing, textiles, bio labs, construction, farming, hotels, elder care: "pretty much anything you could think about where maybe a robot could be helpful," Pieter Abbeel tells us. "Over time, it became clear to us that manufacturing and logistics are the two spaces where there's most demand now, and logistics especially is just hurting really hard for more automation." And the really hard part of logistics is what Covariant decided to tackle.

There's already a huge amount of automation in logistics, but as Abbeel explains, in warehouses there are two separate categories that need automation: "The things that people do with their legs and the things that people do with their hands." The leg automation has largely been taken care of over the last five or 10 years through a mixture of conveyor systems, mobile retrieval systems, Kiva-like mobile shelving, and other mobile robots. "The pressure now is on the hand part," Abbeel says. "It's about how to be more efficient with things that are done in warehouses with human hands."

A huge chunk of human-hand tasks in warehouses comes down to picking. That is, taking products out of one box and putting them into another box. In the logistics industry, the boxes are usually called totes, and each individual kind of product is referred to by its stock keeping unit number, or SKU. Big warehouses can have anywhere from thousands to millions of SKUs, which poses an enormous challenge to automated systems. As a result, most existing automated picking systems in warehouses are fairly limited. Either they're specifically designed to pick a particular class of things, or they have to be trained to recognize more or less every individual thing you want them to pick. Obviously, in warehouses with millions of different SKUs, traditional methods of recognizing or modeling specific objects are not only impractical in the short term, but would also be virtually impossible to scale.

This is why humans are still used in picking: we have the ability to generalize. We can look at an object and understand how to pick it up because we have a lifetime of experience with object recognition and manipulation. We're incredibly good at it, and robots aren't. "From the very beginning, our vision was to ultimately work on very general robotic manipulation tasks," says Abbeel. "The way automation's going to expand is going to be robots that are capable of seeing what's around them, adapting to what's around them, and learning things on the fly."

Covariant is tackling this with relatively simple hardware, including an off-the-shelf industrial arm (which can be just about any arm), a suction gripper (more on that later), and a straightforward 2D camera system that doesn't rely on lasers or pattern projection or anything like that. What couples the vision system to the suction gripper is one single (and very, very large) neural network, which is what helps Covariant to be cost effective for customers. "We can't have specialized networks," says Abbeel. "It has to be a single network able to handle any kind of SKU, any kind of picking station. In terms of being able to understand what's happening and what's the right thing to do, that's all unified. We call it Covariant Brain, and it's obviously not a human brain, but it's the same notion that a single neural network can do it all."

We can talk about the challenges of putting picking robots in warehouses all day, but the reason why Covariant is making this announcement now is because their system has been up and running reliably and cost effectively in a real warehouse in Germany for the last four months.

This video is showing Covariant's robotic picking system operating (for over an hour at 10x speed) in a warehouse that handles logistics for a company called Obeta, which overnights orders of electrical supplies to electricians in Germany. The robot's job is to pick items from bulk storage totes, and add them to individual order boxes for shipping. The warehouse is managed by an automated logistics company called KNAPP, which is Covariant's first partner. "We were searching a long time for the right partner," says Peter Puchwein, vice president of innovation at KNAPP. "We looked at every solution out there. Covariant is the only one that's ready for real production." He explains that Covariant's AI is able to detect glossy, shiny, and reflective products, including products in plastic bags. "The product range is nearly unlimited, and the robotic picking station has the same or better performance than humans."

The key to being able to pick such a wide range of products so reliably, explains Abbeel, is being able to generalize. "Our system generalizes to items it's never seen before. Being able to look at a scene and understand how to interact with individual items in a tote, including items it's never seen before: humans can do this, and that's essentially generalized intelligence," he says. "This generalized understanding of what's in a bin is really key to success. That's the difference between a traditional system where you would catalog everything ahead of time and try to recognize everything in the catalog, versus fast-moving warehouses where you have many SKUs and they're always changing. That's the core of the intelligence that we're building."

To be sure, the details on how Covariant's technology works are still vague, but we tried to extract some more specifics from Abbeel, particularly about the machine learning components. Here's the rest of our conversation with him:

IEEE Spectrum: How was your system trained initially?

Pieter Abbeel: We would get a lot of data on what kind of SKUs our customer has, get similar SKUs in our headquarters, and just train, train, train on those SKUs. But it's not just a matter of getting more data. Actually, often there's a clear limit on a neural net where it's saturating. Like, we give it more data and more data, but it's not doing any better, so clearly the neural net doesn't have the capacity to learn about these new missing pieces. And then the question is, what can we do to re-architect it to learn about this aspect or that aspect that it's clearly missing out on?

You've done a lot of work on sim2real transfer. Did you end up using a bajillion simulated arms in this training, or did you have to rely on real-world training?

We found that you need to use both. You need to work both in simulation and the real world to get things to work. And as you're continually trying to improve your system, you need a whole different kind of testing: You need traditional software unit tests, but you also need to run things in simulation, you need to run it on a real robot, and you need to also be able to test it in the actual facility. It's a lot more levels of testing when you're dealing with real physical systems, and those tests require a lot of time and effort to put in place because you may think you're improving something, but you have to make sure that it's actually being improved.

What happens if you need to train your system for a totally new class of items?

The first thing we do is we just put new things in front of our robot and see what happens, and often it'll just work. Our system has few-shot adaptation, meaning that on-the-fly, without us doing anything, when it doesn't succeed it'll update its understanding of the scene and try some new things. That makes it a lot more robust in many ways, because if anything noisy or weird happens, or there's something a little bit new but not that new, you might do a second or third attempt and try some new things.

But of course, there are going to be scenarios where the SKU set is so different from anything it's been trained on so far that some things are not going to work, and we'll have to just collect a bunch of new data: what does the robot need to understand about these types of SKUs, how to approach them, how to pick them up. We can use imitation learning, or the robot can try on its own, because with suction, it's actually not too hard to detect if a robot succeeds or fails. You can get a reward signal for reinforcement learning. But you don't want to just use RL, because RL is notorious for taking a long time, so we bootstrap it off some imitation and then from there, RL can complete everything.

Why did you choose a suction gripper?

What's currently deployed is the suction gripper, because we knew it was going to do the job in this deployment, but if you think about it from a technological point of view, we also actually have a single neural net that uses different grippers. I can't say exactly how it's done, but at a high level, your robot is going to take an action based on visual input, but also based on the gripper that's attached to it, and you can also represent a gripper visually in some way, like a pattern of where the suction cups are. And so, we can condition a single neural network on both what it sees and the end-effector it has available. This makes it possible to hot-swap grippers if you want to. You lose some time, so you don't want to swap too often, but you could swap between a suction gripper and a parallel gripper, because the same neural network can use different gripping strategies.

And I would say this is a very common thread in everything we do. We really wanted to be a single, general system that can share all its learnings across different modalities, whether it's SKUs, end of arm tools, different bins you pick from, or other things that might be different. The expertise should all be sharable.

And one single neural net is versatile enough for this?

People often say neural networks are just black boxes and if you're doing something new you have to start from scratch. That's not really true. I don't think what's important about neural nets is that they're black boxes; that's not really where their strength comes from. Their strength comes from the fact that you can train end-to-end, you can train from input to the desired output. And you can put modular things in there, like neural nets that are an architecture that's well suited to visual information, versus end-effector information, and then they can merge their information loads to come to a conclusion. And the beauty is that you can train it all together, no problem.
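
Covariant has not published its architecture, so the following is only a toy illustration of the general pattern Abbeel describes: separate encoders for visual and end-effector inputs, whose outputs are merged and passed to a shared head. Everything in it is an assumption made for illustration, including the layer sizes, the binary suction-cup mask, the random untrained weights, and the hypothetical `score_picks` function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
IMAGE_SHAPE = (16, 16)    # grayscale camera image
GRIPPER_SHAPE = (8, 8)    # binary mask marking where suction cups sit
N_CANDIDATES = 5          # number of candidate pick actions to score


def relu(x):
    return np.maximum(x, 0.0)


def init_linear(n_in, n_out):
    """Random weights and zero bias for one fully connected layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


# One encoder per input type, plus a shared head over the merged features.
W_img, b_img = init_linear(IMAGE_SHAPE[0] * IMAGE_SHAPE[1], 32)
W_grip, b_grip = init_linear(GRIPPER_SHAPE[0] * GRIPPER_SHAPE[1], 8)
W_head, b_head = init_linear(32 + 8, N_CANDIDATES)


def score_picks(image, gripper_mask):
    """Score candidate picks from what the camera sees AND the attached gripper."""
    img_feat = relu(image.reshape(-1) @ W_img + b_img)
    grip_feat = relu(gripper_mask.reshape(-1) @ W_grip + b_grip)
    merged = np.concatenate([img_feat, grip_feat])  # merge the two streams
    return merged @ W_head + b_head


# Example inputs: a random image and a gripper with two suction cups.
image = rng.random(IMAGE_SHAPE)
gripper_mask = np.zeros(GRIPPER_SHAPE)
gripper_mask[2, 2] = gripper_mask[5, 5] = 1.0

scores = score_picks(image, gripper_mask)
print(scores.shape)
```

Because both encoders feed a single head, a gradient from one training loss would reach every weight, which is the sense in which such a network can be trained "all together." The weights here are untrained, so the scores themselves mean nothing.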

When your system fails at a pick, what are the consequences?

Here's where things get very interesting. You think about bringing AI into the physical world. AI has been very successful already in the digital world, but the digital world is much more forgiving. There's a long tail of scenarios that you could encounter in the real world and you haven't trained against them, or you haven't hardcoded against them. And that's what makes it so hard and why you need really good generalization including few-shot adaptation and so forth.

Now let's say you want a system to create value. For a robot in a warehouse, does it need to be 100 percent successful? No, it doesn't. If, say, it takes a few attempts to pick something, that's just a slowdown. It's really the overall successful picks per hour that matter, not how often you have to try to get those picks. And so if periodically it has to try twice, it's really the picking rate that's affected, not the success rate that's affected. A true failure is one where human intervention is needed.
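
The difference between pick rate and success rate can be made concrete with a little arithmetic. The sketch below is not Covariant's model. It assumes each attempt succeeds independently with a fixed probability and that the robot gives up after a fixed number of attempts. Both assumptions, and the names `p_attempt` and `max_attempts`, are hypothetical. Real retries are not independent, since the system adapts between attempts.

```python
def true_failure_rate(p_attempt: float, max_attempts: int) -> float:
    """Probability that all max_attempts tries fail, so a human is needed."""
    return (1.0 - p_attempt) ** max_attempts


def expected_attempts(p_attempt: float, max_attempts: int) -> float:
    """Average number of tries spent per item, stopping at the first success."""
    # Try number i+1 only happens if the first i tries all failed.
    return sum((1.0 - p_attempt) ** i for i in range(max_attempts))


def picks_per_hour(attempts_per_hour: float, p_attempt: float, max_attempts: int) -> float:
    """Successful picks per hour for a robot making attempts_per_hour tries."""
    items_per_hour = attempts_per_hour / expected_attempts(p_attempt, max_attempts)
    return items_per_hour * (1.0 - true_failure_rate(p_attempt, max_attempts))


if __name__ == "__main__":
    p, k = 0.9, 3
    print(f"attempts per item: {expected_attempts(p, k):.3f}")
    print(f"true failure rate: {true_failure_rate(p, k):.4f}")
    print(f"picks per hour at 600 attempts/hour: {picks_per_hour(600, p, k):.1f}")
```

Under these assumptions, a robot that succeeds on 90 percent of individual attempts and retries up to three times spends about 1.11 attempts per item, but needs a human for only about one item in 1,000. Retries cost throughput, while only exhausted retries count as true failures.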

With true failures, where after repeated attempts the robot just can't pick an item, we'll get notified by that and we can then train on it, and the next day it might work, but at that moment it doesn't work. And even if a robotic deployment works 90 percent of the time, that's not good enough. A human picking station can range from 300 to 2000 picks per hour. 2000 is really rare and is peak pick for a very short amount of time, so if we look at the bottom of that range, 300 picks per hour, if we're succeeding 90 percent, that means 30 failures per hour. Wow, that's bad. At 30 fails per hour, fixing those up by a human probably takes more than an hour's worth of work. So what you've done now is you've created more work than you save, so 90 percent is definitely a no go.

At 99 percent that's 3 failures per hour. If it usually takes a couple of minutes for a human to fix, at that point, a human could oversee 10 stations easily, and that's where all of a sudden we're creating value. Or a human could do another job, and just keep an eye on the station and jump in for a moment to make sure it keeps running. If you had a 1000 per hour station, you'd need closer to 99.9 percent to get there and so forth, but that's essentially the calculus we've been doing. And that's when you realize how any extra nine you want to get is so much more challenging than the previous nine you've already achieved.
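
This calculus is easy to reproduce. The sketch below restates the numbers Abbeel quotes, with one assumption made explicit: a fixed two minutes of human time per failure (he says only "a couple of minutes"). The function names are hypothetical.

```python
def failures_per_hour(picks_per_hour: float, success_rate: float) -> float:
    """Expected picks per hour that end in a true failure (human needed)."""
    return picks_per_hour * (1.0 - success_rate)


def stations_per_human(picks_per_hour: float, success_rate: float,
                       minutes_per_fix: float = 2.0) -> float:
    """How many stations one person could cover if fixing were their only job."""
    fix_minutes = failures_per_hour(picks_per_hour, success_rate) * minutes_per_fix
    if fix_minutes == 0:
        return float("inf")  # no failures, no supervision needed
    return 60.0 / fix_minutes


if __name__ == "__main__":
    for picks, rate in [(300, 0.90), (300, 0.99), (1000, 0.99), (1000, 0.999)]:
        fails = failures_per_hour(picks, rate)
        stations = stations_per_human(picks, rate)
        print(f"{picks} picks/h at {rate:.1%}: "
              f"{fails:.0f} failures/h, {stations:.1f} stations per human")
```

With these inputs, a 300-pick-per-hour station at 90 percent produces 30 failures and a full hour of fix-up work every hour, so one person can cover a single station at best. At 99 percent it is 3 failures and 10 stations per person, matching the figures in the interview. A 1,000-pick-per-hour station at 99 percent is back up to 10 failures per hour, which is why the faster station needs roughly another nine.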

Photo: Elena Zhukova
Covariant co-founders (left to right): Tianhao Zhang, Rocky Duan, Peter Chen, and Pieter Abbeel.

There are other companies developing similar approaches to picking: industrial arms, vision systems, suction grippers, neural networks. What makes Covariant's system work better?

I think it's a combination of things. First of all, we want to bring to bear any kind of learning: imitation learning, supervised learning, reinforcement learning, all the different kinds of learning you can. And you also want to be smart about how you collect data: what data you collect, what processes you have in place to get the data that you need to improve the system. Then related to that, sometimes it's not just a matter of data anymore, it's a matter of, you need to re-architect your neural net. A lot of deep learning progress is made that way, where you come up with new architectures and the new architecture allows you to learn something that otherwise would maybe not be possible to learn. I mean, it's really all of those things brought together that are giving the results that we're seeing. So it's not really like any one that can be singled out as "this is the thing."

Also, it's just a really hard problem. If you look at the amount of AI research that was needed to make this work... We started with four people, and we have 40 people now. About half of us are AI researchers, we have some world-leading AI researchers, and I think that's what's made the difference. I mean, I know that's what's made the difference.

So it's not like you've developed some sort of crazy new technology or something?

There's no hardware trick. And we're not doing, I don't know, fuzzy logic or something else out of left field all of a sudden. It's really about the AI stuff that processes everything. Underneath it all is a gigantic neural network.

Okay, then how the heck are you actually making this work?

If you have an extremely uniquely qualified team and you've picked the right problem to work on, you can do something that is quite out there compared to what has otherwise been possible. In academic research, people write a paper, and everybody else catches up the moment the paper comes out. We've not been doing that. So far we haven't shared the details of what we actually did to make our system work, because right now we have a technology advantage. I think there will be a day when we will be sharing some of these things, but it's not going to be anytime soon.

It probably won't surprise you that Covariant has been able to lock down plenty of funding (US $27 million so far), but what's more interesting is some of the individual investors who are now involved with Covariant, which include Geoff Hinton, Fei-Fei Li, Yann LeCun, Raquel Urtasun, Anca Dragan, Michael I. Jordan, Vlad Mnih, Daniela Rus, Dawn Song, and Jeff Dean.

While we're expecting to see more deployments of Covariant's software in picking applications, it's also worth mentioning that their press release is much more general about how their AI could be used:

The Covariant Brain [is] universal AI for robots that can be applied to any use case or customer environment. Covariant robots learn general abilities such as robust 3D perception, physical affordances of objects, few-shot learning and real-time motion planning, which enables them to quickly learn to manipulate objects without being told what to do.

Today, [our] robots are all in logistics, but there is nothing in our architecture that limits it to logistics. In the future we look forward to further building out the Covariant Brain to power ever more robots in industrial-scale settings, including manufacturing, agriculture, hospitality, commercial kitchens and eventually, people's homes.

Fundamentally, Covariant is attempting to connect sensing with manipulation using a neural network in a way that can potentially be applied to almost anything. Logistics is the obvious first application, since the value there is huge, and even though the ability to generalize is important, there are still plenty of robot-friendly constraints on the task and the environment as well as safe and low-impact ways to fail. As to whether this technology will effectively translate into the kinds of semi-structured and unstructured environments that have historically posed such a challenge for general purpose manipulation (notably, people's homes), as much as we love speculating, it's probably too early even for that.

What we can say for certain is that Covariant's approach looks promising both in its present implementation and its future potential, and we're excited to see where they take it from here.
