Teaching robots to see with Unity

March 2, 2021 Sarah Wolf, Jonathan Leban, Amanda Trang, Jacob Platin, Sujoy Ganguly

The world of robotics is full of unknowns! From sensor noise to the exact positioning of important objects, robots have a critical need to understand the world around them to perform accurately and robustly. We previously demonstrated a pick-and-place task in Unity using the Niryo One robot to pick up a cube with a known position and orientation. This solution would not be very robust in the real world, as precise object locations are rarely known a priori. In our new Object Pose Estimation Demo, we show you how to use the Unity Computer Vision Perception Package to collect data and train a deep learning model to predict the pose of a given object. We then show you how to integrate the trained model with a virtual UR3 robotic arm in Unity to simulate the complete pick-and-place system on objects with unknown and arbitrary poses.

Robots in the real world often operate in dynamic environments and must adapt to them. Such applications often require robots to perceive relevant objects and interact with them. An important aspect of perceiving and interacting with objects is understanding their position and orientation relative to some coordinate system, also referred to as their “pose.” Early pose-estimation approaches often relied on classical computer vision techniques and custom fiducial markers. These solutions are designed to operate in specific environments, but often fail when those environments change or diverge from what was expected. The gaps left by the limitations of traditional computer vision are being addressed by promising new deep learning techniques. These new methods create models that can predict the correct output for a given input by learning from many examples.

This project uses images and ground-truth pose labels to train a model to predict the object’s pose. At run time, the trained model can predict an object’s pose from an image it has never seen before. Usually, tens of thousands of images or more need to be collected and labeled for a deep learning model to perform well. Real-world collection of this data is tedious, expensive, and, in some cases like 3D object localization, inherently difficult. Even when this data can be collected and labeled, the resulting labels can turn out to be biased and error-prone. So how do you apply powerful machine learning approaches to your problem when the data you want is out of reach or doesn’t actually exist for your application yet?

Unity Computer Vision allows you to generate synthetic data as an efficient and effective solution for your machine learning data requirements. This example shows how we generated automatically labeled data in Unity to train a machine learning model. This model is then deployed in Unity on a simulated UR3 robotic arm using the Robot Operating System (ROS) to enable pick-and-place with a cube that has an unknown pose.

Generating Synthetic Data

Close Up of Randomly Generated Poses and Environment Lighting

Simulators like Unity are powerful tools for addressing challenges in data collection by generating synthetic data. Using Unity Computer Vision, large amounts of perfectly labeled and varied data can be collected with minimal effort, as we have shown previously. For this project, we collect many example images of the cube in various poses and lighting conditions. This method of randomizing aspects of the scene is called domain randomization [1]. More varied data usually leads to a more robust deep learning model.
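
To make the idea concrete, here is a minimal sketch of domain randomization in plain Python. It is not the Unity Computer Vision API: in the demo, the randomization happens inside Unity. The function names, the table bounds, and the lighting ranges below are all assumptions chosen for illustration.

```python
import math
import random


def random_unit_quaternion(rng):
    """Sample a uniformly distributed rotation as (qx, qy, qz, qw)."""
    # Shoemake's method: three uniform samples map to a uniform unit quaternion.
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    return (
        a * math.sin(2.0 * math.pi * u2),
        a * math.cos(2.0 * math.pi * u2),
        b * math.sin(2.0 * math.pi * u3),
        b * math.cos(2.0 * math.pi * u3),
    )


def sample_scene(rng):
    """Draw one randomized scene: a cube pose plus lighting settings."""
    return {
        # Hypothetical table bounds in meters; height is held fixed here.
        "position": (rng.uniform(-0.3, 0.3), 0.0, rng.uniform(-0.3, 0.3)),
        "rotation": random_unit_quaternion(rng),
        # Hypothetical lighting ranges.
        "light_intensity": rng.uniform(0.5, 2.0),
        "light_angle_deg": rng.uniform(0.0, 90.0),
    }


rng = random.Random(0)  # fixed seed so the sampled scenes are reproducible
scenes = [sample_scene(rng) for _ in range(5)]
```

Each call to `sample_scene` stands in for one captured frame: the camera, table, and robot would stay where they are while the sampled values change.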

To collect data with the cube in various poses in the real world, we would have to manually move the cube and take a picture each time. Our model used over 30,000 images to train, so even if we could do this in just 5 seconds per image, it would take us over 40 hours to collect this data! And that time doesn’t include the labeling that needs to happen. Using Unity Computer Vision, we can generate 30,000 training images and another 3,000 validation images with corresponding labels in just minutes! The camera, table, and robot positions are fixed in this example, while the lighting and the cube’s pose vary randomly in each captured frame. The labels are saved to a corresponding JSON file where the pose is described by a 3D position (x, y, z) and a quaternion orientation (qx, qy, qz, qw). While this example only varies the cube pose and environment lighting, Unity Computer Vision allows you to easily add randomization to various aspects of the scene. To perform pose estimation, we use a supervised machine learning technique to analyze the data and generate a trained model.
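
As a rough illustration of what such a label might look like once loaded, the sketch below parses a single record in Python. The field names and file name are hypothetical and the real output schema differs in detail; only the position (x, y, z) and quaternion (qx, qy, qz, qw) components come from the description above.

```python
import json

# Hypothetical label record, written by hand for illustration only.
label_json = """
{
  "image": "rgb_0001.png",
  "position": {"x": 0.12, "y": 0.70, "z": -0.05},
  "orientation": {"qx": 0.0, "qy": 0.7071068, "qz": 0.0, "qw": 0.7071068}
}
"""


def parse_pose(record):
    """Return (position, quaternion) tuples from one label record."""
    p, q = record["position"], record["orientation"]
    position = (p["x"], p["y"], p["z"])
    quaternion = (q["qx"], q["qy"], q["qz"], q["qw"])
    return position, quaternion


position, quaternion = parse_pose(json.loads(label_json))

# Back-of-the-envelope cost of manual collection from the text:
# 30,000 images at 5 seconds each, converted to hours (about 41.7).
manual_hours = 30_000 * 5 / 3600
```

The last line simply checks the arithmetic behind the "over 40 hours" estimate, before any labeling time is added.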

Using Deep Learning to Predict Pose

Deep Learning Model Architecture for Pose Estimation

In supervised learning, a model learns to predict a specific outcome by training on a set of inputs and corresponding outputs: images and pose labels, in our case. A few years ago, a team of researchers presented a convolutional neural network (CNN) that could predict the position of an object [2]. Since we are interested in the full 3D pose of our cube, we extended this work to include the cube’s orientation in the network’s output. To train the model, we minimize the least squared error, or L2 distance, between the predicted pose and the ground-truth pose. After training, the model predicted the cube’s location within 1 cm and its orientation within 2.8 degrees (0.05 radians). Now let’s see if this is accurate enough for our robot to successfully perform the pick-and-place task!
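
The sketch below shows one way to write such a loss, along with the two error measures quoted above, in plain Python. It is an illustration under assumptions rather than the demo's training code: the equal weighting of the position and orientation terms, and the function names, are our own choices here.

```python
import math


def pose_l2_loss(pred_pos, pred_quat, true_pos, true_quat):
    """Sum of squared errors over position and quaternion components.

    Assumption: both terms are weighted equally.
    """
    pos_term = sum((p - t) ** 2 for p, t in zip(pred_pos, true_pos))
    quat_term = sum((p - t) ** 2 for p, t in zip(pred_quat, true_quat))
    return pos_term + quat_term


def position_error(pred_pos, true_pos):
    """Euclidean distance between predicted and true positions."""
    return math.dist(pred_pos, true_pos)


def orientation_error_deg(pred_quat, true_quat):
    """Angle in degrees of the rotation separating two quaternions."""
    def unit(q):
        norm = math.sqrt(sum(c * c for c in q))
        return tuple(c / norm for c in q)

    a, b = unit(pred_quat), unit(true_quat)
    # abs() because q and -q describe the same rotation.
    dot = abs(sum(x * y for x, y in zip(a, b)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


# Example: a prediction 1 cm off in x and rotated 2.8 degrees about y.
true_pos, true_quat = (0.10, 0.70, 0.00), (0.0, 0.0, 0.0, 1.0)
half = math.radians(2.8) / 2.0
pred_pos, pred_quat = (0.11, 0.70, 0.00), (0.0, math.sin(half), 0.0, math.cos(half))

loss = pose_l2_loss(pred_pos, pred_quat, true_pos, true_quat)
pos_err = position_error(pred_pos, true_pos)
rot_err = orientation_error_deg(pred_quat, true_quat)
```

Taking the absolute value of the quaternion dot product matters because a quaternion and its negation represent the same orientation, so the angular error should not depend on the sign the network happens to output.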

Motion Planning in ROS

Pose Estimation Workflow

The robot we are using in this project is a UR3 robotic arm with a Robotiq 2F-140 gripper, which was brought into our Unity scene using the Unity Robotics URDF Importer package. The Unity Robotics ROS-TCP Connector package handles communication, while the ROS MoveIt package handles motion planning and control.

Now that we can accurately predict the pose of the cube with our deep learning model, we can use this predicted pose as the target pose in our pick-and-place task. Recall that in our previous Pick-and-Place Demo, we relied on the ground-truth pose of the target object. The difference here is that the robot performs the pick-and-place task with no prior knowledge of the cube’s pose and only gets a predicted pose from the deep learning model. The process has 4 steps:

  1. An image with the target cube is captured by Unity
  2. The image is passed to a trained deep learning model, which outputs a predicted pose
  3. The predicted pose is sent to the MoveIt motion planner
  4. ROS returns a trajectory to Unity for the robot to execute in an attempt to pick up the cube
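
The four steps above can be sketched as a small Python function that wires the stages together. This is only a schematic: the stage functions are hypothetical stand-ins passed in as arguments, and none of them are calls from the ROS-TCP Connector, MoveIt, or Unity APIs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Pose:
    position: Tuple[float, float, float]
    orientation: Tuple[float, float, float, float]  # (qx, qy, qz, qw)


def run_pick_attempt(capture_image, estimate_pose, plan_trajectory, execute):
    """Run one pick attempt through the four stages, in order."""
    image = capture_image()               # 1. Unity captures an image
    pose = estimate_pose(image)           # 2. the model predicts a pose
    trajectory = plan_trajectory(pose)    # 3. the planner receives the pose
    return execute(trajectory)            # 4. the robot executes the plan


# Stub stages (all hypothetical) that record the order in which they run.
calls = []


def fake_capture():
    calls.append("capture")
    return b"fake-image-bytes"


def fake_estimate(image):
    calls.append("estimate")
    return Pose((0.1, 0.7, 0.0), (0.0, 0.0, 0.0, 1.0))


def fake_plan(pose):
    calls.append("plan")
    return [pose.position]  # a one-waypoint stand-in for a trajectory


def fake_execute(trajectory):
    calls.append("execute")
    return len(trajectory) > 0


succeeded = run_pick_attempt(fake_capture, fake_estimate, fake_plan, fake_execute)
```

Passing the stages in as functions keeps the flow easy to test with stubs like these, and the same skeleton would apply if a stage were later swapped for a real camera or a real robot.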

Each iteration of the task sees the cube moved to a random location. Although we know the cube’s pose in simulation, we will not have the benefit of this information in the real world. Thus, to lay the groundwork for transferring this project to a real robot, we need to determine the cube’s pose from sensory data alone. Our pose estimation model makes this possible and, in our simulation testing, we can reliably pick up the cube 89% of the time in Unity!

Conclusion

Our Object Pose Estimation Demo shows how Unity gives you the capability to generate synthetic data, train a deep learning model, and use ROS to control a simulated robot to solve a problem. We used the Unity Computer Vision tools to create synthetic, labeled training data and trained a simple deep learning model to predict a cube’s pose. The demo provides a tutorial walking you through how to recreate this project, which you can expand by applying more randomizers to create more complex scenes. We used the Unity Robotics tools to communicate with a ROS inference node that uses the trained model to predict a cube’s pose. These tools and others open the door for you to explore, test, develop, and deploy solutions locally. When you are ready to scale your solution, Unity Simulation saves both time and money compared to local systems.

And did you know that both Unity Computer Vision and Unity Robotics tools are free to use!? Head over to the Object Pose Estimation Demo to get started using them today!

Keep Creating

Now that we can pick up objects with an unknown pose, imagine how else you could expand this! What if there are obstacles in the way? Or multiple objects in the scene? Think about how you might handle this, and keep an eye out for our next post!

Can’t wait until our next post!? Sign up to get email updates about our work in robotics or computer vision.

You can also find more robotics projects on our Unity Robotics GitHub.

For more computer vision projects, visit our Unity Computer Vision page.

Our team would love to hear from you if you have any questions, feedback, or suggestions! Please reach out to unity-robotics@unity3d.com.

Citations

  1. J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World,” arXiv preprint arXiv:1703.06907, 2017.
  2. J. Tobin, W. Zaremba, and P. Abbeel, “Domain Randomization and Generative Models for Robotic Grasping,” arXiv preprint arXiv:1710.06425, 2017.