3D object pose recognition framework for robot task based on RGB image

in the 14th Korea Robotics Society Annual Conference (KRoC 2019)


To bring collaborative robots into our everyday lives, the development of low-cost vision systems is essential. In this work, we propose a vision framework that recognizes the 6D pose of objects, i.e., their position and orientation, from RGB images. The system consists of three main components: ROI Extraction, Keypoint Regression, and 6D Pose Estimation.


A. RGB-based Object Recognition

As shown in the figure below, we identified the locations of objects, regressed keypoints, and applied the PnP algorithm to recover the 6D pose of the objects. To extract regions of interest, we used Mask R-CNN [1] to segment objects from RGB images. From the segmentation results, we estimated the 6D pose of the target based on keypoints [2]. Here, the target can be selected by the user through the graphical user interface (GUI) of the system. First, we masked the target object and regressed eight keypoints using a Stacked Hourglass Network (SHN) [3]. Then, we computed the 6D pose of the target by applying the PnP algorithm to those keypoints. With the estimated 6D pose of the target object, we overlay its 3D model on the image for qualitative evaluation of the algorithm.

An overview of the 3D object pose recognition framework based on RGB images.

B. Automatic Data Acquisition System

To train the Mask R-CNN mentioned in Section A, we developed our own data acquisition system to automatically collect a large amount of data. We used a 6-DOF manipulator, the IndyRP from Neuromeka, with a stereo camera attached to its end-effector. We predefined camera positions uniformly distributed over the hemisphere around the object, and varied the lighting conditions to make the vision algorithm robust. With the known poses of the object and the camera, we labeled the data automatically using the object's 3D model and the pinhole camera model. As a result, we collected 22,176 images to train Mask R-CNN.
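The automatic labeling amounts to projecting points of the object's 3D model into the image through the pinhole model, given the known object-to-camera transform. A minimal sketch, assuming a rotation R, translation t, and intrinsics K (names hypothetical):

```python
import numpy as np

def project_points(points_obj, R, t, K):
    """Project 3D model points (object frame) into pixel coordinates.

    points_obj: (N, 3) points on the object's 3D model
    R, t: object-to-camera rotation (3x3) and translation (3,)
    K: (3, 3) camera intrinsic matrix (pinhole model)
    Returns (N, 2) pixel coordinates.
    """
    points_cam = points_obj @ R.T + t  # transform into the camera frame
    uvw = points_cam @ K.T             # apply the pinhole intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide
```

Rendering the projected model points (or silhouette) at each predefined camera position yields the segmentation masks and keypoint labels without manual annotation.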


Table: The accuracy of 6D pose estimation.

The accuracy of our vision system is shown in the table above. Within an error margin of 5 cm and 5 degrees, we achieved over 60% accuracy on three objects. For pick-and-place tasks, we applied an error margin of 3 cm and 10 degrees, within which the system achieved over 90% accuracy on three objects. Within this margin, the robot is capable of picking up the object and placing it on the marker successfully, as shown in the figure below. The hand cream showed considerably lower accuracy because it is symmetric; handling objects that are symmetric or have few distinctive features remains future work.
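The error margins above compare estimated and ground-truth poses: translation error as a Euclidean distance and rotation error as the angle of the relative rotation. A minimal sketch of such a metric (our exact evaluation protocol may differ):

```python
import numpy as np

def pose_error(R_est, t_est, R_gt, t_gt):
    """Translation error (same units as t) and rotation error (degrees)."""
    t_err = np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))
    # angle of the relative rotation R_est * R_gt^T, via its trace
    cos_angle = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return t_err, r_err

def within_margin(R_est, t_est, R_gt, t_gt, t_max=0.05, r_max=5.0):
    """True if the pose estimate falls inside the given error margin
    (defaults: 5 cm / 5 degrees)."""
    t_err, r_err = pose_error(R_est, t_est, R_gt, t_gt)
    return t_err <= t_max and r_err <= r_max
```

Note that this metric penalizes poses that are visually equivalent for a symmetric object, which is consistent with the low accuracy observed for the hand cream.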

IndyRP performing the pick-and-place task. It detects the 6D pose of the target, picks it up with the end-effector, and places it on the marker.


[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980-2988.

[2] G. Pavlakos et al., "6-DoF object pose from semantic keypoints," in Proc. IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 2011-2018.

[3] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. European Conference on Computer Vision (ECCV), 2016, pp. 483-499.