
System Design Description

The image below shows our robot hardware in a test-environment setup. The overall system is built around the Stretch RE1 platform as the primary robot. The Stretch platform comes equipped with many sensors, including an Intel RealSense D435i RGB-D camera, torque sensors, a 9-DOF IMU, an RP-Lidar A1, and a ReSpeaker V2 microphone array in the head. These sensors, combined with the dexterous mobile base and manipulator, support the development of various subsystems. For HRI capabilities, an external microphone is mounted on the base of the robot, and an iPad is mounted on the head for displaying the robot’s face. Our overall software architecture comprises several subsystems operating in tandem, coordinated by four layers of Finite State Machines (FSMs):

  • Level 1: A high-level mission planner coordinates the global status of the tasks being executed.
  • Level 2: Each subsystem has its own local FSM that executes its sub-tasks.
  • Level 3: The manipulation subsystem has several complex algorithms that require their own FSMs, which execute within the Level 2 FSMs.
  • Level 4: All subsystems are orchestrated by action servers, which are based on a low-level FSM.

Fig. 1: Robot hardware in a test environment

The following subsections briefly describe the major subsystems.

Mission Planner

The goal of the Mission Planner is to integrate the three primary subsystems of our robot: HRI, Navigation, and Manipulation. It acts as a watchdog that observes the state of the robot, stores useful data, and triggers actions through a high-level Finite State Machine (FSM). We use a single script to launch the entire robot in a coordinated manner. When launched, our custom software automatically calibrates the robot, runs system checks to confirm that all sensors are operating nominally, and verifies that all ROS nodes are running before the robot is ready to receive commands. Additionally, all perception-related computation is performed on a remote GPU device, called the Brain. For low-level robot control, we developed our own wrapper around the Stretch RE1’s low-level controller, called the Alfred Driver.
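As an illustration, the sketch below shows how a Level-1 mission FSM of this kind could be structured with the ROS smach library. The use of smach, the state names, and the outcomes are assumptions made for the example; the actual mission planner may be organized differently.

```python
# Hypothetical sketch of a Level-1 mission FSM using ROS smach.
# State names and outcomes are illustrative, not the actual implementation.
import smach


class WaitForCommand(smach.State):
    def __init__(self):
        smach.State.__init__(self, outcomes=['pick_place_requested'])

    def execute(self, userdata):
        # In the real system this would block on the HRI subsystem's parsed command.
        return 'pick_place_requested'


class Navigate(smach.State):
    def __init__(self):
        smach.State.__init__(self, outcomes=['reached', 'failed'])

    def execute(self, userdata):
        # Would call the navigation subsystem's action server here.
        return 'reached'


class Manipulate(smach.State):
    def __init__(self):
        smach.State.__init__(self, outcomes=['done', 'failed'])

    def execute(self, userdata):
        # Would trigger the manipulation subsystem's own (Level 2/3) FSMs.
        return 'done'


def build_mission_fsm():
    sm = smach.StateMachine(outcomes=['mission_complete', 'mission_failed'])
    with sm:
        smach.StateMachine.add('WAIT_FOR_COMMAND', WaitForCommand(),
                               transitions={'pick_place_requested': 'NAVIGATE'})
        smach.StateMachine.add('NAVIGATE', Navigate(),
                               transitions={'reached': 'MANIPULATE',
                                            'failed': 'mission_failed'})
        smach.StateMachine.add('MANIPULATE', Manipulate(),
                               transitions={'done': 'mission_complete',
                                            'failed': 'mission_failed'})
    return sm


if __name__ == '__main__':
    outcome = build_mission_fsm().execute()
```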


Human-Robot Interaction

The HRI subsystem has four major tasks:

  • Trigger word detection
  • Speech-to-text transcription
  • Basic social interaction
  • Telepresence capability

We use Picovoice for trigger-word detection and Google Cloud’s Speech-to-Text API for parsing speech. Additionally, we integrated ChatGPT into the HRI subsystem to provide basic social engagement capabilities. Whenever the user says the trigger word, “Hey, Alfred!”, the speech-to-text API is triggered, the subsequent speech is parsed, and a set of potential robot actions is generated. If the task is identified as a pick-and-place task, the mission planner launches the sequence of actions needed to execute it; otherwise, ChatGPT generates a response and the robot interacts socially with the user. Finally, we use Google Cloud’s Neural2 text-to-speech voices to generate life-like voice responses as feedback to the user when a request is registered.
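The snippet below is a minimal sketch of this trigger-word-to-transcription flow, using Picovoice’s Porcupine engine and Google Cloud Speech-to-Text. The access key, the custom “Hey, Alfred” keyword file, and the recording length are placeholders; the actual integration in our HRI stack differs in its details.

```python
# Sketch: wait for the "Hey, Alfred!" trigger word, then transcribe the next utterance.
# ACCESS_KEY and the .ppn keyword file are placeholders.
import struct

import pvporcupine
import pyaudio
from google.cloud import speech

porcupine = pvporcupine.create(
    access_key='YOUR_PICOVOICE_ACCESS_KEY',
    keyword_paths=['hey_alfred.ppn'])  # custom keyword file (placeholder)

pa = pyaudio.PyAudio()
stream = pa.open(rate=porcupine.sample_rate, channels=1, format=pyaudio.paInt16,
                 input=True, frames_per_buffer=porcupine.frame_length)

stt_client = speech.SpeechClient()
stt_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=porcupine.sample_rate,
    language_code='en-US')


def record_utterance(seconds=5):
    """Grab a fixed-length chunk of raw audio after the trigger word fires."""
    frames = []
    for _ in range(int(porcupine.sample_rate / porcupine.frame_length * seconds)):
        frames.append(stream.read(porcupine.frame_length, exception_on_overflow=False))
    return b''.join(frames)


while True:
    pcm = stream.read(porcupine.frame_length, exception_on_overflow=False)
    pcm = struct.unpack_from('h' * porcupine.frame_length, pcm)
    if porcupine.process(pcm) >= 0:  # trigger word detected
        audio = speech.RecognitionAudio(content=record_utterance())
        response = stt_client.recognize(config=stt_config, audio=audio)
        for result in response.results:
            command_text = result.alternatives[0].transcript
            print('Heard command:', command_text)
            # command_text would then be parsed into robot actions or sent to ChatGPT.
```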

The HRI subsystem also provides teleoperation capabilities through a handheld interface and a display screen (iPad) mounted on the robot. We implement video calling using the Agora API and use the Firebase cloud service for wireless communication.


Navigation

The goal of the navigation subsystem is to move the robot through the operating environment from an initial location to the desired location of the object, and back to the initial location. At a high level, the navigation subsystem performs mapping, localization, planning, control, and actuation. Navigation requires a 2D map of the environment, which was generated by running SLAM with the gmapping package in ROS. For localization, we use the Adaptive Monte Carlo Localization algorithm implemented in the amcl package in ROS Noetic, fusing the robot’s odometry and LiDAR data to obtain a refined pose estimate. Motion planning was implemented using the ROS move_base library: we use Dijkstra’s algorithm (Navfn ROS) for global path planning and the Dynamic Window Approach (DWA) for local planning and control. Additionally, we developed a custom recovery plugin that commands the robot to move back by a set distance and replan a path when the default Rotate Recovery and Moveback Recovery behaviours fail.
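As an illustrative example, the snippet below shows how a navigation goal can be sent to move_base from Python using actionlib. The goal coordinates are placeholders, and the surrounding Alfred-specific logic (pose lookup, retries, recovery handling) is omitted.

```python
# Sketch: send a single navigation goal to move_base in the map frame.
# The target coordinates are placeholders.
import rospy
import actionlib
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

rospy.init_node('send_nav_goal')

client = actionlib.SimpleActionClient('move_base', MoveBaseAction)
client.wait_for_server()

goal = MoveBaseGoal()
goal.target_pose.header.frame_id = 'map'
goal.target_pose.header.stamp = rospy.Time.now()
goal.target_pose.pose.position.x = 2.0      # placeholder x (metres)
goal.target_pose.pose.position.y = 1.0      # placeholder y (metres)
goal.target_pose.pose.orientation.w = 1.0   # face along +x

client.send_goal(goal)
client.wait_for_result()
rospy.loginfo('Navigation result state: %d', client.get_state())
```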


Manipulation

The manipulation subsystem is triggered once navigation is complete; its goal is to pick the desired object and place it at the desired destination. The subsystem executes the following sequence of actions.

  • Visual Servoing: The visual servoing algorithm detects the desired object in the camera frame and automatically aligns the robot to a graspable position. Broadly, the head camera assembly scans the environment and finds the object by running a fine-tuned YOLOv8 object detector. Once an object is found, we filter out improbable instances of the object based on distance. Finally, we run a set of maneuvers to align the robot to an appropriate grasping position and orientation based on the maximum graspable distance (see the detection-and-filtering sketch after this list).
  • Grasp Generation: In this step, we generate grasp poses using GraspNet, a deep learning model trained on a large database of point clouds of various household objects. We rank the grasps using a set of heuristics, and if the grasps seem infeasible we revert to the default behaviour of median-centered grasping, where we find the median of the object point cloud and use it as the grasp center.
  • Plane Detection and Placement: We use a RANSAC-based plane detection algorithm to identify the table plane while picking the object, which lets us estimate the height at which the object is grasped above the table (see the plane-segmentation sketch after this list). Finally, we use a contact-sensing-based placement scheme that automatically senses when the object has been placed on the table and releases it from the end-effector.
  • Grasp Success Validation: To verify whether a grasp succeeded, we additionally trained a logistic regression model that classifies ‘success’ or ‘fail’ based on the gripping effort. We collected a small dataset of diverse objects and gripping configurations and trained the model on it (see the classifier sketch after this list).
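The sketch below illustrates the detection and distance-filtering step of visual servoing, assuming the Ultralytics YOLOv8 API and an aligned RGB-D frame from the D435i. The weights path, target class name, and distance threshold are placeholders rather than our actual parameters.

```python
# Sketch: detect the requested object with a fine-tuned YOLOv8 model and
# discard detections that are too far away to be graspable.
# Weights path, class name, and MAX_GRASP_DIST are placeholders.
import numpy as np
from ultralytics import YOLO

MAX_GRASP_DIST = 0.9  # metres (placeholder maximum graspable distance)

model = YOLO('alfred_yolov8_finetuned.pt')  # placeholder weights file


def find_graspable_object(rgb_image, depth_image, target_class):
    """Return (x1, y1, x2, y2, distance) of the nearest valid detection, or None."""
    candidates = []
    for result in model(rgb_image):
        for box in result.boxes:
            if model.names[int(box.cls)] != target_class:
                continue
            x1, y1, x2, y2 = map(int, box.xyxy[0])
            # Median depth inside the box gives a robust distance estimate.
            patch = depth_image[y1:y2, x1:x2]
            valid = patch[patch > 0]
            if valid.size == 0:
                continue
            dist = float(np.median(valid)) / 1000.0  # assuming depth in mm -> m
            if dist <= MAX_GRASP_DIST:
                candidates.append((x1, y1, x2, y2, dist))
    return min(candidates, key=lambda c: c[-1]) if candidates else None
```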
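Next, a minimal sketch of the table-plane detection step, here using Open3D’s RANSAC plane segmentation as a stand-in for our implementation; the distance threshold and iteration count are illustrative defaults, not our tuned values.

```python
# Sketch: fit the dominant (table) plane in a scene point cloud with RANSAC
# and estimate the grasp height above that plane.
import numpy as np
import open3d as o3d


def table_plane_and_height(points_xyz, grasp_point):
    """points_xyz: (N, 3) array in the robot/camera frame; grasp_point: (3,) array."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points_xyz)

    # RANSAC plane fit: ax + by + cz + d = 0
    (a, b, c, d), inliers = pcd.segment_plane(distance_threshold=0.01,
                                              ransac_n=3,
                                              num_iterations=1000)

    # Signed distance of the grasp point from the table plane ~ grasp height.
    normal = np.array([a, b, c])
    height = abs(normal.dot(grasp_point) + d) / np.linalg.norm(normal)
    return (a, b, c, d), height
```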
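Finally, a sketch of the grasp-success classifier: a logistic regression over gripper-effort features, shown here with scikit-learn. The feature choice (mean and peak gripper effort) and the tiny inline arrays are purely illustrative stand-ins; the real model was trained on the collected dataset described above.

```python
# Sketch: logistic regression that labels a grasp as success (1) or fail (0)
# from gripper-effort features. The inline arrays are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative features: [mean gripper effort, peak gripper effort]
X_train = np.array([[0.12, 0.20], [0.45, 0.60], [0.50, 0.72], [0.08, 0.15]])
y_train = np.array([0, 1, 1, 0])  # 1 = successful grasp, 0 = failed grasp

clf = LogisticRegression()
clf.fit(X_train, y_train)


def grasp_succeeded(effort_trace):
    """effort_trace: 1-D array of gripper effort samples recorded during the grasp."""
    features = np.array([[np.mean(effort_trace), np.max(effort_trace)]])
    return bool(clf.predict(features)[0])
```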