Perception Subsystem Progress

The perception subsystem had two primary objectives: implementing an obstacle detection model for navigation and developing a seedling detection model for manipulation.

May 1 2026

Pot Detection Pipeline

The pot detection pipeline converts RGB-D camera input into a stable 3D pot position in the robot arm frame.

For each incoming frame, YOLO11s is run on the latest RGB image, and a confidence threshold of 0.7 is applied to discard low-confidence detections. For every remaining bounding box, the center pixel coordinate (u, v) is extracted and combined with the aligned depth image to recover the corresponding 3D position (x, y, z) in the camera frame.
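The thresholding and back-projection step above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes an ideal pinhole camera without lens distortion, a depth image already aligned to the RGB image and expressed in meters, and hypothetical intrinsics fx, fy, cx, cy. The names Detection, box_center, deproject, and detections_to_points are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

CONF_THRESHOLD = 0.7  # confidence threshold stated in the text


@dataclass
class Detection:
    """Hypothetical container for one bounding box in pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    conf: float


def box_center(det: Detection) -> Tuple[int, int]:
    """Return the center pixel (u, v) of a bounding box."""
    u = int((det.x_min + det.x_max) // 2)
    v = int((det.y_min + det.y_max) // 2)
    return u, v


def deproject(u: int, v: int, depth_m: float,
              fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole back-projection of pixel (u, v) at depth z to camera-frame (x, y, z)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m


def detections_to_points(dets: Sequence[Detection],
                         depth_image: Sequence[Sequence[float]],
                         fx: float, fy: float, cx: float, cy: float
                         ) -> List[Tuple[float, float, float]]:
    """Drop low-confidence boxes and convert the rest to 3D camera-frame points."""
    points = []
    for det in dets:
        if det.conf < CONF_THRESHOLD:
            continue
        u, v = box_center(det)
        z = depth_image[v][u]  # images are indexed [row][column] = [v][u]
        if z <= 0:
            continue  # assumption: non-positive depth marks an invalid reading
        points.append(deproject(u, v, z, fx, fy, cx, cy))
    return points
```

In practice the intrinsics would come from the camera's calibration, and a robust variant might sample a small patch around (u, v) rather than a single depth pixel.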

When multiple pots are present in a single frame, the pipeline selects the candidate with the smallest depth value (that is, the pot closest to the camera) and passes it through a rolling average filter with a window size of 5 to smooth out noisy detections. Once a stable estimate is obtained, the 3D point is transformed into the xArm frame and published to a ROS topic, where the downstream manipulation node consumes it to compute the grasp pose. The grasp orientation is fixed and always points toward the pot.
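A minimal sketch of the selection, filtering, and frame-transform steps is shown below. It makes two assumptions that the text does not state: the estimate is treated as stable once the window of 5 samples is full, and the camera-to-arm transform is available as a 4x4 homogeneous matrix (for example from a hand-eye calibration). The names closest_pot, RollingAverage, and transform_point are hypothetical, and ROS publishing is omitted.

```python
from collections import deque
from typing import Deque, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def closest_pot(points: Sequence[Point]) -> Optional[Point]:
    """Pick the candidate with the smallest depth (z), or None if there are none."""
    return min(points, key=lambda p: p[2]) if points else None


class RollingAverage:
    """Mean of the last `window` 3D points."""

    def __init__(self, window: int = 5):
        self.buf: Deque[Point] = deque(maxlen=window)

    def update(self, p: Point) -> Optional[Point]:
        """Add a sample; return the mean once the window is full, else None."""
        self.buf.append(p)
        if len(self.buf) < self.buf.maxlen:
            return None  # assumption: not yet considered stable
        n = len(self.buf)
        return tuple(sum(q[i] for q in self.buf) / n for i in range(3))


def transform_point(T: Sequence[Sequence[float]], p: Point) -> Point:
    """Apply a 4x4 homogeneous transform (row-major nested lists) to a 3D point."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))
```

A typical loop would call closest_pot on each frame's points, feed the result to update, and apply transform_point only when update returns a value.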

Example Pot Detection Result

The pot detection model was trained on a custom dataset of 961 images, split into training, validation, and test sets at a 7:2:1 ratio. To improve robustness, the dataset spans both indoor and outdoor scenes captured under diverse lighting conditions. Hard negative samples (images without pots) were also added to the training set to reduce false positives in deployment.

The trained model is deployed on the Jetson Orin Nano in TensorRT format, where it achieves an inference latency of approximately 25 ms per frame.

The figure below shows an example pot detection result on the test set.

April 3 2026

Dataset Curation

The data collection process involved recording RGB and depth images in a ROS 2 bag. I collected pot data in both indoor and outdoor environments across three data collection sessions. In total, the dataset contains 962 images.

Table 1 Distribution of indoor and outdoor data

Indoor, Highbay    Outdoor
204                758

Data Auto-labeling

I used the auto-labeling functionality in Roboflow to annotate bounding boxes around the target pots. Specifically, Roboflow uses a segmentation model, such as Segment Anything (SAM), to first generate a segmentation mask of the object and then derive a tight bounding box from the mask. This approach produced accurate annotations with minimal manual effort.

After labeling, I exported the dataset and split it into training, validation, and test sets at a ratio of 7:2:1.

Table 2 Distribution of training, validation, and test data

Training    Validation    Testing
672         192           98
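A 7:2:1 split like the one above can be sketched as follows. This is only an illustration: the split may well have been done with a dataset tool rather than custom code, and the function name split_dataset, the fixed seed, and the rounding rule are assumptions. Because of rounding, the counts it produces can differ from Table 2 by an image or so.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence[str],
                  ratios: Tuple[int, int, int] = (7, 2, 1),
                  seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items reproducibly and split them into train/val/test by ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    total = sum(ratios)
    n = len(shuffled)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

Fixing the seed keeps the split reproducible, so that models from different training runs are evaluated on the same test images.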

YOLO Model Training – Hyperparameter Selection

Finally, I selected and tuned the appropriate hyperparameters for training. When selecting hyperparameters, I considered the specific characteristics of the use case. The target pots exhibit consistent color and shape, and the camera mounted on the robot arm is typically maintained in an upright orientation. As a result, extensive data augmentation—such as large color jittering, geometric distortion, or aggressive rotation—was deemed unnecessary. These augmentation parameters were therefore set to zero or relatively small values.

However, to improve robustness under varying environmental conditions, I incorporated augmentation strategies that simulate different lighting scenarios. This helps the model generalize better to changes in illumination that may occur during real-world deployment.
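The augmentation choices described above could be expressed as training arguments along the following lines. This sketch assumes training with the Ultralytics package, whose train settings include hsv_h, hsv_s, hsv_v, degrees, translate, scale, shear, perspective, and flipud. The numeric values are illustrative assumptions, not the values used in this project, and the helper build_train_kwargs and the file name pots.yaml are hypothetical.

```python
from typing import Any, Dict

# Illustrative values only (assumptions, not the project's actual settings).
AUGMENTATION: Dict[str, float] = {
    "hsv_h": 0.0,        # hue jitter off: pot color is consistent
    "hsv_s": 0.1,        # small saturation jitter
    "hsv_v": 0.4,        # brightness jitter kept to mimic lighting changes
    "degrees": 0.0,      # no rotation: the camera stays upright
    "shear": 0.0,        # no geometric distortion
    "perspective": 0.0,
    "flipud": 0.0,       # no vertical flips
    "translate": 0.05,   # small shifts only
    "scale": 0.2,        # mild scale variation
}


def build_train_kwargs(data_yaml: str, epochs: int) -> Dict[str, Any]:
    """Merge dataset and schedule settings with the augmentation settings."""
    kwargs: Dict[str, Any] = {"data": data_yaml, "epochs": epochs}
    kwargs.update(AUGMENTATION)
    return kwargs


# Hypothetical usage, assuming the ultralytics package is installed:
#   from ultralytics import YOLO
#   YOLO("yolo11n.pt").train(**build_train_kwargs("pots.yaml", 100))
```

Keeping the augmentation settings in one dictionary makes it easy to compare training iterations that differ only in augmentation.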

YOLO Model Training – Result Analysis

I trained three versions of the model. The first model was trained using YOLO11n. During evaluation, I observed that the model had learned a bad feature: it tended to classify any rectangular object with a uniform color as a pot. As a result, the model performed poorly in indoor environments, where any black drawer in the background would be identified as a pot with a high confidence score. Although the primary deployment scenario is outdoors, testing with the manipulation subsystem is initially conducted indoors, so the detection algorithm must also operate reliably in indoor settings.

One likely cause is that the training set did not include any negative samples (i.e., images without pots). This may have caused the model to overfit to simple visual patterns that appeared frequently in the dataset.

To address this issue, in the next training iteration:

  • I expanded the dataset by adding hard negative samples as well as additional indoor and outdoor images. This was intended to improve the model’s ability to distinguish pots from visually similar objects and increase its robustness across different environments.
  • I also disabled color augmentation on the training images, so that the model can learn the brown cardboard color of the pots.

The current model addresses these issues. We will test it together with the manipulation subsystem to identify directions for further fine-tuning.

Fig. Example indoor pot detection result

Fig. Example outdoor pot detection result

March 2026

Seedling Detection -> Pot Detection

During this phase, we transitioned from using artificial samples to working with real seedlings, which will also be used in the System Validation Demo (SVD). After receiving the real seedlings, we identified that grasping the stem is not feasible. For small seedlings, the stems are fragile and cannot reliably support lifting. Additionally, lifting by the stem does not detach the plastic pot. To address this issue, we decided to replant the seedlings into biodegradable containers and shift the grasping target from the stem to the pot. The figure below shows some pot choices.

Figure 1 The real tree seedlings we plan to use for SVD. The left two seedlings are planted in biodegradable pots; the right one is planted in a makeshift pot.

Obstacle Detection

We selected Patchwork++ as the ground segmentation method. This method is well-suited for LiDAR-based perception and provides reliable separation of ground and non-ground points. An example ground segmentation result is shown in the figure below.

Figure 2 Example ground-segmentation result. The green dots are the ground points. The red dots are the non-ground points.

February 27 2026

Creating training and testing dataset for seedling detection model

We have started constructing a dataset for training and testing a YOLO-based seedling detection model. The dataset comprises three categories of images: (1) images of artificial seedlings captured indoors, (2) images of artificial seedlings captured outdoors in Schenley Park, and (3) images obtained from a publicly available online dataset on Roboflow. All raw images were preprocessed through cleaning, labeling, and data augmentation.

February 13 2026

Secured components and conducted initial setup

A detailed work schedule with internal milestones to guide development has been created. After discussing outdoor SLAM LiDAR requirements with Neha, we selected the Velodyne VLP-16 (High-Resolution version) LiDAR for obstacle detection and the Intel RealSense D405 camera for tree seedling detection. We obtained the camera, LiDAR, data cables, and power supplies from the MRSD Lab and the Kantor Lab, and set up the ROS 2 software stack to collect sensor data.

Seedling detection task

We identified YOLO as a suitable model for seedling detection and set up a training pipeline on my personal laptop. Our mentor from the Kantor Lab, Francisco, had also explored the use of YOLOv8 for seedling detection. We obtained the training, validation, and test datasets he created, which reduces the effort required to build a dataset from scratch. Next, we will build up our own training and test sets and begin training the seedling detection model.