Hardware
Low-Level Sensor Integration
Thermal Cameras: The drone uses two FLIR Boson cameras with 640×512 resolution. We wrote a custom V4L2 driver for fast frame capture and efficient memory management. The driver can also set each camera to disabled, master, or slave mode, which changes the cameras' timing characteristics. We have also developed a thermal preprocessing pipeline that handles sensor noise through flat-field correction (FFC), gain correction, histogram equalization, and noise estimation.
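A minimal sketch of the kind of preprocessing described above is shown below, assuming OpenCV and NumPy; the function and parameter values are illustrative, not our actual driver code.

```python
import cv2
import numpy as np

def preprocess_thermal(frame_16bit: np.ndarray) -> np.ndarray:
    """Illustrative 16-bit thermal preprocessing: normalize the raw radiometric
    range, equalize contrast, and suppress salt-and-pepper noise.
    (FFC itself is triggered on the Boson; it is not reproduced here.)"""
    # Stretch the useful radiometric range to 8 bits.
    lo, hi = np.percentile(frame_16bit, [1, 99])
    frame = np.clip((frame_16bit - lo) / max(hi - lo, 1), 0, 1)
    frame_8bit = (frame * 255).astype(np.uint8)

    # Local histogram equalization (CLAHE) to recover contrast.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalized = clahe.apply(frame_8bit)

    # Median filter to knock down time-varying salt-and-pepper noise.
    return cv2.medianBlur(equalized, 3)
```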
RealSense D456: This camera provides ground truth and some depth estimation when flying outdoors or in environments where visual features can be leveraged. The RealSense has two imagers, one color sensor, and one IR projector, along with an internal IMU.
IMU: The drone relies on two inertial measurement units: an external Epson G365 IMU and the IMU integrated in the RealSense D456. The team has worked on improving the Epson IMU driver code as a by-product of testing and debugging issues with it.
Mechanical Design
Mounts are being designed for the two thermal cameras, keeping in mind the cameras' FOV and protection from crashes.
Our design, shown below, places the cameras within the convex hull of the drone's body, making them safer during crashes since they will not be the first parts to strike the ground. The design also keeps the propellers out of the cameras' FOV. The mounting positions make the thermal camera mounts single-fault tolerant, in the sense that every degree of freedom is secured in at least two mounting directions in case any one of them becomes loose.

Because we are iterating rapidly while we settle on our sensor payload and drone design, the major subassemblies of the drone are designed not only for robustness but also for modularity and ease of manufacturing and assembly.


The integrated hardware for the drone includes:
- Nvidia AGX Orin for compute,
- 2 FLIR Boson cameras for thermal vision,
- Intel RealSense D456 for RGBD vision,
- LiPo battery for power,
- Garmin LIDAR-Lite v3 for altitude sensing, and
- Pixracer Pro as the flight controller,
all built on a Hexsoon EDU-650 airframe.

Electrical
The drone now carries two FLIR Boson cameras, an Intel RealSense D456, an Epson G365 IMU, and the Nvidia Jetson AGX Orin. Interfacing and synchronizing the sensor stack with the compute and the flight controller is complicated and has to be meticulously designed. We have achieved sensor interfacing and integration, and we document all the components of our subsystems with wiring diagrams and takeaways. Below is an example of how our sensors interface with the Nvidia Orin's GPIO pins.
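As one concrete illustration of GPIO-level interfacing, the sketch below toggles a sync pulse on an Orin header pin using the Jetson.GPIO library; the pin number and pulse rate are placeholders, not our actual wiring.

```python
import time
import Jetson.GPIO as GPIO

SYNC_PIN = 7       # placeholder BOARD pin number, not our actual wiring
PULSE_HZ = 30      # placeholder frame-sync rate

GPIO.setmode(GPIO.BOARD)
GPIO.setup(SYNC_PIN, GPIO.OUT, initial=GPIO.LOW)

try:
    # Emit a square wave that downstream sensors can use as a frame trigger.
    while True:
        GPIO.output(SYNC_PIN, GPIO.HIGH)
        time.sleep(0.5 / PULSE_HZ)
        GPIO.output(SYNC_PIN, GPIO.LOW)
        time.sleep(0.5 / PULSE_HZ)
finally:
    GPIO.cleanup()
```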

Additionally, the FLIR thermal cameras and the Epson G365 IMU connect through custom breakout boards designed by alumni, for which there is no documentation. Understanding their work and documenting it allows us to debug the issues we are facing. Currently, the IMU gets stuck on initialization and does not consistently publish messages. By studying the breakout board design, we were able to eliminate the power supply as a cause of the problem.

Odometry
The Odometry subsystem is responsible for predicting the drone's change in pose over time. Throughout the semester, we explored and tuned a variety of algorithms for this purpose. Following advice from SLAM experts at AirLab, Shibo Zhao and Parv Maheshwari, we started by looking at Robust Visual Thermal Inertial Odometry (ROVTIO) and Uncertainty-Aware Multi-Spectral Inertial Odometry (MSO), the latter authored at AirLab.
ROVTIO is an EKF-based odometry method that stores features only for a short duration. Since it is EKF-based, it has no concept of loop closure and thus does not benefit from revisiting old points. ROVTIO uses traditional feature detectors and is a purely classical approach. Because it is EKF-based, it has few parameters and can therefore be tuned to work with any visual or inertial sensor relatively quickly.
MSO uses a factor-graph-based approach, which initializes variables and measurements and then optimizes to satisfy all constraints. It identifies features using ThermalPoint, a feature detector that works on 16-bit thermal images. MSO is flexible enough to work with a wide variety of sensors, and we are attempting to use it with one of the two mounted thermal cameras or with a dual-mono setup.
We initially benchmarked both these methods extensively on outdoor datasets (Caltech Aerial Dataset) and in large indoor spaces (SubT-MSO). Through these tests, we were able to glean a much more thorough understanding of both algorithms and their strengths.

However, when we tried implementing these algorithms on the Phoenix Pro system, the biggest challenge we faced with the above methods was identifying the coordinate frames and how they are oriented. Although both algorithms accept IMU-thermal camera extrinsics, they have certain undocumented assumptions that make it difficult to configure the hardware with the software effectively. For example, MSO accepts a gravity vector as a config parameter, but the logic embedded in the code assumes gravity points in the negative z direction.
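One quick sanity check before setting such parameters is sketched below: average a few seconds of accelerometer data while the drone sits still and read off which body axis gravity lies along. The function is illustrative and assumes the samples have already been collected (e.g., from the IMU topic); it is not part of either algorithm.

```python
import numpy as np

def report_gravity_axis(accel_samples):
    """accel_samples: N x 3 accelerometer readings (m/s^2) taken at rest.
    At rest an accelerometer measures specific force, which points opposite
    to gravity, so a z-up IMU should show roughly +9.81 on its z axis."""
    mean = np.asarray(accel_samples, dtype=float).mean(axis=0)
    axis = int(np.argmax(np.abs(mean)))
    force_sign = '+' if mean[axis] > 0 else '-'
    gravity_sign = '-' if force_sign == '+' else '+'
    print(f"mean specific force: {np.round(mean, 2)} m/s^2")
    print(f"gravity points along {gravity_sign}{'xyz'[axis]} in the IMU frame")
```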
Following suggestions from our mentor at AirLab, we also began experimenting with Metrics-Aware Covariance for Learning-based Stereo Visual Odometry (MAC-VO), another odometry approach developed at AirLab. MAC-VO estimates 2D covariances from a learning-based model and projects them into 3D to determine optimal keypoints for feature tracking. We found significantly more success with MAC-VO and chose it as our odometry method for the spring semester.

To maximize MAC-VO's performance as a visual odometry algorithm, we enhanced the testing scene with multiple thermal features such as humans, hand warmers, and lights, and made sure to traverse the environment slowly so that the algorithm could match as many keypoints as possible across consecutive timeframes.
MAC-VO also supports swapping out the underlying modules it uses for covariance estimation, keypoint selection, and pose graph optimization in a modular manner. Striving for a balance of performance and speed, the final configuration we settled on after much testing was the FlowFormer-with-covariance estimator as the frontend model, covariance-aware keypoint selection, and a two-frame optimizer minimizing reprojection error.
Additionally, to optimize speed, we parallelized the pipeline by running the frontend model on the Orin's GPU and the optimizer on the CPU. We also initialized MAC-VO's underlying CUDA Graph Optimizer independently, before initializing the Depth Estimation subsystem, so that the CUDA optimizer would not capture any unnecessary inferences from the depth estimation model, FoundationStereo. The algorithm runs at about 1.6 Hz on its own onboard the Orin AGX and at 0.8 Hz when run in conjunction with FoundationStereo.
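The sketch below shows the general producer/consumer pattern we mean by running the frontend and the optimizer in parallel; `run_frontend` and `run_optimizer` are hypothetical stand-ins, not MAC-VO's actual API, and the sketch only illustrates the GPU/CPU split.

```python
import queue
import threading

frontend_out = queue.Queue(maxsize=2)   # small buffer so the GPU never runs far ahead

def frontend_loop(frames, run_frontend):
    # GPU side: keypoints + covariances per stereo frame (hypothetical call).
    for frame in frames:
        frontend_out.put(run_frontend(frame))   # blocks if the optimizer lags
    frontend_out.put(None)                      # sentinel: no more frames

def optimizer_loop(run_optimizer, publish_pose):
    # CPU side: pose-graph optimization consumes frontend outputs as they arrive.
    while (measurement := frontend_out.get()) is not None:
        publish_pose(run_optimizer(measurement))

def run_parallel(frames, run_frontend, run_optimizer, publish_pose):
    t1 = threading.Thread(target=frontend_loop, args=(frames, run_frontend))
    t2 = threading.Thread(target=optimizer_loop, args=(run_optimizer, publish_pose))
    t1.start(); t2.start()
    t1.join(); t2.join()
```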
Dense Depth Estimation from Thermal Cameras
Our original depth estimation subsystem uses MoGe to estimate relative depth and MadPose to compute metric depth. The operational workflow is as follows:

- Time Synchronization: Captured images from the stereo thermal setup are time-synchronized using a hardware pulse.
- Preprocessing: The 16-bit thermal data is preprocessed to optimize feature extraction.
- Relative Depth Estimation: The MoGe algorithm generates relative depth maps for both the left and right thermal cameras.
- Metric Depth Computation: The MadPose optimizer computes metric depth using the relative depth outputs from both cameras, leveraging camera extrinsic parameters.

A problem we encountered with this approach is that the recovered scale varied significantly across consecutive time steps, so we were unable to recover consistent local point clouds over time. We theorize that this is because the problem is ill-posed: the solver attempts to solve not only for the variables mentioned above but also for the extrinsics, so the system can admit multiple solutions.
We also experimented with cutting-edge machine-learning research for 3D reconstruction, such as Mast3r and other algorithms. We did not find success with these approaches either. We theorize that these approaches rely heavily on the semantic appearance of objects in RGB images and utilize priors about their relative positions and sizes to perform 3D reconstruction. These priors do not manifest in thermal images, so these approaches, trained on RGB images, fail.


The lack of semantic priors in thermal images meant that we had to try classical methods that rely on image intensity features. Classically, depth is calculated from a stereo pair of images by estimating disparity along epipolar lines. In layman's terms, the projections of nearer objects are farther apart in the two images. The distance between the projections of a world point in both images is called disparity, and the distance of the object is inversely proportional to disparity. This is what we use to find real-world distances.
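The underlying relation is depth = focal length × baseline / disparity. A minimal sketch of this conversion is shown below; the focal length and baseline values are placeholders, not our calibration.

```python
import numpy as np

def disparity_to_depth(disparity_px: np.ndarray,
                       focal_px: float = 500.0,       # placeholder focal length (pixels)
                       baseline_m: float = 0.20) -> np.ndarray:  # placeholder baseline (m)
    """Convert a disparity map (pixels) to metric depth (meters) via Z = f*B/d."""
    depth = np.full_like(disparity_px, np.inf, dtype=float)
    valid = disparity_px > 0                      # zero/negative disparity is invalid
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth
```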
Classically, disparity is found using a sliding-window approach that searches for where a pixel and its neighborhood appear in the other image. The match is determined by minimizing the intensity distance between the neighborhood and all candidate locations in the other image.
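OpenCV's semi-global block matcher is one off-the-shelf implementation of this sliding-window idea; the sketch below shows how one would apply it to rectified, 8-bit-converted thermal frames, with illustrative parameter values.

```python
import cv2

def classical_disparity(left_8bit, right_8bit):
    """Semi-global block matching on rectified 8-bit images; returns disparity in pixels."""
    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=64,      # must be a multiple of 16
        blockSize=7,            # neighborhood (window) size
        P1=8 * 7 * 7,           # smoothness penalty for small disparity changes
        P2=32 * 7 * 7,          # smoothness penalty for large disparity changes
        uniquenessRatio=10,
    )
    # StereoSGBM returns fixed-point disparities scaled by 16.
    return matcher.compute(left_8bit, right_8bit).astype(float) / 16.0
```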
This approach did not work with thermal images because they are extremely noisy: they contain time-varying salt-and-pepper noise, which throws off any attempt to compare pixel neighborhoods.
We were advised to use FoundationStereo for this task of estimating disparity between the images. This is a learning-based method that takes in two images and outputs a disparity image. This algorithm has several advantages over the other learning-based approaches attempted above:
- FoundationStereo is trained on multispectral images, and thus it generalizes to multiple image modalities, including thermal.
- This method does not rely on semantic priors as the other ones did. Calculating disparity does not require semantic information; all the information about the geometry is encoded in the relationships between the pixels of the two images.
We obtained disparity values from this method and scaled them to a metric point cloud, as shown in the figure below.
The original model is trained and run in PyTorch. We used TensorRT to serialize and accelerate the model for inference at our image size and on our hardware. With this serialized model we perform fast inference and are able to publish depth point clouds at 1 Hz.
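A typical export path is sketched below, under the assumption that the model can be traced to ONNX; the stub network, input shapes, and file names are placeholders, not FoundationStereo's actual interface.

```python
import torch
import torch.nn as nn

class StereoNetStub(nn.Module):
    """Tiny stand-in for the real stereo network, used only to keep this export
    sketch self-contained; the real model in our pipeline is FoundationStereo."""
    def forward(self, left, right):
        # Pretend "disparity" is just the channel-mean difference of the two views.
        return (left - right).mean(dim=1, keepdim=True)

model = StereoNetStub().eval()
left = torch.randn(1, 3, 512, 640)    # dummy thermal-sized inputs (H=512, W=640)
right = torch.randn(1, 3, 512, 640)

# Trace the model to ONNX at the fixed resolution used on the drone.
torch.onnx.export(
    model, (left, right), "stereo_model.onnx",
    input_names=["left", "right"], output_names=["disparity"],
    opset_version=17,
)

# The ONNX graph is then built into a serialized TensorRT engine for the Orin, e.g.:
#   trtexec --onnx=stereo_model.onnx --saveEngine=stereo_model.plan --fp16
```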

Mapping
Mapping is the task of creating and updating an accurate 3D map of the environment.
After attempting other approaches, we settled on a voxel-based occupancy mapping approach provided by the OctoMap ROS node. This node ingests the drone-to-world-origin transformation and the current depth point cloud in the drone's frame to incrementally update the global map.
The global map is composed of voxels of a fixed size, each storing the probability that the space in that voxel is occupied. The mapping node also publishes a 2D occupancy map of the environment that marks unoccupied voxels, where firefighters can move, and occupied voxels, which contain objects or walls.
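OctoMap maintains these per-voxel probabilities as log-odds so that each new observation becomes a simple addition; a minimal sketch of that update rule (not OctoMap's actual code) is shown below.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

L_HIT, L_MISS = logit(0.7), logit(0.4)   # illustrative sensor-model values
L_MIN, L_MAX = logit(0.12), logit(0.97)  # clamping keeps the map able to change later

def update_voxel(log_odds: float, hit: bool) -> float:
    """Fuse one observation of a voxel (hit = a ray endpoint fell inside it)."""
    log_odds += L_HIT if hit else L_MISS
    return min(max(log_odds, L_MIN), L_MAX)

def occupancy_probability(log_odds: float) -> float:
    return 1.0 - 1.0 / (1.0 + math.exp(log_odds))
```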
An image of our final mapping capabilities is shown below:

Networking
The Networking subsystem handles wireless transmission of data from the drone to the operators at the Ground Control Station.
Hardware
We have at our disposal:
- The Orin’s WiFi
- GCS Air Unit (on board drone)
- GCS Ground Unit (on ground)
- Control Computer
Original Architecture
We have two alternative networking solutions:
Orin Wifi
The Orin publishes a WiFi hotspot using its own hardware and attached antennas.
This approach is unstable and fragile: WiFi strength drops quickly with distance.
We use it for quick-and-dirty communication, where we expect disconnects.
GCS Air to Ground LAN
The GCS enables us to bridge a LAN connection between the GCS ground and air units.
We extend this to the Control Computer and the Orin, respectively.
We set up routes so that the Orin and the Control Computer are on the same subnet and route messages via the Air Unit LAN, making them visible at both ends.
Video Transmission
Sending video frame-by-frame is inefficient.
We encode the video into an h.264 stream and transmit the encoded data over the network.
This results in less data being sent, as encoded video avoids transmitting redundant information.
At the receiver, we decode the stream and publish it to an Image topic, so that ROS tools like RViz can view the images.
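One way to set up such a stream, sketched under the assumption that OpenCV is built with GStreamer support and with placeholder resolution and receiver address, is to push frames into an x264-encoding pipeline:

```python
import cv2

WIDTH, HEIGHT, FPS = 640, 512, 30           # placeholder frame size and rate
RECEIVER = "192.168.1.10"                   # placeholder ground-station address

# appsrc feeds our frames into an H.264 encoder tuned for low latency,
# which is then packetized as RTP and sent over UDP to the receiver.
pipeline = (
    "appsrc ! videoconvert ! "
    "x264enc tune=zerolatency speed-preset=ultrafast bitrate=2000 ! "
    f"rtph264pay ! udpsink host={RECEIVER} port=5600"
)
writer = cv2.VideoWriter(pipeline, cv2.CAP_GSTREAMER, 0, FPS, (WIDTH, HEIGHT), True)

def send_frame(bgr_frame):
    """Push one BGR frame (HxWx3, uint8) into the encoded stream."""
    writer.write(bgr_frame)
```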
Final Architecture
Initially, we planned to handle all of our network communication over the radio connection of our Herelink Ground Control Station (GCS) unit. However, we soon realized this wasn’t feasible due to significant network overhead and the limited bandwidth of only 17 Mbps. Therefore, we decided to decouple the control and data transmission pipelines while leveraging the optimized video transmission link available on Herelink. Our final system design is shown in the following diagram:

Our final networking system had the following operational flow:
- Data Transmission: To leverage the optimized video link available on Herelink, we displayed all important data on the screen of the Orin, the drone's compute, using a custom Visualizer Node and RViz. We then forwarded its display output to the Herelink GCS ground unit. This allowed a direct video stream from the Orin to our ground computer at 30 frames per second in 1080p quality.
- Flight Data Transmission: The Herelink air unit was connected to the flight controller, which relayed status information throughout the flight, ensuring safety.
- Control: We set up an external Local Area Network (LAN) using a router to gain direct access to the Orin via VNC. This allowed us to achieve real-time control over the drone's computing system.
