Subsystem Description

Data Capture

The Data Capture subsystem collects data from which we can extract realistic behavior for our models. It draws on three sources: open driving datasets, a steering-wheel-based system, and traffic-camera videos.

We have broadened our search for realistic data and now use all three sources above. Originally we planned to use only traffic-camera videos. However, as we delved deeper into the modelling subsystem and identified approaches for training the model, we realized that the models would need a lot of training data to learn the behaviors. Based on initial estimates, we would need around 5-6 hours of “good” training data (i.e. data containing a good amount of interaction between vehicles). If we capture just 6 hours of real-world video, there is a high chance that only 1-2 hours of that footage will be useful for training; at that yield, collecting 5-6 useful hours would require roughly 15-36 hours of raw footage. The training data would also need a lot of variation for the model to generalize and learn well. Capturing data at this scale (accounting for multiple rounds of data capture at multiple intersections) might not be achievable within the duration of the MRSD Project course. There is also a high risk that the data will not be very useful for the model if it does not contain much variance.

Hence, we started looking for additional sources of data that are easy to obtain and have high variance. Also, considering the pandemic situation of 2020, we have prioritized the other sources over the original traffic-camera video capture to mitigate the risk of not being able to capture data. We still plan to evaluate and test our model’s performance on real data.

The figure below shows an example data capture scenario. We capture data from the highlighted intersection.

Image of an intersection captured from CARLA

Image and Trajectory Processing

Once we have the recorded video of the intersections, the second major subsystem comes into play. The preprocessing subsystem takes the video and extracts useful information (such as position, orientation, velocity, and state) about all the relevant actors (cars and pedestrians) and environment objects (traffic lights, signs, etc.).
The system detects, tracks, and stores the state of each object across the given time range of the video.

The major steps within this subsystem are described below.

    • Object Detection

      In this stage, we want to detect and classify all the relevant objects in the captured video. We use a deep neural network based approach for detection and classification, as it gives the higher accuracy that we need. We are using a Faster R-CNN model from the detectron2 framework, which performs well for our use case. The model simultaneously gives us the detection box (bounding box) and the class of the object (car, pedestrian, traffic light, etc.). We used a model pre-trained on the MS-COCO dataset and fine-tuned it on images and ground truth captured from CARLA, using the same traffic-camera based data capture pipeline discussed previously. We had 2688 images in the training set and 672 images in the test set.
      After the model is trained, we evaluate it against metrics such as precision and recall. Example detections can be seen in the figure below. One major point to note is that an object detection model detects objects in one image at a time, so the same object in consecutive frames produces separate detections. No connection across frames can be derived using an object detector alone. This is where the next stage comes in: tracking.

    • Tracking Objects
      The purpose of tracking is to extract the trajectories of the vehicles so that they can be fed to the learning model.
      The tracking pipeline is as follows:

      • Tracking via Simple Online and Realtime Tracking (SORT)
        The 2D bounding boxes from the detector are tracked in the camera view. SORT is a Kalman filter based tracker that tracks the location and size of each 2D rectangular box. The IDs assigned by the tracker are then used to track these locations in the bird’s-eye view shown in the figure.
      • Homography transform
        We take the detected 2D bounding box and use the mid-point of its lowest edge. We transform this image point to the bird’s-eye view (BEV) using a precomputed homography transform.
      • Birds-eye-view tracking
        The tracker’s prediction uses a constant-velocity motion model in pixel space.
        Data association is treated as a linear assignment problem and solved using the Hungarian method; we use SciPy’s implementation.
      • Tracking evaluation in BEV using MOTA/MOTP metrics
        We use the multi-object tracking metrics (MOTA/MOTP) to evaluate the trajectories in the bird’s-eye view, with a distance threshold of 20 pixels (~2 metres).
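
The homography and bird’s-eye-view association steps above can be sketched roughly as follows. This is a minimal illustration and not the project’s actual code: the function names, the homography matrix H, the sample boxes and tracks, and the 20-pixel association gate (borrowed here from the evaluation threshold) are all assumptions made for the example. Only `scipy.optimize.linear_sum_assignment` is a real SciPy call.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def bottom_midpoint(box):
    """Mid-point of the lowest edge of an (x1, y1, x2, y2) image box."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, max(y1, y2)])


def to_bev(points, H):
    """Map Nx2 image points to BEV pixels using a 3x3 homography H."""
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]  # divide by the scale term


def predict(positions, velocities, dt=1.0):
    """Constant-velocity prediction in BEV pixel space."""
    return positions + velocities * dt


def associate(predicted, detected, max_dist=20.0):
    """Match tracks to detections by solving a linear assignment problem.

    max_dist is a hypothetical gate; pairs farther apart are rejected.
    Returns a list of (track_index, detection_index) pairs.
    """
    if len(predicted) == 0 or len(detected) == 0:
        return []
    # Euclidean distance between every track and every detection.
    cost = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]


# Hypothetical homography: scale by 0.5 and shift by (10, 20).
H = np.array([[0.5, 0.0, 10.0],
              [0.0, 0.5, 20.0],
              [0.0, 0.0, 1.0]])

# Two made-up detector boxes in the camera image.
boxes = [(100, 50, 140, 90), (300, 200, 360, 260)]
detections = to_bev([bottom_midpoint(b) for b in boxes], H)

# Two made-up existing tracks (BEV position and per-frame velocity).
track_pos = np.array([[172.0, 148.0], [66.0, 63.0]])
track_vel = np.array([[2.0, 1.0], [3.0, 1.0]])

matches = associate(predict(track_pos, track_vel), detections)
print(matches)
```

In practice the homography would be estimated once per camera from known point correspondences, and unmatched detections would typically start new tracks; both are left out of this sketch.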


Modelling Subsystem

The modelling subsystem provides a model for generating realistic behavior for agents at a road intersection inside the simulation. It tries to learn realistic behaviors from a combination of real-world and hand-crafted simulated data. The model replicates the traffic behavior in the trajectories extracted by the preprocessing subsystem. In the imitation and reinforcement learning domain, this model is known as the optimal policy. It will also provide tunable parameters so that certain behaviors can be observed more often, making the simulation platform more suitable for autonomous vehicle testing. The system is not “learning how to drive”; rather, it is trying to model a traffic scenario at an intersection for testing a self-driving vehicle. Hence we assume that the entire world state is known to the subsystem.

Further implementation details can be found here.

Simulation Subsystem

The Simulation Subsystem takes care of simulating a realistic environment given the model from the Modelling subsystem. Its major functions are running a specific scenario and supporting multiple agents simultaneously. This subsystem also handles integration with steering wheel systems such as the Logitech G920. We will primarily use CARLA as the simulation platform.

Implementation Details:

Data Capture Subsystem

Data Processing Subsystem

Modelling Subsystem

Simulation Subsystem