Subsystem Description

Data Capture

The Data Capture subsystem captures data from which we can extract realistic behavior for our models. We experimented with four sources: open driving datasets, a steering-wheel-based system, simulator data, and traffic-camera videos.

Incorporating open datasets turned out to be challenging: they did not include all the details we required, and inferring the missing information was not always feasible and could introduce inaccuracies.

We built a pipeline for collecting data with the steering-wheel-based system and had planned to have people drive within the simulator while we recorded the states. With the onset of the pandemic, we decided against this in order to maintain social distancing.

Finally, we settled on primarily using simulator data for prototyping and for the majority of the behavior parameters. We also use real-world data, but because such data is limited and non-traditional behavior in it is sparse, we use it mostly as a “test set”.

Image and Trajectory Processing

Once we have the recorded video of the intersections, the second major subsystem comes into play. The preprocessing subsystem takes the video and extracts useful information (such as position, orientation, velocity, and state) about all the relevant actors (cars and pedestrians) and environment objects (traffic lights, signs, etc.).
The system detects, tracks, and stores the states of each object across the given time range of the video.

The major steps within this subsystem are described below:

    • Object Detection

      In this stage, we want to detect and classify all the relevant objects in the captured video. We use a deep-neural-network-based approach for detection and classification because it gives the higher accuracy that we need. We are using a Faster R-CNN model from the detectron2 framework, which performs well for our use case. The model simultaneously gives us the detection box (bounding box) and the class of the object (car, pedestrian, traffic light, etc.). We used a model pre-trained on the MS-COCO dataset and fine-tuned it on images and ground truth captured from CARLA, using the same traffic-camera-based data-capture pipeline discussed previously. We had 2688 images in the training set and 672 images in the test set.
      After the model is trained, we evaluate it against metrics such as precision and recall. The detected boxes and classes can be seen in the figure below.
      One major point to note is that an object detection model detects objects in one image at a time, so the same object in consecutive frames produces separate detections. No connection across frames can be derived using an object detector alone. This is where the next stage comes in: tracking.
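As an illustration of the precision/recall evaluation mentioned above, here is a minimal sketch for a single frame. It assumes boxes are (x1, y1, x2, y2) tuples in pixels and that a detection counts as correct when its class matches and its IoU with an unmatched ground-truth box is at least 0.5; the function names, the data layout and the 0.5 threshold are illustrative assumptions, not the project's actual evaluation code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Tuple[Box, str, float]],   # (box, class label, score)
    truths: List[Tuple[Box, str]],         # (box, class label)
    iou_thresh: float = 0.5,               # assumed threshold
) -> Tuple[float, float]:
    """Greedy matching: highest-scoring predictions claim ground truth first."""
    matched = set()
    tp = 0
    for box, label, _score in sorted(preds, key=lambda p: -p[2]):
        best_j, best_iou = -1, iou_thresh
        for j, (gt_box, gt_label) in enumerate(truths):
            if j in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp    # predictions with no matching ground truth
    fn = len(truths) - tp   # ground-truth objects that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    return precision, recall
```

In practice the counts would be accumulated over every frame of the test set before the two ratios are computed.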

    • Tracking Objects
      The purpose of tracking is to extract the trajectories of the vehicles so that they can be fed to the learning model.
      The tracking pipeline is as follows:

      • Tracking via Simple Online and Realtime Tracking (SORT)
        The 2D bounding boxes from the detector are tracked in the camera view. SORT is a Kalman filter based tracker that tracks the location and size of each 2D rectangular box. The IDs assigned by the tracker are then used to track these locations in the bird’s-eye view shown in the figure.
      • Homography transform
        We take the detected 2D bounding box and use the mid-point of its lowest edge. We transform this image point to the bird’s-eye view (BEV) using a precomputed homography transform.
      • Bird’s-eye-view tracking
        The tracker’s prediction uses a constant-velocity motion model in pixel space.
        Data association is treated as a linear assignment problem and solved using the Hungarian method; SciPy’s implementation was used.
      • Traffic light detection
        A custom CNN allows us to detect traffic lights in every frame.
      • Tracking evaluation in BEV using MOTA/MOTP metrics
        We use the Multiple Object Tracking metrics (MOTA and MOTP) to evaluate the trajectories in the bird’s-eye view, with a distance threshold of 20 pixels (~2 metres).
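The homography and bird's-eye-view tracking steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names, the (x1, y1, x2, y2) box layout and the max_dist gating value are hypothetical, and H stands for whatever 3x3 homography was precomputed for the camera. The association step uses scipy.optimize.linear_sum_assignment, SciPy's solver for the linear assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def bottom_center(box):
    """Mid-point of the lowest edge of an (x1, y1, x2, y2) image box."""
    x1, _y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, y2])


def to_bev(point, H):
    """Map an image point to the BEV plane with a 3x3 homography H."""
    p = H @ np.array([point[0], point[1], 1.0])  # homogeneous coordinates
    return p[:2] / p[2]                          # normalise by the scale term


def predict(position, velocity, dt=1.0):
    """Constant-velocity prediction in BEV pixel space."""
    return position + velocity * dt


def associate(predicted, detected, max_dist=30.0):
    """Match predicted track positions to detections by Euclidean distance.

    max_dist (pixels) is a hypothetical gate that rejects implausible pairs.
    Returns a list of (track_index, detection_index) pairs.
    """
    predicted = np.asarray(predicted, dtype=float).reshape(-1, 2)
    detected = np.asarray(detected, dtype=float).reshape(-1, 2)
    if len(predicted) == 0 or len(detected) == 0:
        return []
    # cost[i, j] = distance between predicted track i and detection j
    cost = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)  # minimises the total cost
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```

For each frame, every detection would be reduced to its bottom-center point and projected with to_bev, each existing track would be advanced with predict, and associate would decide which detection continues which track. Unmatched detections would start new tracks.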


Modelling Subsystem

The Modelling subsystem provides a model for generating realistic behavior for agents at a road intersection inside the simulation. It tries to learn realistic behaviors from a combination of real-world and hand-crafted simulated data. The model replicates the traffic behavior from the trajectories extracted by the preprocessing subsystem. In the imitation learning and reinforcement learning domains, this model is known as the optimal policy. It will also provide tunable parameters so that certain behaviors can be observed more often, making the simulation platform more suitable for autonomous vehicle testing. The system is not “learning how to drive”; rather, it is trying to model a traffic scenario at an intersection for testing a self-driving vehicle. Hence, we assume that the entire world state is known to the subsystem.

Further implementation details can be found here.

Simulation Subsystem

The Simulation Subsystem simulates a realistic environment given the model from the Modelling subsystem. Its major functions are running a specific scenario and supporting multiple agents simultaneously. We will primarily use CARLA as the simulation platform.

Implementation Details:

Data Capture Subsystem

Data Processing Subsystem

Modelling Subsystem

Simulation Subsystem