Chapter 6: Architectures and Algorithms for Autonomous Vehicles
Synopsis
The development of fully autonomous vehicles hinges on the seamless integration of complex hardware platforms and sophisticated software algorithms. At its highest level, an autonomous vehicle (AV) architecture must provide reliable sensing of the external environment, accurate interpretation of dynamic scenes, robust decision making under uncertainty, and precise control of the vehicle’s motion. This chapter explores the multi-layered architectures and the core algorithms that together form the “brain” and “nervous system” of autonomous driving systems. From sensor fusion pipelines and perception networks to behaviour planners and motion controllers, each component must satisfy stringent requirements for latency, reliability, and safety, often under highly variable road and weather conditions.
Central to any AV is its sensor suite and the data-processing architecture that transforms raw signals into a coherent understanding of the world. Modern AVs typically employ a heterogeneous array of sensors: high-resolution cameras, lidar scanners, radar units, ultrasonic sensors, and inertial measurement units (IMUs). Each sensor modality offers complementary strengths: cameras capture texture and colour; lidar provides accurate three-dimensional range measurements; radar excels at detecting objects under adverse weather; and IMUs supply precise information on vehicle acceleration and orientation. The architecture for handling this deluge of data often follows a layered design: at the lowest level, dedicated hardware accelerators preprocess raw sensor streams (e.g., image rectification, point-cloud clustering), while higher-level compute nodes execute deep-learning inference and classical filtering algorithms in parallel.
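To make the low-level preprocessing step concrete, the following is a minimal, illustrative sketch of point-cloud clustering by Euclidean proximity, the kind of operation a hardware-accelerated front end might perform before deep-learning inference. The clustering radius and the sample points are assumptions for the example, not real sensor data.

```python
# Greedy proximity clustering of a 2-D point cloud (illustrative only).
from math import hypot

def cluster_points(points, radius=1.0):
    """BFS clustering: points closer than `radius` end up in one cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            # Pull in all unvisited points within `radius` of point i.
            near = [j for j in list(unvisited)
                    if hypot(points[i][0] - points[j][0],
                             points[i][1] - points[j][1]) < radius]
            for j in near:
                unvisited.remove(j)
                cluster.append(j)
                frontier.append(j)
        clusters.append(sorted(cluster))
    return clusters

# Two well-separated groups of lidar returns -> two clusters.
pts = [(0.0, 0.0), (0.3, 0.1), (0.5, -0.2), (10.0, 10.0), (10.2, 9.9)]
print(sorted(cluster_points(pts)))
```

Production pipelines use spatial indexing (e.g., k-d trees) to avoid the quadratic neighbour search shown here, but the grouping logic is the same.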
At the heart of the perception stack lies sensor fusion, where measurements from disparate sensors are merged to generate a unified, probabilistic model of the surrounding environment. Whether implemented via extended Kalman filters, particle filters, or neural-network-based fusion layers, these algorithms must reconcile differing update rates, measurement uncertainties, and field-of-view characteristics. The resulting fused representation typically takes the form of an occupancy grid, object list, or semantic map. Crucially, the fusion process must be designed for real-time operation, often within tens of milliseconds, to ensure that subsequent planning modules act on the most up-to-date information.
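A minimal sketch of the filtering idea, assuming a one-dimensional constant-velocity model: a linear Kalman filter fuses position fixes from two hypothetical sensors with different measurement variances (an accurate "lidar-like" source and a noisier "radar-like" one). The noise values are illustrative assumptions, not real sensor specifications.

```python
# 1-D constant-velocity Kalman filter fusing measurements of differing
# accuracy (illustrative sketch, 2x2 covariance handled by hand).

class KalmanFilter1D:
    def __init__(self, pos=0.0, vel=0.0):
        self.x = [pos, vel]                  # state: position, velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]    # state covariance
        self.q = 0.01                        # process-noise scale

    def predict(self, dt):
        # x <- F x with F = [[1, dt], [0, 1]]; P <- F P F^T + Q.
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        (p00, p01), (p10, p11) = self.P
        self.P = [[p00 + dt * (p01 + p10) + dt * dt * p11 + self.q,
                   p01 + dt * p11],
                  [p10 + dt * p11, p11 + self.q]]

    def update(self, z, r):
        # Scalar position measurement z with variance r (H = [1, 0]).
        (p00, p01), (p10, p11) = self.P
        s = p00 + r                          # innovation covariance
        k0, k1 = p00 / s, p10 / s            # Kalman gain
        y = z - self.x[0]                    # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]

kf = KalmanFilter1D()
# Interleave accurate (r=0.05) and noisy (r=0.5) position fixes of a
# target moving at roughly 1 m/s.
for z, r in [(1.0, 0.5), (2.1, 0.05), (3.0, 0.5), (4.05, 0.05)]:
    kf.predict(dt=1.0)
    kf.update(z, r)
print(round(kf.x[0], 2), round(kf.x[1], 2))  # position and velocity estimates
```

The gain automatically weights the accurate sensor more heavily; an extended Kalman filter follows the same predict/update cycle with linearized models.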
Software stack: from perception to motion control
The software stack of an autonomous vehicle orchestrates a sequence of interdependent modules, from raw sensor inputs through perception and decision making to precise motion control, that together enable safe and reliable self-driving operation. At its base lies the perception layer, which ingests heterogeneous data streams from cameras, lidars, radars, and inertial measurement units (IMUs). Each sensor modality contributes complementary information: cameras deliver rich colour imagery for object classification, lidars provide accurate three-dimensional point clouds for obstacle detection, radars excel at measuring object velocity in adverse weather, and IMUs furnish high-frequency information on vehicle acceleration and orientation. The perception layer typically employs deep-learning networks, such as convolutional neural networks (CNNs) for image segmentation and point-based networks (e.g., PointNet) for lidar processing, to extract semantic information (vehicles, pedestrians, lanes) and estimate dynamic object states (positions, velocities, headings) in real time.
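As a toy illustration of dynamic-state estimation at the perception layer, the sketch below recovers an object's speed and heading from two timestamped detections by finite differences. Real stacks run tracked filters over many frames; the `Detection` type and the numbers here are hypothetical.

```python
# Finite-difference estimate of an object's speed and heading from two
# timestamped detections (illustrative sketch).
from dataclasses import dataclass
from math import atan2, hypot, degrees

@dataclass
class Detection:
    t: float   # timestamp [s]
    x: float   # position east [m]
    y: float   # position north [m]

def estimate_state(prev: Detection, curr: Detection):
    dt = curr.t - prev.t
    vx = (curr.x - prev.x) / dt
    vy = (curr.y - prev.y) / dt
    speed = hypot(vx, vy)                # m/s
    heading = degrees(atan2(vy, vx))     # degrees; 0 = east, CCW positive
    return speed, heading

speed, heading = estimate_state(Detection(t=0.0, x=0.0, y=0.0),
                                Detection(t=0.5, x=5.0, y=5.0))
print(round(speed, 2), round(heading, 1))  # ~14.14 m/s heading north-east
```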
Immediately above perception sits the sensor fusion and localization subsystem, which merges multi-sensor outputs into a coherent situational model. Classical algorithms like extended Kalman filters (EKFs) or more advanced factor-graph optimizers (e.g., GTSAM) reconcile asynchronous measurements, account for varying measurement uncertainties, and produce both the vehicle’s pose on a high-definition (HD) map and a fused object list. Within this layer, simultaneous localization and mapping (SLAM) techniques may also operate to correct for drift by aligning real-time sensor observations with pre-built map features, such as lane markings or roadside landmarks. The result is an accurate, low-latency estimate of “where am I” and “what is around me,” which the downstream planning modules critically depend upon.
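The drift-correction idea can be sketched in a deliberately simplified form: compare where a surveyed map landmark should appear under the current pose estimate with where the sensors actually place it, and shift the pose by the residual. A real system solves this jointly over many features (e.g., in a factor graph such as GTSAM builds); the function and numbers below are illustrative assumptions.

```python
# Toy translational drift correction against a single map landmark
# (illustrative sketch; real localizers optimize over many features).

def correct_pose(pose, map_landmark, observed_landmark, gain=1.0):
    """pose and landmarks are (x, y) tuples in the map frame."""
    # Residual between the surveyed and the observed landmark position.
    rx = map_landmark[0] - observed_landmark[0]
    ry = map_landmark[1] - observed_landmark[1]
    # Apply (a fraction of) the residual as a pose correction.
    return (pose[0] + gain * rx, pose[1] + gain * ry)

# Drifted odometry makes a lane marking appear 0.4 m east of its surveyed
# map position, so the pose estimate is pulled 0.4 m west.
corrected = correct_pose(pose=(100.0, 50.0),
                         map_landmark=(12.0, 30.0),
                         observed_landmark=(12.4, 30.0))
print(corrected)
```

A `gain` below 1.0 blends the map correction with odometry rather than trusting a single observation outright.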
The perception-to-planning interface then abstracts the fused environment into structured representations: typically occupancy grids for free-space analysis, graph models for road networks, and kinematic states for agents. This abstraction allows the behavioural planner to make high-level decisions: selecting manoeuvres such as lane changes, turns, or yielding to other road users. Behavioural planning often uses state machines or behaviour trees to encode traffic rules and acceptable manoeuvre sequences, while probabilistic frameworks (e.g., Markov decision processes) can model uncertainties in other agents' intentions. The output is a sequence of tactical objectives: "change to the left lane," "prepare to turn right at the next intersection," or "stop behind the lead vehicle."
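A state-machine behavioural planner can be sketched in a few lines. The states and transition conditions below encode only a single simplified lane-change manoeuvre and are assumptions for illustration, not a complete rule set.

```python
# Minimal behavioural state machine for a lane-change manoeuvre
# (illustrative sketch with simplified transition conditions).

KEEP_LANE, PREPARE_CHANGE, CHANGING = "KEEP_LANE", "PREPARE_CHANGE", "CHANGING"

def step(state, slow_lead_vehicle, left_gap_clear, change_complete):
    if state == KEEP_LANE and slow_lead_vehicle:
        return PREPARE_CHANGE            # tactical objective: change left
    if state == PREPARE_CHANGE:
        return CHANGING if left_gap_clear else PREPARE_CHANGE
    if state == CHANGING and change_complete:
        return KEEP_LANE
    return state

state = KEEP_LANE
trace = []
for obs in [dict(slow_lead_vehicle=True,  left_gap_clear=False, change_complete=False),
            dict(slow_lead_vehicle=True,  left_gap_clear=True,  change_complete=False),
            dict(slow_lead_vehicle=False, left_gap_clear=True,  change_complete=True)]:
    state = step(state, **obs)
    trace.append(state)
print(trace)  # -> ['PREPARE_CHANGE', 'CHANGING', 'KEEP_LANE']
```

Behaviour trees generalize this pattern with reusable condition and action nodes, and an MDP formulation would replace the hard transitions with expected-utility maximization over uncertain agent intentions.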
