Chapter 4: Data, Features, and Vector Infrastructure
Synopsis
Data is the raw material of intelligent clouds, but it only becomes leverage when it is organized into features that models can trust and into vectors that applications can search by meaning. This chapter frames “Data, Features, and Vector Infrastructure” as the substrate that powers every other pattern in the book: autoscaling that anticipates demand, semantic routing that finds the right service or model, retrieval-augmented generation that grounds LLMs, and governance that proves decisions were made responsibly. We move beyond storage to living systems: streams that continuously shape signals, feature stores that guarantee offline/online parity, and vector databases that turn text, images, audio, and logs into dense representations ready for millisecond retrieval. The goal is reliability and relevance: features that are correct at the moment of use, vectors that reflect the latest truth, and contracts that keep teams aligned as data and models evolve.
The path from raw data to intelligent behavior starts with acquisition and normalization. Change data capture (CDC) connects operational databases to streaming backbones; object stores provide durable, cheap history; and schema registries keep producers and consumers from drifting apart. Yet correctness is not just “schema matches.”
A modern contract encodes units, ranges, null semantics, and lineage, plus data-quality SLOs like freshness, completeness, and drift thresholds. In practice, each record carries provenance (source, timestamp, sensitivity tags) and is validated at ingress with executable rules. Privacy-by-design enters here: minimize fields, tokenize, or redact early, and attach residency constraints that follow the payload through pipelines.
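The “executable rules at ingress” idea can be made concrete with a small validator. The sketch below is illustrative, not a real library: the contract dictionary, the field names (`amount_usd`, `country`), and the required provenance tags are all hypothetical, standing in for whatever a schema registry or contract tool would supply.

```python
# Minimal sketch of an ingress-time data contract: per-field type, range,
# and null rules, plus required provenance tags. All names are hypothetical.
CONTRACT = {
    "amount_usd": {"type": float, "min": 0.0, "nullable": False},
    "country": {"type": str, "nullable": True},
}
REQUIRED_PROVENANCE = {"source", "event_ts", "sensitivity"}

def validate(record: dict) -> list:
    """Return a list of contract violations; an empty list means the record passes."""
    errors = []
    for name, rule in CONTRACT.items():
        value = record.get(name)
        if value is None:
            # Null semantics are part of the contract, not an afterthought.
            if not rule["nullable"]:
                errors.append(f"{name}: null not allowed")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
        elif "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
    # Every record must carry its provenance through the pipeline.
    missing = REQUIRED_PROVENANCE - record.get("provenance", {}).keys()
    if missing:
        errors.append(f"missing provenance tags: {sorted(missing)}")
    return errors

ok = validate({"amount_usd": 10.0, "country": "US",
               "provenance": {"source": "orders_db", "event_ts": 1700000000,
                              "sensitivity": "low"}})
```

In a real pipeline these rules would be generated from the registered contract rather than hand-written, so producers and validators cannot drift apart.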
Real-time feature engineering: streaming joins, windows, and CDC
Real-time feature engineering converts raw streams into decision-ready signals fast enough to influence live products. Instead of nightly ETL, systems compute features continuously using streaming joins, windowed aggregations, and change data capture (CDC) from source systems. Streaming joins align heterogeneous event streams (clicks, payments, inventory) on time and keys so models see coherent snapshots. Windows transform unbounded flows into bounded statistics such as counts, rates, or recency; session windows capture bursts of human behavior better than fixed intervals. CDC turns operational database mutations into ordered, replayable events, letting features reflect truth without hammering OLTP.
1. Streaming joins and time semantics
Streaming joins create a shared “now” across asynchronous sources, but correctness depends on time semantics. Event time respects when something happened; processing time reflects when the system saw it. Use event time with watermarks so engines wait long enough for stragglers without stalling indefinitely. Choose join types deliberately: inner for high precision, left/outer to preserve sparse keys, interval and temporal-table joins to bind facts to the version valid at the event’s timestamp. Prevent skew by partitioning on composite keys and salting hot partitions.
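To make the interval-join and watermark mechanics tangible, here is a deliberately tiny in-memory sketch, not a streaming engine: two streams (call them “clicks” and “payments”) are buffered per key, pairs within `max_gap` event-time units are emitted, and advancing the watermark evicts state that can never join again. The class and stream names are assumptions for illustration.

```python
from collections import defaultdict

class IntervalJoin:
    """Toy event-time interval join with watermark-driven state eviction.
    A sketch of the idea only; real engines add checkpointing, ordering,
    and out-of-order handling."""

    def __init__(self, max_gap):
        self.max_gap = max_gap
        self.left = defaultdict(list)    # key -> [(event_ts, payload)]
        self.right = defaultdict(list)

    def on_left(self, key, ts, payload):
        # Buffer the event, then emit any matches already waiting on the right.
        self.left[key].append((ts, payload))
        return [(payload, r) for rts, r in self.right[key]
                if abs(ts - rts) <= self.max_gap]

    def on_right(self, key, ts, payload):
        self.right[key].append((ts, payload))
        return [(l, payload) for lts, l in self.left[key]
                if abs(ts - lts) <= self.max_gap]

    def advance_watermark(self, wm):
        # Events older than (watermark - max_gap) can never match a future
        # arrival, so they are safe to evict; this bounds state size.
        horizon = wm - self.max_gap
        for buf in (self.left, self.right):
            for key in list(buf):
                buf[key] = [(t, p) for t, p in buf[key] if t >= horizon]
```

The watermark is the lever that trades completeness for latency: advance it aggressively and late stragglers are dropped; hold it back and state (and end-to-end delay) grows.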
2. Windows: tumbling, sliding, and session features
Windows convert infinite streams into bounded context. Tumbling windows produce non-overlapping aggregates ideal for reporting and stable rates; sliding windows track recency with overlap, improving sensitivity to change; session windows group bursts separated by inactivity, matching human behavior for personalization, fraud, and anomaly baselines. Define window size from model sensitivity and business latency budgets, not habit: short windows react quickly but increase variance; long windows smooth noise but lag. Use keyed state for per-entity windows and pre-aggregate with combiners to reduce shuffle cost.
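The two window shapes that matter most here, tumbling and session, can each be expressed in a few lines. This is a batch-style sketch over already-collected events, assuming integer timestamps and made-up parameter names; a streaming engine would compute the same assignments incrementally over keyed state.

```python
def tumbling_counts(events, size):
    """Assign each (ts, key) event to a non-overlapping window of width
    `size` and count events per (key, window_start): a tumbling aggregate."""
    counts = {}
    for ts, key in events:
        window_start = (ts // size) * size
        counts[(key, window_start)] = counts.get((key, window_start), 0) + 1
    return counts

def sessionize(timestamps, gap):
    """Group timestamps into session windows: a new session starts whenever
    the idle time since the previous event exceeds `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # continue the current burst
        else:
            sessions.append([ts])     # inactivity gap: open a new session
    return sessions
```

Note the structural difference: tumbling windows are defined by the clock alone (every event maps to exactly one bucket), while session boundaries are defined by the data itself, which is why sessions track bursts of human behavior so well.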
