30 March 2026 · 6 min read

Real-Time Systems: What Actually Breaks at Scale

I spent three and a half years as the sole engineer behind a real-time transport platform — live GPS, geofenced arrivals, push notifications, thousands of journeys a day for vulnerable passengers. Real-time looks magical in a demo with one device on good wifi. Production is where you learn what it actually costs.

The network is hostile, always

A driver's phone goes through a tunnel, drops to 2G, switches towers, loses signal in a car park. Your "real-time" system has to degrade gracefully through all of it — buffer locally, reconcile on reconnect, never show a parent a stale position as if it were live. The hard engineering isn't the happy-path stream; it's the reconnection story.
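One way to sketch "buffer locally, reconcile on reconnect" is below. It is an illustration under stated assumptions, not the platform's actual code: the `Fix` shape, the per-device sequence number, the 500-fix cap, and the `OfflineBuffer` and `reconcile` names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fix:
    seq: int            # hypothetical per-device sequence number, assigned on the device
    recorded_at: float  # device clock, seconds since the epoch
    lat: float
    lon: float


class OfflineBuffer:
    """Holds fixes while the device is offline.

    Bounded, so a long outage cannot exhaust memory; the cap is an assumption.
    """

    def __init__(self, max_fixes: int = 500) -> None:
        self._pending = deque(maxlen=max_fixes)

    def record(self, fix: Fix) -> None:
        # Once the buffer is full, the oldest fix is silently dropped.
        self._pending.append(fix)

    def drain(self) -> List[Fix]:
        """Return everything buffered, oldest first, and empty the buffer."""
        fixes = list(self._pending)
        self._pending.clear()
        return fixes


def reconcile(last_seq_seen: int, uploaded: List[Fix]) -> List[Fix]:
    """Server side: keep only fixes newer than what is already stored.

    Duplicates from a retried upload collapse to one fix per sequence number,
    and the result is ordered by sequence number.
    """
    fresh = {f.seq: f for f in uploaded if f.seq > last_seq_seen}
    return [fresh[s] for s in sorted(fresh)]
```

Keying reconciliation on a device-assigned sequence number rather than arrival time means a retried or reordered upload cannot rewrite history, which is the property the reconnection path needs.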

Sub-second updates are a budget, not a feature

Every device pushing location every second is a write-amplification problem waiting to happen. You design for the aggregate: batching, debouncing, geofence events instead of a raw firehose, and a data model that can answer "where is vehicle X now" cheaply without scanning history. Decide what "real-time enough" means per use case: the ops dashboard and the parent app have very different needs.
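A minimal sketch of two of those ideas: a latest-position map that answers "where is vehicle X now" with a single lookup, and geofence transitions emitted as events instead of forwarding every raw fix. The `VehicleTracker` and `Geofence` names, the single circular fence, and the in-memory dictionaries are assumptions for illustration; a real system would likely hold this state in a shared store.

```python
import math
from dataclasses import dataclass
from typing import Optional

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in metres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@dataclass(frozen=True)
class Geofence:
    name: str
    lat: float
    lon: float
    radius_m: float


class VehicleTracker:
    """Keeps only the newest fix per vehicle and reports fence transitions."""

    def __init__(self, fence: Geofence) -> None:
        self.fence = fence
        self.latest = {}   # vehicle_id -> (ts, lat, lon)
        self._inside = {}  # vehicle_id -> bool

    def ingest(self, vehicle_id: str, ts: float, lat: float, lon: float) -> Optional[str]:
        """Returns "entered", "exited", or None when nothing changed."""
        previous = self.latest.get(vehicle_id)
        if previous is not None and ts <= previous[0]:
            return None  # late or duplicate fix: never move the position backwards
        self.latest[vehicle_id] = (ts, lat, lon)

        inside = haversine_m(lat, lon, self.fence.lat, self.fence.lon) <= self.fence.radius_m
        was_inside = self._inside.get(vehicle_id, False)
        self._inside[vehicle_id] = inside
        if inside and not was_inside:
            return "entered"
        if was_inside and not inside:
            return "exited"
        return None
```

Downstream consumers then see a handful of transitions per journey rather than one message per second, and the current position is a dictionary lookup rather than a query over history.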

Push notifications are a distributed-systems problem in disguise

Delivery isn't guaranteed, ordering isn't guaranteed, and a duplicate "your child has arrived" alert erodes trust fast. Idempotency keys, dedup windows, and a clear story for the notification that fires twice or not at all — that's most of the work behind a feature that looks like one line.
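The idempotency-key and dedup-window idea can be sketched as follows. The key format, the ten-minute window, and the `NotificationDeduper` name are assumptions chosen for the example, and the in-process dictionary stands in for whatever shared store a multi-instance deployment would need.

```python
import time
from typing import Callable, Dict


def arrival_key(journey_id: str, stop_id: str) -> str:
    """Hypothetical idempotency key: one arrival alert per journey and stop."""
    return f"arrived:{journey_id}:{stop_id}"


class NotificationDeduper:
    """Suppresses a repeat of the same notification inside a time window."""

    def __init__(
        self,
        window_s: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window_s = window_s
        self._clock = clock  # injectable so the window can be tested without sleeping
        self._sent_at: Dict[str, float] = {}

    def should_send(self, idempotency_key: str) -> bool:
        now = self._clock()
        # Forget keys whose window has passed so the map does not grow forever.
        self._sent_at = {
            key: sent for key, sent in self._sent_at.items()
            if now - sent < self._window_s
        }
        if idempotency_key in self._sent_at:
            return False  # already sent inside the window: suppress the duplicate
        self._sent_at[idempotency_key] = now
        return True
```

This only covers the notification that fires twice. The one that never arrives needs a separate path, such as the app fetching current state when it opens instead of relying on the push alone.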

Trust is the real SLA

For safety-critical real-time software, the metric that matters isn't latency — it's whether people believe what the screen tells them. 99.5%+ uptime mattered, but never showing a confidently-wrong position mattered more. Build for honesty under degradation, and the rest follows.
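"Honesty under degradation" can be made concrete by labelling every position with its age before it reaches a screen. The thresholds (10 and 60 seconds), the wording of the labels, and the function names below are assumptions for illustration; the right values depend on the use case.

```python
from enum import Enum


class Freshness(Enum):
    LIVE = "live"
    DELAYED = "delayed"
    STALE = "stale"


def classify(age_s: float, live_within_s: float = 10.0, stale_after_s: float = 60.0) -> Freshness:
    """Map the age of the newest fix to a freshness level."""
    age_s = max(age_s, 0.0)  # a device clock slightly ahead of the server counts as age zero
    if age_s <= live_within_s:
        return Freshness.LIVE
    if age_s <= stale_after_s:
        return Freshness.DELAYED
    return Freshness.STALE


def describe_position(age_s: float) -> str:
    """Text shown next to the marker, so an old position is never presented as live."""
    level = classify(age_s)
    if level is Freshness.LIVE:
        return "Live"
    if level is Freshness.DELAYED:
        return f"Last updated {int(age_s)}s ago"
    return f"Position unavailable (last seen {int(age_s // 60)} min ago)"
```

The point of the sketch is that staleness is a first-class value in the data handed to the UI, not something the client has to infer from a timestamp.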

← Back to Nathaniel Wilson