Real-Time Systems: What Actually Breaks at Scale
- real-time
- architecture
- mobile
I spent three and a half years as the sole engineer behind a real-time transport platform — live GPS, geofenced arrivals, push notifications, thousands of journeys a day for vulnerable passengers. Real-time looks magical in a demo with one device on good wifi. Production is where you learn what it actually costs.
The network is hostile, always
A driver's phone goes through a tunnel, drops to 2G, switches towers, loses signal in a car park. Your "real-time" system has to degrade gracefully through all of it — buffer locally, reconcile on reconnect, never show a parent a stale position as if it were live. The hard engineering isn't the happy-path stream; it's the reconnection story.
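To make the reconnection story concrete, here is a minimal sketch in Python. The names (`Fix`, `OfflineBuffer`, `freshness`), the buffer size, and the 15-second "live" threshold are illustrative assumptions, not the platform's actual code. The idea is to timestamp on the device, keep a bounded buffer while offline, replay in capture order on reconnect, and label anything old as stale rather than live.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Fix:
    """One GPS reading, timestamped on the device at capture time."""
    vehicle_id: str
    lat: float
    lon: float
    captured_at: float  # seconds since epoch, device clock (assumed roughly in sync)


class OfflineBuffer:
    """Holds fixes while the connection is down; drops the oldest when full."""

    def __init__(self, max_fixes: int = 500) -> None:
        # A bounded deque silently discards the oldest entry once maxlen is hit.
        self._pending = deque(maxlen=max_fixes)

    def record(self, fix: Fix) -> None:
        self._pending.append(fix)

    def drain(self) -> list:
        """Return buffered fixes in capture order and empty the buffer."""
        fixes = sorted(self._pending, key=lambda f: f.captured_at)
        self._pending.clear()
        return fixes


def freshness(fix: Fix, now: float, live_within_s: float = 15.0) -> str:
    """Label a position so the UI never presents an old fix as live."""
    age_s = now - fix.captured_at
    return "live" if age_s <= live_within_s else "stale"
```

On reconnect the client would upload whatever `drain()` returns. The viewer side calls `freshness()` on every render, so a position that was accurate a minute ago is drawn as "last seen" instead of as a live dot.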
Sub-second updates are a budget, not a feature
Every device pushing its location every second is a write-amplification problem waiting to happen. You design for the aggregate: batching, debouncing, geofence events instead of a raw firehose, and a data model that can answer "where is vehicle X now" cheaply without scanning history. Decide what "real-time enough" means per use case: the ops dashboard and the parent app have very different needs.
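One way to picture that data model is a sketch under assumptions: `LatestPositions`, `Position`, and the 5-second persistence interval are hypothetical names and numbers chosen for illustration. The current position lives in a keyed map, so "where is vehicle X now" is a single lookup, and only a throttled subset of updates is flagged for the history store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Position:
    lat: float
    lon: float
    captured_at: float  # seconds since epoch


class LatestPositions:
    """Keeps one current position per vehicle and throttles history writes."""

    def __init__(self, min_interval_s: float = 5.0) -> None:
        self._min_interval_s = min_interval_s
        self._latest = {}             # vehicle_id -> Position
        self._last_persisted_at = {}  # vehicle_id -> captured_at of last history write

    def update(self, vehicle_id: str, pos: Position) -> bool:
        """Record pos as current; return True if it should also go to history."""
        current = self._latest.get(vehicle_id)
        if current is not None and pos.captured_at <= current.captured_at:
            return False  # late or duplicate update: never move the vehicle backwards
        self._latest[vehicle_id] = pos

        last = self._last_persisted_at.get(vehicle_id)
        if last is not None and pos.captured_at - last < self._min_interval_s:
            return False  # current position updated, history write skipped
        self._last_persisted_at[vehicle_id] = pos.captured_at
        return True

    def where_is(self, vehicle_id: str) -> Optional[Position]:
        """Constant-time lookup of the current position; no history scan."""
        return self._latest.get(vehicle_id)
```

The interval is the budget knob. An ops dashboard might read the in-memory map on every tick, while the history store only ever sees one write per vehicle per interval.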
Push notifications are a distributed-systems problem in disguise
Delivery isn't guaranteed, ordering isn't guaranteed, and a duplicate "your child has arrived" alert erodes trust fast. Idempotency keys, dedup windows, and a clear story for the notification that fires twice or not at all — that's most of the work behind a feature that looks like one line.
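A minimal sketch of the idempotency-key-plus-dedup-window idea follows. `arrival_key`, `NotificationDeduper`, and the 10-minute window are hypothetical, and the in-memory dict stands in for whatever shared store a real deployment would need. The key is derived from what the alert means (this journey, this stop), not from when it happened to fire, so a retry or a second geofence trigger maps to the same key.

```python
def arrival_key(journey_id: str, stop_id: str) -> str:
    """Idempotency key: the same arrival always produces the same key."""
    return f"arrival:{journey_id}:{stop_id}"


class NotificationDeduper:
    """Suppresses repeat sends of the same key within a time window."""

    def __init__(self, window_s: float = 600.0) -> None:
        self._window_s = window_s
        self._seen = {}  # key -> time the notification was first accepted

    def should_send(self, key: str, now: float) -> bool:
        # Evict keys whose window has expired so the map stays bounded.
        self._seen = {
            k: t for k, t in self._seen.items() if now - t < self._window_s
        }
        if key in self._seen:
            return False  # duplicate inside the window: drop it
        self._seen[key] = now
        return True
```

This covers the "fires twice" half. The "not at all" half needs something separate, such as showing arrival state in the app itself, so the push is a convenience and not the only source of truth.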
Trust is the real SLA
For safety-critical real-time software, the metric that matters isn't latency; it's whether people believe what the screen tells them. 99.5%+ uptime mattered, but never showing a confidently wrong position mattered more. Build for honesty under degradation, and the rest follows.