“Design a globally distributed configuration propagation service that pushes config updates to tens of thousands of servers within seconds, with versioning, rollback, and strong delivery guarantees.”
Here’s how to approach it:
Start by clarifying the core requirements:
- Config changes must propagate worldwide within seconds
- Strong versioning and atomic rollout per region
- Rollback must be instantaneous
- Agents must validate the integrity and signature of configs
- Updates must be durable, auditable, and conflict-free
Core components:
- Control plane API and metadata store
- Regional coordinators with version tracking
- Fan-out push clusters (WebSocket / long-poll)
- Edge agents with local cache + signature verification
Primary flow:
- Admin submits config draft -> validated and versioned
- Control plane writes an immutable version record
- Regional coordinators fetch the new version and publish rollout metadata
- Push clusters notify connected agents
- Agents fetch, verify, apply, persist, then ack
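The control-plane side of this flow can be sketched in a few lines. All class and field names below are illustrative, not a prescribed schema; the point is that version records are immutable and activation (or rollback) is just repointing:

```python
import hashlib
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfigVersion:
    """Immutable version record written by the control plane."""
    version_id: int
    blob_key: str       # pointer to the config blob in object storage
    checksum: str       # SHA-256 of the blob, re-verified by agents
    created_at: float

class ControlPlane:
    def __init__(self):
        self._versions: dict[int, ConfigVersion] = {}
        self._next_id = 1
        self.active_version: int | None = None

    def submit_draft(self, blob: bytes) -> ConfigVersion:
        """Validate, checksum, and persist a new immutable version."""
        record = ConfigVersion(
            version_id=self._next_id,
            blob_key=f"configs/v{self._next_id}",
            checksum=hashlib.sha256(blob).hexdigest(),
            created_at=time.time(),
        )
        self._versions[record.version_id] = record
        self._next_id += 1
        return record

    def activate(self, version_id: int) -> None:
        """Flip the active pointer; rollback is the same operation."""
        if version_id not in self._versions:
            raise KeyError(version_id)
        self.active_version = version_id
```

Note that rollback needs no special machinery here: activating a prior version is identical to activating a new one.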
Reliability / Guarantees:
- At-least-once notification, exactly-once version application
- Commit = agent-verified checksum and signature
- Agent retries until a successful fetch
- Coordinators track rollout health; failed agents quarantined
- Rollback = publish the prior version as the new active pointer
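The "at-least-once notification, exactly-once application" guarantee lives on the agent. A minimal sketch (names are illustrative): duplicate notifications are no-ops because apply is keyed by version id, and the commit only happens after the checksum verifies:

```python
import hashlib

class Agent:
    """Edge agent: notifications may arrive many times, but each
    version is verified and applied at most once (idempotent apply)."""
    def __init__(self):
        self.applied: set[int] = set()
        self.current: int | None = None

    def on_notify(self, version_id: int, blob: bytes, checksum: str) -> bool:
        if version_id in self.applied:   # duplicate notification: no-op
            return False
        if hashlib.sha256(blob).hexdigest() != checksum:
            # integrity check failed; agent keeps retrying the fetch
            raise ValueError("checksum mismatch")
        self.applied.add(version_id)     # commit only after verification
        self.current = version_id
        return True
```

A real agent would also verify a signature over the checksum and persist `applied` to disk for restart resilience, but the idempotency shape is the same.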
Scaling strategy:
- Coordinators horizontally sharded by region
- Push clusters scaled via connection fan-out; stateless frontends
- Agents maintain persistent connections to the nearest region
- Version store globally replicated via multi-region quorum
- Backpressure via staged rollouts
Data & storage:
- Version metadata: strongly consistent store (etcd/Spanner/ZK)
- Config blobs: object storage with immutable keys
- Hot metadata cached at coordinators
- Agents store applied versions locally for restart resilience
- Indexed by version, region, rollout status
Observability & Ops:
- Metrics: propagation latency, success rate, agent ack skew
- Logging: version creation, audit trails, signature verification results
- Tracing: publish path from control plane → coordinators → push nodes
- Alerts: stalled regions, agent failure clusters, version drift
Edge cases & trade-offs:
- Coordinators overloaded: staggered rollout windows
- Split-brain version pointers: strong quorum guards
- Agents offline for long periods: delayed version reconciliation
- Cost trade-off: persistent connections vs periodic pull
- Propagation latency vs blast radius (progressive deployments)
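Progressive deployment is easy to make deterministic: hash each agent into a fraction of the fleet and widen the admitted fraction stage by stage. A sketch (the wave percentages are hypothetical):

```python
import hashlib

# Hypothetical wave schedule: fraction of the fleet admitted per stage.
WAVES = [0.01, 0.10, 0.50, 1.00]

def allowed_in_stage(agent_id: str, stage: int) -> bool:
    """Deterministically bucket agents so each stage only widens the
    blast radius; an agent admitted at stage N stays admitted at N+1."""
    h = int(hashlib.sha256(agent_id.encode()).hexdigest(), 16)
    fraction = (h % 10_000) / 10_000
    return fraction < WAVES[min(stage, len(WAVES) - 1)]
```

Because the bucketing is a pure function of the agent id, coordinators need no shared state to agree on who is in the current wave.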
How to say it in an interview:
“I’d design this system using a global control plane with immutable versioning, regional coordinators for scoped rollout, and fan-out push clusters for low-latency propagation. The system scales through regional sharding and stateless push nodes, maintains reliability via version pointers, retries, and signature verification, and remains observable with latency, ack, and health metrics. This delivers rapid, safe config distribution at a global scale.”
If you like Tweets like this, you’ll enjoy my weekly newsletter, where I share exclusive backend engineering resources to help you become a great Backend Engineer.
“How would you design a distributed cron scheduling system that ensures tasks run exactly once, on time, across multiple nodes without collisions or duplicates?”
Here's how to approach it:
Start with the requirements:
- Define and store cron schedules centrally.
- Multiple scheduler nodes, but only one should trigger a job at any moment.
- Tasks must run exactly once, even if nodes restart or fail.
- Must support retries, backoff, and idempotent execution.
- Visibility into last-run / next-run times.
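The "only one node triggers" requirement usually comes down to a per-run lease acquired through a strongly consistent store. A minimal sketch, using an in-memory stand-in for something like etcd or ZooKeeper (class and function names are illustrative):

```python
import threading

class LeaseStore:
    """Stand-in for a strongly consistent store: acquire(job, run_ts)
    succeeds for exactly one scheduler node per scheduled run."""
    def __init__(self):
        self._claimed: set[tuple[str, int]] = set()
        self._lock = threading.Lock()

    def acquire(self, job_id: str, run_ts: int) -> bool:
        key = (job_id, run_ts)
        with self._lock:
            if key in self._claimed:
                return False        # another node already owns this run
            self._claimed.add(key)
            return True

def maybe_run(store: LeaseStore, job_id: str, run_ts: int, task) -> bool:
    """Every scheduler node calls this at the scheduled time; only the
    node that wins the per-run lease executes the task."""
    if store.acquire(job_id, run_ts):
        task()
        return True
    return False
```

Keying the lease on `(job_id, run_ts)` rather than just the job is what makes restarts safe: a node that crashes and recovers cannot re-claim a run that was already taken.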
“How would you design a multi-tenant notification delivery system that handles email, SMS, and push at scale?”
Here’s how to approach it:
Start with the requirements:
- Multi-tenant isolation (quotas, limits, branding, templates).
- Support multiple channels (email, SMS, push).
- Fault-tolerant, retry-capable delivery
- Ability to plug in multiple third-party providers per channel
- Message tracking and audit logs
Core components:
- Notification API to receive requests from tenant apps.
- Router to classify channel type and pick a provider.
- Queue layer to buffer and retry.
- Workers per channel: email-worker, sms-worker, push-worker.
- Provider adapters to normalize interactions.
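The router-plus-adapter shape above can be sketched directly. Provider names here ("sendgrid-like", "twilio-like") are placeholders, not real integrations; the point is that every provider hides behind a common `send()`:

```python
class ProviderAdapter:
    """Normalizes a third-party provider behind a common send() call."""
    def __init__(self, name: str):
        self.name = name
        self.sent: list[dict] = []

    def send(self, message: dict) -> str:
        self.sent.append(message)
        return f"{self.name}:{len(self.sent)}"   # provider message id

class Router:
    """Picks a provider adapter based on the message's channel."""
    def __init__(self, adapters: dict[str, ProviderAdapter]):
        self._adapters = adapters

    def route(self, message: dict) -> str:
        channel = message["channel"]             # "email" | "sms" | "push"
        if channel not in self._adapters:
            raise ValueError(f"no provider for channel {channel!r}")
        return self._adapters[channel].send(message)
```

Because the router only knows the adapter interface, swapping providers per channel (or per tenant) is a configuration change, not a code change.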
Delivery guarantees:
- Handle high throughput (millions of notifications)
- At-least-once delivery with retries and backoff
- Respect user-level preferences per channel
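For the retry side, exponential backoff with jitter is the standard policy for flaky provider calls. A sketch (the worker assumes sends are idempotent, so a duplicate after a timed-out success is harmless):

```python
import random

def backoff_schedule(base: float = 1.0, cap: float = 60.0, attempts: int = 5):
    """Exponential backoff with full jitter:
    delay_n = random(0, min(cap, base * 2**n))."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

def deliver_with_retries(send, message, attempts: int = 5) -> bool:
    """Retry send() until it succeeds or attempts are exhausted."""
    for delay in backoff_schedule(attempts=attempts):
        try:
            send(message)
            return True
        except Exception:
            pass   # in production: sleep(delay), log, then retry
    return False
```

Messages that exhaust their retries would go to a dead-letter queue for inspection rather than being silently dropped.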