System Design: Scaling to 100 Million Users | Blogs — Ashok Murugan

Introduction — The Scale Wall

Scaling from 1 to 1 million users is about fixing bugs and adding features. Scaling from 1 million to 100 million users is about managing physics. At this scale, even a "one in a billion" error happens several times a day.

As a Staff Engineer at FAANG or a high-growth startup, your job is to anticipate the "Scale Walls" before you hit them.

Milestone 1: The Database Bottleneck

Most systems start with a single relational database. At 1 to 5 million users, you'll hit the limits of vertical scaling (buying a bigger server).

The Solution: Sharding
Instead of one big database, you split your data across 100 smaller ones.

User-based Sharding: Users 1-1M are on Node A, Users 1M-2M are on Node B.
Challenge: Joins become impossible across shards. You must move to an Eventually Consistent event-driven model.

Milestone 2: The Global Latency Wall

At 10 million users, you have people using your app in San Francisco, London, and Tokyo. If your servers are only in Virginia, the "Speed of Light" becomes your biggest competitor.

The Solution: Multi-Region Active-Active
You must run your application in multiple cloud regions simultaneously.

Challenges: How do you sync data between continents?
Strategy: Use "Geo-routing" at the CDN level (Cloudflare/Fastly) to send users to the nearest bucket of servers.

Milestone 3: The "Blast Radius" Management

At 100 million users, a single bug in a core service can take down the whole platform.

The Solution: Cell-based Architecture
Instead of one giant "cloud," you build the system as a collection of independent "Cells." If Cell 5 (serving 5 million users) fails, Cells 1-4 and 6-20 stay perfectly healthy. This is how platforms like Discord and Slack maintain high availability despite having hundreds of millions of users.

Summary of the Scaling Path:

Vertical Scaling: Bigger boxes.
Read Repilcas: Offload reads from the master DB.
Caching: Memcached/Redis for frequent data.
Microservices: Decouple teams and logic.
Horizontal Sharding: Grow the DB indefinitely.
Global Multi-Region: Kill the speed-of-light lag.

"Scaling isn't about being 'bigger'; it's about being 'smaller' through better isolation."

Conclusion

Scaling to 100 million users is a journey of increasing complexity and deliberate isolation. Every layer you add—whether it’s sharding, caching, or cell-based architecture—is a trade-off that buys you another 10x in capacity.