System Design
February 10, 2026
22 min read

System Design for 10M+ Users: Scaling Architectures for Senior Roles

The definitive 2026 system design guide for senior engineers. From capacity estimation to CAP theorem, learn to architect scalable systems for 10M+ users with our 4-step interview framework, SQL vs NoSQL decision matrix, and real-world case studies.


Thinking Beyond the Single Server

In senior technical interviews (L5/L6 at FAANG, Staff at startups, Principal at scale-ups), you aren't being asked to build an app — you're being asked to architect a highly available, fault-tolerant, globally distributed system. Whether it's Google, Amazon, Microsoft in the US, Spotify in Sweden, Grab in Singapore, or Flipkart in India — the system design interview evaluates your ability to think at scale. 10 million users is the classic benchmark, but the principles extend from 1M to 1B+.

This guide walks you through every layer of a production-grade architecture, the math behind capacity planning, and exactly how to structure your 45-minute interview for maximum impact.

Step 0: Capacity Estimation — Start with Math

Before drawing any boxes, quantify the problem. This impresses interviewers immediately and separates senior candidates from juniors who jump straight into architecture.

Example calculation for a social media feed (10M DAU):

  • Daily Active Users (DAU): 10M
  • Requests per user per day: ~20 (feed loads, likes, comments, profile views)
  • Total daily requests: 200M/day = ~2,300 requests/second (QPS)
  • Peak QPS (2-3x average): ~5,000-7,000 QPS
  • Storage per post: 1KB text + 500KB media = ~500KB average
  • New posts per day (5% of users post): 500K posts × 500KB = ~250GB/day new data
  • Read:Write ratio: ~100:1 (reads far exceed writes — this heavily influences your caching and replication strategy)

Why this matters: These numbers drive every architectural decision. 5,000 QPS is manageable with a few horizontally scaled servers. But 500K QPS (for a billion-user system) requires fundamentally different patterns. Always anchor your design in real numbers.
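The arithmetic above fits in a few lines of a back-of-envelope script. All inputs are the assumed figures from the worked example (10M DAU, 20 requests/user/day, 500KB/post, 5% of users posting):

```python
# Back-of-envelope capacity estimation for a 10M-DAU feed service.
DAU = 10_000_000
REQUESTS_PER_USER_PER_DAY = 20
SECONDS_PER_DAY = 86_400

daily_requests = DAU * REQUESTS_PER_USER_PER_DAY              # 200M/day
avg_qps = daily_requests / SECONDS_PER_DAY                    # ~2,315 QPS
peak_qps = avg_qps * 3                                        # ~6,944 QPS at 3x average

POST_SIZE_BYTES = 500 * 1024                                  # ~500KB per post
posting_users = DAU * 0.05                                    # 5% of users post daily
daily_storage_gb = posting_users * POST_SIZE_BYTES / 1024**3  # ~238 GiB/day

print(f"avg QPS: {avg_qps:,.0f}, peak QPS: {peak_qps:,.0f}, "
      f"new data: {daily_storage_gb:,.0f} GiB/day")
```

Note that the storage figure lands near 238 GiB/day in binary units versus the ~250GB/day quoted in decimal units; either is acceptable in an interview as long as you state your convention.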

1. The Foundation: DNS, CDN, and Load Balancing

Before traffic hits your servers, it passes through several layers:

  • DNS (Domain Name System): Translates your domain (app.example.com) to an IP address. For global availability, use GeoDNS to route users to the nearest data center. Services: AWS Route 53, Cloudflare DNS, Google Cloud DNS.
  • CDN (Content Delivery Network): Serve all static assets (images, videos, JS, CSS bundles) from edge locations physically closest to the user. This alone can reduce latency by 50-80% for global users. Services: Cloudflare, AWS CloudFront, Fastly, Akamai.
  • Load Balancer (LB): Distributes incoming requests across multiple application servers. For 10M users, you need multiple tiers:
  • L4 (Transport Layer): Routes based on IP and TCP port. Fast, low-overhead. Used for initial traffic distribution (e.g., AWS NLB).
  • L7 (Application Layer): Routes based on HTTP headers, URL paths, cookies. Enables path-based routing (/api → API servers, /static → CDN origin). Used for intelligent routing (e.g., Nginx, HAProxy, AWS ALB).
  • Algorithms: Round Robin (simple, equal distribution), Weighted Round Robin (send more traffic to stronger servers), Least Connections (route to the server handling the fewest requests), IP Hash (sticky sessions for stateful scenarios).
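Two of the algorithms above, Round Robin and Least Connections, reduce to a few lines each. This is a minimal sketch; the server names and connection counts are illustrative placeholders:

```python
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]

# Round Robin: hand out servers in a fixed rotation.
rotation = cycle(servers)

def round_robin():
    return next(rotation)

# Least Connections: route to the server with the fewest active requests.
# In a real LB this map is maintained as connections open and close.
active_connections = {"app-1": 12, "app-2": 3, "app-3": 7}

def least_connections():
    return min(active_connections, key=active_connections.get)
```

Least Connections outperforms Round Robin when request durations vary widely, since slow requests pile up connections on a server that Round Robin would keep feeding anyway.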

2. API Gateway & Service Discovery

At scale, you don't expose microservices directly. An API Gateway sits between clients and your backend services, providing:

  • Rate Limiting: Protect your system from abuse. Example: "Each user can make at most 100 API calls per minute." Implement using a Token Bucket or Sliding Window algorithm backed by Redis.
  • Authentication/Authorization: Validate JWT tokens, check permissions, and reject unauthorized requests before they reach your business logic — reducing load on downstream services.
  • Request Routing: Route /users/* to User Service, /orders/* to Order Service, /payments/* to Payment Service.
  • Response Caching: Cache frequently requested, rarely changing data at the gateway level (e.g., product catalog, country list).
  • Service Discovery: In a microservices architecture, services need to find each other dynamically. Use tools like Consul, etcd, or Kubernetes Service DNS for automatic service registration and health checking.

Tools: Kong, AWS API Gateway, Nginx (with Lua), Envoy Proxy, Traefik.
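The Token Bucket algorithm mentioned under Rate Limiting can be sketched in a few lines. This is a single-process version for illustration; in production the bucket state lives in Redis so that every gateway node enforces the same limit, and the clock is the real system time rather than an argument:

```python
# Minimal token-bucket rate limiter sketch.
class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity            # max burst size
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1                # spend one token for this request
            return True
        return False                        # bucket empty: reject (HTTP 429)
```

The "100 API calls per minute" policy from the example would be `TokenBucket(capacity=100, refill_per_sec=100 / 60)`, allowing short bursts up to 100 while holding the sustained rate to 100/minute.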

3. Scaling the Web/Application Tier (Statelessness)

Your application servers must be completely stateless. This is the single most important architectural principle for horizontal scalability:

  • No local session storage: Any session data, user tokens, or temporary state MUST be stored in a shared external cache (Redis, Memcached). This allows your Auto-Scaling groups to freely destroy or create new server instances without losing user context.
  • Containerization: Package each service as a Docker container. Use Kubernetes (K8s) for orchestration — it handles auto-scaling, rolling deployments, self-healing (restarting failed pods), and load balancing automatically.
  • Auto-Scaling policies: Scale based on CPU utilization (>70%), memory usage, request queue depth, or custom metrics. AWS Auto Scaling Groups, GCP Managed Instance Groups, and K8s Horizontal Pod Autoscaler (HPA) all support this.
  • Health Checks: Every server must expose a /health endpoint. The load balancer continuously polls this endpoint and removes unhealthy instances from the pool within seconds.
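The CPU-based auto-scaling policy above can be expressed as a small decision function. This follows the rough shape of the Kubernetes HPA formula (desired = ceil(current × usage / target)); the 70% target is the threshold assumed in the bullet above:

```python
import math

def desired_replicas(current_replicas, avg_cpu_pct, target_cpu_pct=70):
    """Return the replica count that would bring average CPU back to target."""
    # Scale out when usage exceeds target, scale in when it falls below,
    # but never drop below one replica.
    return max(1, math.ceil(current_replicas * avg_cpu_pct / target_cpu_pct))
```

For example, 4 pods averaging 90% CPU yields 6 desired pods; real autoscalers add cooldown windows and tolerance bands on top of this to avoid flapping.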

4. Managing the Database Bottleneck

The database is the most common bottleneck at scale. 10M users will crush a single relational database. You need a multi-layered strategy:

SQL vs NoSQL Decision Framework

When choosing between SQL (PostgreSQL, MySQL) and NoSQL (MongoDB, DynamoDB, Cassandra), compare along these criteria:

  • Data Relationships: complex relationships with many JOINs favor SQL; denormalized, self-contained documents favor NoSQL.
  • Consistency Model: strong consistency (ACID) favors SQL; NoSQL fits when eventual consistency is acceptable.
  • Schema: a fixed, well-defined schema favors SQL; a flexible, evolving schema favors NoSQL.
  • Scale Pattern: SQL scales vertically first, then via read replicas; NoSQL scales horizontally from the start.
  • Best For: SQL for banking, orders, and user accounts; NoSQL for feeds, logs, IoT, and real-time analytics.

Scaling Strategies

  • Read Replicas: Route all SELECT statements to replicated follower databases (async replication) to ease the burden on the Primary write node. Most applications are 90%+ reads. Tools: PostgreSQL streaming replication, MySQL read replicas, Amazon RDS.
  • Caching (Look-Aside / Write-Through): Use Redis or Memcached between your application and database. Check the cache before hitting the DB for read-heavy operations. Cache hit rates of 95%+ mean your database only handles 5% of actual traffic. Understand cache invalidation strategies: TTL-based, event-driven, write-through.
  • Sharding (Horizontal Partitioning): When data is too large for one server, split the database horizontally. Common shard keys: User ID % N (good for user-centric data), Geographic region (good for localized services), Hash-based (even distribution). Challenges: cross-shard queries, rebalancing, hotspots.
  • Connection Pooling: Use PgBouncer (PostgreSQL) or ProxySQL (MySQL) to manage database connections efficiently. A single PostgreSQL instance typically handles 100-500 concurrent connections — without pooling, 1,000 application pods would each try to open separate connections, overwhelming the DB.

5. The CAP Theorem — Know It Cold

In any distributed system, you can guarantee at most 2 of 3 properties:

  • Consistency (C): Every read returns the most recent write. All nodes see the same data at the same time.
  • Availability (A): Every request receives a response (even if it's not the most recent data). The system never refuses to respond.
  • Partition Tolerance (P): The system continues to operate despite network partitions (communication failures between nodes).

In practice: Network partitions WILL happen, so you always need P. The real choice is between CP (sacrifice availability for consistency — e.g., banking systems, HBase) and AP (sacrifice consistency for availability — e.g., social media feeds, Cassandra, DynamoDB). In an interview, explicitly state your choice and why.

6. Asynchronous Processing (Message Queues)

Never make a user wait for a slow process. This is a core principle of responsive system design:

  • Pattern: If a user uploads an image, immediately return a "success" response (HTTP 202 Accepted). Push the actual image resizing, thumbnail generation, and virus scanning into a Message Queue. Worker nodes pull tasks asynchronously.
  • Use Cases: Email/SMS notifications, payment processing, ML inference pipelines, data aggregation, report generation, and any operation taking >500ms.
  • Tool Selection:
  • Kafka: Distributed event streaming. Best for high-throughput, event-sourcing, and real-time data pipelines. Used at LinkedIn, Netflix, Uber. Retains messages for configurable periods (days/weeks), enabling reprocessing.
  • RabbitMQ: Traditional message broker. Best for task queues, pub/sub with complex routing, and when message acknowledgment guarantees are critical.
  • Amazon SQS: Fully managed, serverless. Best for simple async task processing without managing infrastructure. Integrates natively with Lambda.
  • Redis Streams: Lightweight, fast. Best when you already use Redis and need simple stream processing without deploying a separate broker.
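The upload pattern described above, respond immediately and do the slow work off the request path, can be sketched with the standard library. Here `queue.Queue` stands in for Kafka/RabbitMQ/SQS, and "thumbnail generation" is simulated by a string append:

```python
import queue
import threading

tasks = queue.Queue()
results = []

def worker():
    while True:
        job = tasks.get()
        if job is None:                          # sentinel: shut the worker down
            break
        results.append(f"thumbnail:{job}")       # slow work happens in the background
        tasks.task_done()

def handle_upload(image_id):
    tasks.put(image_id)                          # enqueue and return right away
    return 202                                   # HTTP 202 Accepted

t = threading.Thread(target=worker)
t.start()
```

The request handler's latency is just the enqueue time; the user never waits for resizing or virus scanning. A real broker adds what `queue.Queue` lacks: durability across restarts, delivery acknowledgments, and consumers on separate machines.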

7. Resiliency Patterns — Designing for Failure

At 10M+ users, failures aren't exceptions — they're constants. Senior engineers design systems that gracefully degrade rather than catastrophically fail:

  • Circuit Breaker: If a downstream service fails repeatedly, stop calling it temporarily. After a cooldown period, send a test request. If it succeeds, resume normal traffic. Libraries: Hystrix (deprecated but conceptually important), Resilience4j (Java), Polly (.NET).
  • Retry with Exponential Backoff: If a request fails, retry after 1s, then 2s, then 4s, then 8s — with jitter (random offset). This prevents the "thundering herd" problem where thousands of retries hit the recovering server simultaneously.
  • Bulkhead Pattern: Isolate critical resources (database connections, thread pools) per service. If Service A's connection pool is exhausted, it shouldn't prevent Service B from accessing the database.
  • Graceful Degradation: If the recommendation engine is down, show trending content instead of personalized feeds. If the image service is slow, serve lower-resolution cached thumbnails. The user experience degrades slightly but never breaks entirely.
  • Idempotency: Design APIs so that retrying the same request multiple times produces the same result. Critical for payment systems — a user clicking "Pay" twice shouldn't result in double charges. Use idempotency keys.
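The retry-with-backoff schedule above (1s, 2s, 4s, 8s plus jitter) is a short function. In this sketch `sleep_fn` is injected so the logic can be exercised without real waiting; production code would use `time.sleep` directly:

```python
import random
import time

def retry(op, max_attempts=5, base_delay=1.0, sleep_fn=time.sleep):
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise                              # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)    # exponential: 1, 2, 4, 8...
            delay += random.uniform(0, delay)      # jitter spreads retries apart
            sleep_fn(delay)
```

The jitter term is what prevents the thundering herd: without it, every client that failed at the same moment retries at the same moment too.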

8. Monitoring, Observability & Alerting

A senior design isn't complete without Observability — the "three pillars":

  • Metrics (Quantitative): CPU utilization, memory, request latency (p50, p95, p99), error rates, QPS, database connection pool usage. Tools: Prometheus + Grafana, Datadog, CloudWatch.
  • Logging (Qualitative): Structured logs (JSON) with request IDs for tracing. Centralized logging using ELK stack (Elasticsearch, Logstash, Kibana) or Grafana Loki. Critical for debugging production issues.
  • Distributed Tracing: Track a single request as it flows through 5-10 microservices. Identify which service is the bottleneck. Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM.
  • Alerting: Set up PagerDuty/OpsGenie alerts for critical thresholds: error rate > 1%, p99 latency > 2s, disk usage > 80%. Define runbooks for each alert type.
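The p50/p95/p99 latency metrics listed above are just percentiles over raw samples. As a sketch, here is the nearest-rank method (one of several common percentile definitions; monitoring systems typically compute this over sliding time windows, often from histograms rather than raw samples):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample >= p% of the data."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based rank into the sorted list
    return ordered[rank - 1]
```

p99 matters more than the average because at 5,000 QPS, a "rare" 1% slow path still hits 50 users every second.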

9. Security at Scale

Mentioning security unprompted in a system design interview demonstrates senior-level thinking:

  • Authentication & Authorization: OAuth 2.0 / OpenID Connect for third-party auth. JWT with short expiry + refresh tokens for API auth. Role-Based Access Control (RBAC) for authorization.
  • Data Encryption: TLS 1.3 for all data in transit. AES-256 for sensitive data at rest. Never store passwords — use bcrypt or Argon2 for hashing.
  • DDoS Protection: Use Cloudflare or AWS Shield for edge-level protection. Implement rate limiting at the API Gateway.
  • Input Validation: Validate and sanitize all user inputs. Use parameterized queries to prevent SQL injection. Implement Content Security Policy (CSP) headers to prevent XSS.
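The parameterized-query point above is worth seeing concretely. This sketch uses the standard-library `sqlite3` as a stand-in for a production database driver; the table and data are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

def find_user(name):
    # The driver binds the ? placeholder safely; user input is never
    # spliced into the SQL string, so injection payloads stay inert data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

A classic injection payload like `alice' OR '1'='1` simply matches no row, because it is compared as a literal string rather than parsed as SQL.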

Classic System Design Problems & What They Test

  • Design Twitter/X Feed: fan-out, caching, timeline generation, pub/sub. Asked at Meta, Twitter, Google.
  • Design URL Shortener (TinyURL): hashing, base62 encoding, read-heavy optimization. Asked at Amazon, Microsoft (entry-level SD).
  • Design WhatsApp/Chat System: WebSockets, message queues, presence detection, E2E encryption. Asked at Meta, Grab, Shopee.
  • Design YouTube/Netflix: CDN, video transcoding, adaptive bitrate, storage. Asked at Google, Netflix, Amazon.
  • Design Uber/Ride Sharing: location indexing (Geohash/S2), matching algorithms, ETA calculation. Asked at Uber, Grab, Lyft.
  • Design Rate Limiter: token bucket, sliding window, distributed rate limiting. Asked at Stripe, Cloudflare, and API companies.
  • Design Notification System: priority queues, delivery guarantees, multi-channel (push/email/SMS). Asked at Amazon, Flipkart, and e-commerce companies generally.

The 4-Step Interview Framework (45 Minutes)

In a real interview, drive the conversation using this proven structure:

  1. Requirements & Scope (5 min): Ask clarifying questions. "Should we focus on the read path or write path? What's the expected scale? Is global availability required? What are the SLA requirements?" Narrow the problem space.
  2. Capacity Estimation (5 min): Do the back-of-envelope math. Calculate QPS, storage needs, and bandwidth. This anchors all subsequent decisions in reality.
  3. High-Level Design (15 min): Draw the architecture — clients, load balancers, API servers, databases, caches, message queues, CDN. Explain the data flow for both read and write paths. Define the API contracts.
  4. Deep Dive (15-20 min): The interviewer will pick 1-2 components to deep dive into. Be ready to discuss: database schema design, caching strategy, how you'd handle a specific failure scenario, or how to optimize a particular bottleneck. This is where senior-level candidates shine by discussing trade-offs proactively.

MockExperts provides an interactive System Design whiteboard powered by AI that critiques your architectural trade-offs in real-time. Practice the full 45-minute cycle — including capacity estimation, high-level design, and deep dives — with AI feedback on your communication clarity, technical depth, and time management. Available 24/7 for candidates anywhere in the world.

Common System Design Mistakes

  • Jumping to the solution without clarifying requirements: This is the #1 mistake. Always ask scope questions first.
  • Over-engineering for scale you don't need: Don't propose Kafka, Kubernetes, and 100 microservices for a system with 1,000 users. Start simple and explain how you'd scale up.
  • Ignoring trade-offs: Every decision has a cost. "I chose Redis for caching because of its O(1) lookups and TTL support, but the trade-off is additional infrastructure cost and cache invalidation complexity."
  • Not discussing failure modes: "What happens if Redis goes down? We'd have a cache-aside pattern where the application falls back to the database, with slightly higher latency but no data loss."
  • Drawing without explaining: The whiteboard is a communication tool. Narrate every box you draw: "This is our write-through cache that sits between the API server and the database."
  • Single point of failure (SPOF): Every component should be replicated. Single database? Add replicas. Single load balancer? Add redundancy. Interviewers actively look for SPOFs in your design.

Legal Disclaimer

MockExperts is an independent platform. We do not use proprietary internal information from any specific technology company. All architecture patterns, tools, and approaches discussed are standard industry best practices based on published engineering blogs and documentation. Company names and cloud service names are trademarks of their respective owners and are used here for educational and nominative fair use purposes only.

