Why Google System Design Is Different
Google's system design interviews are widely considered the most rigorous in the industry — and for good reason. At Google, "scalable" doesn't mean "handles 10,000 users." It means "handles 1 billion users across 200 countries on infrastructure that cannot afford downtime." The engineers you're being evaluated against have built Gmail, Google Search, YouTube, and Google Maps. The bar is calibrated accordingly.
For an SDE-3 role (equivalent to Senior Software Engineer, L5 in Google's internal leveling), the system design round is often the deciding factor. You're expected to take an open-ended prompt and design a production-grade distributed system in 45–60 minutes — thinking out loud the entire time, surfacing constraints, discussing trade-offs, and driving the conversation.
This guide covers the 15 most important Google system design questions with the depth and framing that earns offers at the L5 level.
Google's System Design Evaluation Rubric
Before diving into questions, understand what Google's interviewers are scoring. They use an internal rubric that maps to four dimensions:
- Breadth of knowledge: Do you know the right tools and patterns? (Bigtable vs Spanner vs Firestore — and when to use each.)
- Ability to identify constraints: Do you ask the right clarifying questions before designing? Scale, consistency requirements, latency SLAs.
- Design process quality: Do you start high-level and progressively detail? Or do you dive into implementation details before establishing the architecture?
- Trade-off awareness: The hallmark of a Google SDE-3 is the ability to say "I chose X over Y because of Z, and the downside of X is W which we mitigate by..." — not just listing components.
The 15 Questions
Q1: Design Google Search
The one everyone dreads. This is intentionally open-ended — Google wants to see how you scope it.
Start by scoping: "I'll focus on: web crawling, indexing, query processing, and ranking. I'll defer ads, UI personalization, and autocomplete unless we have time."
Key components and Google-flavoured decisions:
- Crawler: Distributed BFS across the web. Politeness delays per domain (robots.txt). Priority queue weighting by PageRank score. Google's actual crawler (Googlebot) is a massive distributed system using Bigtable to store crawl state.
- Document store: Raw HTML stored in a distributed blob store (GCS/S3 equivalent). Content-addressed (hash of content = key) to detect duplicates.
- Inverted index: The core data structure. Maps word → [docId, position, frequency, ...]. Google shards the index by term range. Built offline in batches using MapReduce/Dataflow, then merged into the live serving index.
- Query processing: Query parsing → term expansion → index lookup → document scoring (BM25 + PageRank + 200+ other signals) → top-K aggregation → result rendering.
- Consistency model: Search indexing is eventually consistent by design. A new webpage may take hours to appear in results. This is acceptable because freshness SLA is measured in hours/days, not milliseconds.
Interview signal: Say "Google uses MapReduce and Bigtable for index construction" — this shows you've studied their public systems papers (which Google expects SDE-3 candidates to have read).
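To make the inverted index concrete, here is a minimal, purely illustrative sketch (toy whitespace tokenization, no stemming, stop words, or scoring) of how documents map to word → postings lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build word -> [(doc_id, position), ...] postings from {doc_id: text}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return dict(index)

docs = {1: "google search engine", 2: "search the web"}
index = build_inverted_index(docs)
# "search" has postings in both documents: [(1, 1), (2, 0)]
```

A real serving index adds frequencies, compression (delta-encoded postings), and term-range sharding; the structure, though, is exactly this mapping.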
Q2: Design YouTube
Scale context to establish: 500 hours of video uploaded per minute. 2 billion monthly logged-in users. 1 billion hours watched daily.
Key design decisions:
- Video upload pipeline: Client upload → raw video storage → async transcoding pipeline (FFmpeg at scale, output 5–8 resolutions from 360p to 4K) → CDN distribution. Transcoding is embarrassingly parallel — split video into chunks, transcode in parallel, reassemble.
- Video storage: Videos are immutable after upload. Object storage (GCS). CDN caching: popular videos cached at edge nodes globally. Long-tail videos served from origin.
- Video serving (adaptive bitrate streaming): HLS or MPEG-DASH. Client requests manifest → fetches segments at appropriate quality based on measured bandwidth. This is what prevents buffering.
- View count and engagement: Counters at YouTube scale can't use traditional DB rows (write contention on a single hot row). Solution: sharded or approximate counters (probabilistic structures), with the displayed count converging through eventual consistency.
- Recommendation system: Two-stage: candidate generation (retrieve 1000 candidate videos) → ranking (score all 1000 with a full ML model). Google published a paper on this — mentioning it earns points.
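To illustrate the write-contention point, here is a toy sharded counter (the ShardedCounter class is a hypothetical sketch, not YouTube's implementation): increments scatter across shards so no single row is a hotspot, and reads sum the shards.

```python
import random

class ShardedCounter:
    """Spread increments across N shards to avoid a single write hotspot.
    Reads sum all shards, so the displayed count is eventually consistent."""
    def __init__(self, num_shards=16):
        self.shards = [0] * num_shards

    def increment(self, amount=1):
        # Pick a random shard; contention drops by roughly num_shards.
        self.shards[random.randrange(len(self.shards))] += amount

    def value(self):
        return sum(self.shards)

views = ShardedCounter()
for _ in range(1000):
    views.increment()
```

In a real store each shard would be its own row or key, and a periodic job would roll shards up into the displayed total.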
Q3: Design a Distributed Rate Limiter
Why Google asks this: Rate limiting is a real problem across all Google APIs. They want to see if you understand the distributed coordination challenge.
Key design decisions:
- Local vs global rate limiting: A per-node counter is fast but inaccurate (N nodes can each allow the full rate, so up to N× slips through). A global counter is accurate but adds latency. Hybrid: local token bucket with periodic sync to a global Redis counter. Accept a small tolerance (~10% over-limit) in exchange for low latency.
- Algorithm choice: Token Bucket (allows bursting), Sliding Window Log (precise but memory-intensive), Sliding Window Counter (best accuracy-performance balance using Redis INCR + EXPIRE).
- Multi-dimensional limiting: Per user, per IP, per API key, per endpoint. Each dimension has its own bucket — a request consumes from all applicable buckets simultaneously.
- Failure mode: What happens if Redis is down? Fail-open (allow all requests — risks abuse) vs fail-closed (reject all — breaks functionality). Google typically fails open with monitoring alerts.
- Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After (when limited). These are standard and Google interviewers expect you to mention them.
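A minimal in-process token bucket sketch (illustrative only: a production limiter would back this with Redis, handle clock skew, and sync across nodes as described above):

```python
import time

class TokenBucket:
    """Token bucket: refill at `rate` tokens/sec up to `capacity`.
    Bursts up to `capacity` are allowed; sustained rate is `rate`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]  # burst of 12 against capacity 10
```

The first 10 requests pass immediately (the burst), and the rest are rejected until the bucket refills.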
Q4: Design Google Drive / Cloud Storage
Key design decisions:
- Block-level deduplication: Files are split into fixed-size blocks (e.g., 4MB). Each block is content-addressed (SHA-256 hash). If block already exists in storage, don't re-upload — just add a reference. This dramatically reduces storage costs for duplicates (e.g., multiple users have the same PDF).
- Delta sync: On file update, only changed blocks are uploaded, not the entire file. Client computes block hashes locally, sends only changed blocks. Critical for large files with small edits.
- Metadata service: File tree structure (hierarchical namespace), sharing permissions, version history stored in a relational DB (Spanner at Google scale for global consistency).
- Conflict resolution: Two clients editing the same file offline and syncing — Google Drive creates a copy with "conflicting changes". No automatic merge (unlike Git) to avoid data loss.
- Access control: Per-file and per-folder ACLs. Shared drives have different permission inheritance rules. Evaluate on every read/write operation.
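Block-level deduplication fits in a few lines; the upload helper and the 4-byte block size here are toy assumptions (real systems use multi-megabyte blocks and a durable block store rather than a dict):

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real systems use e.g. 4 MB

def upload(data: bytes, store: dict) -> list:
    """Split into fixed-size blocks, content-address each by SHA-256,
    and store only blocks not already present. Returns the block manifest."""
    manifest = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        key = hashlib.sha256(block).hexdigest()
        if key not in store:      # dedup: identical blocks are stored once
            store[key] = block
        manifest.append(key)
    return manifest

store = {}
m1 = upload(b"aaaabbbb", store)
m2 = upload(b"aaaacccc", store)  # shares the "aaaa" block with m1
# store now holds 3 unique blocks, not 4
```

Delta sync falls out of the same structure: the client recomputes block hashes locally and uploads only the blocks whose hashes changed.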
Q5: Design a Real-time Collaborative Editor (Google Docs)
This is one of the most technically challenging system design questions — because it requires distributed consensus on a shared mutable document.
Core algorithm: Operational Transformation (OT)
When two users concurrently modify the same document: User A inserts a character at position 5; User B deletes the character at position 3. After B's delete is applied, A's insertion point must be adjusted to position 4. OT defines transformation functions for all operation pairs (insert-insert, insert-delete, delete-delete) to ensure every replica converges to the same state.
- Architecture: Operations flow from client → WebSocket connection → operation server → OT transformation → broadcast to all clients → DB write. The server serializes operations, so the transformation base is always the "server document state."
- Cursor positions: Each user's cursor must be transformed the same way as document operations — B's delete shifts A's cursor position, which must be broadcast.
- Offline support: Queue operations locally. On reconnect, replay against server state with OT. Complex but critical for mobile users.
- Alternative: CRDT (Conflict-free Replicated Data Type): Mention as an alternative. CRDTs don't require a central server for transformation — operations commute mathematically. Used by Figma. Trade-off: higher complexity, larger operation metadata.
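The OT adjustment described above reduces to small pure functions. This sketch covers only single-character insert/delete pairs and is far simpler than a production OT engine, which also handles ranges, attributes, and undo:

```python
def transform_insert_against_delete(ins_pos, del_pos):
    """Adjust an insert position after a concurrent single-char delete.
    A delete before the insertion point shifts the point left by one."""
    return ins_pos - 1 if del_pos < ins_pos else ins_pos

def transform_insert_against_insert(pos_a, pos_b):
    """Adjust A's insert after B's concurrent insert at pos_b.
    Ties (equal positions) are broken here in B's favor for determinism."""
    return pos_a + 1 if pos_b <= pos_a else pos_a

# The example from the text: B deletes at position 3, so A's insert at 5 becomes 4.
adjusted = transform_insert_against_delete(5, 3)
```

Cursor positions are transformed with the same functions, which is why B's delete also shifts A's cursor.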
Q6: Design the Google Maps Navigation System
Key components:
- Map tile service: World map is divided into tiles at multiple zoom levels (Z0 = world in 1 tile; Z20 = building-level detail). Tiles are pre-rendered images, cached aggressively at CDN. 80% of map views hit cache — this is a read-heavy system that scales horizontally with CDN.
- Route computation: Dijkstra's on a road graph with hundreds of millions of nodes. In practice, A* with heuristics (Euclidean distance) for speed. Google uses Contraction Hierarchies — preprocessing road graph to add "shortcut" edges, enabling near-instantaneous routing even across continents.
- Real-time traffic: Aggregate GPS signals from Android phones (with user consent) to compute current speed per road segment. Weighted into edge costs in the routing graph. Updated every few minutes.
- ETA prediction: Historical travel time data + current traffic + ML model for event detection (accidents, road closures, events). RMSE of ETA predictions is a key product metric.
- Geocoding: Address → coordinates (forward geocoding), coordinates → address (reverse geocoding). Both are lookup-heavy — geospatial indexes (QuadTree, R-tree, Google S2 library) enable efficient spatial queries.
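A compact A* sketch over a toy road graph (Euclidean heuristic, plain numeric edge costs; real routing layers contraction hierarchies and live traffic weights on top of this core loop):

```python
import heapq
import math

def a_star(graph, coords, start, goal):
    """A* shortest path: graph[u] = [(v, cost), ...]; coords feed the
    Euclidean heuristic. Returns the total cost of the cheapest path."""
    def h(n):
        return math.dist(coords[n], coords[goal])  # admissible if costs >= distance

    best = {start: 0.0}                # cheapest known cost to each node
    frontier = [(h(start), start)]     # priority queue ordered by cost + heuristic
    while frontier:
        _, u = heapq.heappop(frontier)
        if u == goal:
            return best[u]
        for v, cost in graph.get(u, []):
            g = best[u] + cost
            if g < best.get(v, math.inf):
                best[v] = g
                heapq.heappush(frontier, (g + h(v), v))
    return math.inf

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": []}
coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0)}
shortest = a_star(graph, coords, "A", "C")  # A→B→C costs 2, direct A→C costs 4
```

Contraction hierarchies preprocess this same graph so most queries touch only "shortcut" edges instead of exploring node by node.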
Q7: Design a Global CDN (like Google's CDN)
Core concepts:
- Points of Presence (PoPs): Edge servers distributed globally. Anycast routing (or DNS-based geo-steering) directs each user request to the geographically nearest PoP, minimizing round-trip time (RTT).
- Cache hierarchy: L1 (edge PoP, small, fast) → L2 (regional cluster, larger, slightly slower) → Origin. On L1 miss, fetch from L2. On L2 miss, fetch from origin and populate both caches.
- Cache eviction: LRU for L1 (recency matters for hot content). LFU or Size-Aware LRU for L2 (frequency matters for regional content).
- Cache invalidation: Purge APIs, TTL expiry, surrogate keys (one purge key invalidates all related content). Propagation to all edge nodes must be fast — Google CDN uses a push-based invalidation model.
- TLS termination: Handled at edge nodes to minimize TLS handshake RTT. QUIC protocol (Google's invention, now HTTP/3 standard) further reduces connection establishment latency.
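The L1 → L2 → origin lookup path can be sketched with plain dicts standing in for cache tiers (no TTLs, eviction, or request coalescing; purely illustrative):

```python
def cdn_get(key, l1, l2, origin):
    """Look up `key` through the hierarchy: L1 edge -> L2 regional -> origin.
    A miss populates every layer on the way back down."""
    if key in l1:
        return l1[key], "l1"
    if key in l2:
        l1[key] = l2[key]          # promote to the edge
        return l1[key], "l2"
    value = origin[key]            # origin fetch on a full miss
    l2[key] = value
    l1[key] = value
    return value, "origin"

l1, l2, origin = {}, {}, {"/video.mp4": b"bytes"}
_, first = cdn_get("/video.mp4", l1, l2, origin)   # full miss, hits origin
_, second = cdn_get("/video.mp4", l1, l2, origin)  # now an edge hit
```

Invalidation in this model is a delete against l1 and l2; the push-based scheme mentioned above is about propagating that delete to every edge quickly.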
Q8: Design a Pub/Sub Messaging System (like Google Pub/Sub)
Key design decisions:
- At-least-once delivery: Messages are acknowledged by subscribers. Un-acked messages are redelivered after the ack deadline expires. Because redelivery can produce duplicates, consumers must process messages idempotently.
- Message ordering: Per-partition ordering within a topic. Global ordering requires single-partition (bottleneck). Specify ordering key to group related messages to the same partition.
- Subscription types: Push (Pub/Sub delivers to a webhook endpoint) vs Pull (subscriber polls for messages). Push is better for low-latency; Pull for batch processing workflows.
- Dead letter queues: Messages that fail delivery N times go to a dead letter topic for manual inspection. Critical for reliability.
- Storage layer: Messages must be durably stored before ACK is returned to publisher. Replicated write to N nodes (like Kafka's ISR). Retention period configurable (Google Pub/Sub: 7 days max).
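A toy model of at-least-once semantics: the Subscription class here is a simplification (not Google Pub/Sub's API), but it shows how an expired ack deadline turns into redelivery.

```python
import time

class Subscription:
    """At-least-once delivery: a pulled message becomes visible again
    if it is not acked before its ack deadline."""
    def __init__(self, ack_deadline=0.05):
        self.pending = []        # not yet pulled
        self.outstanding = {}    # msg_id -> (data, deadline)
        self.ack_deadline = ack_deadline

    def publish(self, msg_id, data):
        self.pending.append((msg_id, data))

    def pull(self):
        now = time.monotonic()
        # Messages whose deadline passed without an ack get redelivered.
        for mid, (data, deadline) in list(self.outstanding.items()):
            if now > deadline:
                del self.outstanding[mid]
                self.pending.append((mid, data))
        if not self.pending:
            return None
        mid, data = self.pending.pop(0)
        self.outstanding[mid] = (data, now + self.ack_deadline)
        return mid, data

    def ack(self, msg_id):
        self.outstanding.pop(msg_id, None)

sub = Subscription(ack_deadline=0.01)
sub.publish("m1", "hello")
first = sub.pull()       # delivered but never acked...
time.sleep(0.02)
second = sub.pull()      # ...so the same message is redelivered
```

The duplicate delivery here is exactly why the consumer side must be idempotent.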
Q9: Design a Key-Value Store (like Bigtable or DynamoDB)
Data model: (rowKey, columnFamily, columnQualifier, timestamp) → value. Bigtable's model is a sorted, sparse multidimensional map.
Core design decisions:
- LSM tree (Log-Structured Merge tree): Writes go to an in-memory MemTable (sorted). Flushed to disk as immutable SSTables when MemTable fills. Background compaction merges SSTables, reclaiming deleted space and maintaining sort order. Gives excellent write throughput.
- Bloom filters: Each SSTable has a Bloom filter. On read, check Bloom filter first — if key definitely not in table, skip disk read. Reduces read amplification significantly.
- Row key design: Bigtable rows are sorted by row key — this means key design directly impacts scan performance. Poor key design (e.g., sequential timestamps as key prefix) creates "hotspot" tablets. Solution: hash prefix or reverse the timestamp.
- Tablets (shards): Row key space is divided into tablets. Each tablet served by one node. Tablet server assignment managed by a master. On node failure, tablets reassigned — with minimal impact because data is in GFS/GCS, not on the tablet server.
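A minimal Bloom filter sketch showing the "definitely absent vs maybe present" check that lets reads skip SSTables (the bit count and hash count here are arbitrary toy values; real filters are sized from the expected key count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Bloom filter: answers 'definitely absent' or 'maybe present'.
    An SSTable read is skipped when its filter says the key is absent."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0            # integer used as a bit array

    def _positions(self, key):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("row#user123")
```

False positives are possible (a read may still miss on disk) but false negatives are not, which is what makes the skip safe.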
Q10: Design a Notification System
Key design decisions:
- Multi-channel delivery: Push notifications (iOS APNs, Android FCM), Email (SMTP via SendGrid/SES), SMS (Twilio), In-app notifications. Each channel has its own rate limits and reliability profile.
- Fan-out challenge: A celebrity posts — 50 million followers must be notified. Approaches: push fan-out (write to each follower's inbox immediately — fast but expensive), pull fan-out (followers pull a feed on login — cheap but stale), or hybrid (push to active users, pull for dormant).
- Deduplication: Retries and multi-device delivery can produce duplicates of the same logical notification. Use idempotency keys at the dispatch layer so each notification is dispatched at most once per target.
- Rate limiting per user: Never send more than N notifications per hour per user — protect from notification spam. Per-user token bucket.
- Delivery tracking: Sent → Delivered → Opened events. Stored as event log, used for engagement analytics and retry logic.
Q11: Design a Distributed Job Scheduler
Requirements clarification to make: One-time jobs vs recurring (cron)? Best-effort vs guaranteed execution? At-most-once vs at-least-once? Max job duration?
Core architecture:
- Job store: Persistent storage of job definitions and next-execution times. Indexed on next_run_time. Partitioned by time bucket for efficient querying.
- Leader election: Multiple scheduler nodes, one leader at a time (ZooKeeper or etcd distributed lock). Leader polls job store for jobs due in the next N seconds, assigns to workers.
- Worker pool: Stateless workers pull jobs from a work queue (Kafka/SQS). Heartbeat to leader — if worker dies mid-job, leader reassigns the job after timeout.
- Exactly-once semantics: Hard problem. Approach: optimistic locking on job execution row — worker claims job by updating status from SCHEDULED to RUNNING with its worker ID. If two workers race, only one wins the CAS. On job completion, update to COMPLETED.
- Monitoring: Job SLA violations (jobs not starting within X seconds of scheduled time), job failure rates, worker pool utilization — all must be observable.
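The SCHEDULED → RUNNING claim can be modeled as a compare-and-set. This in-memory sketch uses a lock to stand in for the database's row-level atomicity (in production this would be a conditional UPDATE or CAS on the job row):

```python
import threading

class JobStore:
    """Optimistic claim: the SCHEDULED -> RUNNING transition succeeds for
    exactly one worker, mimicking a compare-and-set on the job row."""
    def __init__(self):
        self.jobs = {}                 # job_id -> {"status": ..., "worker": ...}
        self._lock = threading.Lock()  # stands in for DB-level atomicity

    def add(self, job_id):
        self.jobs[job_id] = {"status": "SCHEDULED", "worker": None}

    def try_claim(self, job_id, worker_id):
        with self._lock:
            job = self.jobs[job_id]
            if job["status"] != "SCHEDULED":
                return False           # another worker won the race
            job["status"] = "RUNNING"
            job["worker"] = worker_id
            return True

store = JobStore()
store.add("job-42")
wins = [store.try_claim("job-42", w) for w in ("worker-a", "worker-b")]
```

Only the first claim succeeds; the loser moves on to the next job rather than executing a duplicate.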
Q12: Design Google Analytics (Real-time Event Tracking)
Key components:
- Event collection: Client-side JS beacon sends events (pageviews, clicks, custom events) to a lightweight collection endpoint. Use HTTP/1.1 POST with batching, or a pixel (1×1 GIF) for cross-domain tracking.
- Ingestion pipeline: Events land in Pub/Sub → consumed by a streaming processor (Dataflow/Flink). Real-time aggregation for live dashboards (sessions active now). Batch processing for historical reports.
- Lambda architecture: Speed layer (real-time approximate counts in Redis) + Batch layer (exact counts in BigQuery). Reports query the batch layer for historical data; dashboards query the speed layer for current data.
- Session reconstruction: Session = sequence of events from same user within 30-minute inactivity window. Implemented as stateful stream processing — maintain session state per user, flush when inactive.
- Sampling: At Google Analytics scale, processing every event is cost-prohibitive. Sampling at 10% (process 1 in 10 events, multiply by 10 in reports) preserves acceptable statistical accuracy at roughly a tenth of the compute cost.
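Session reconstruction with the 30-minute inactivity window, sketched here as a simple batch function (a real pipeline would implement the same logic as stateful stream processing with per-user timers):

```python
SESSION_GAP = 30 * 60  # 30-minute inactivity window, in seconds

def sessionize(events):
    """Group (user_id, timestamp) events into sessions: a gap longer than
    SESSION_GAP between a user's consecutive events starts a new session."""
    sessions = {}   # user_id -> list of sessions (each a list of timestamps)
    last_seen = {}
    for user, ts in sorted(events, key=lambda e: e[1]):
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP:
            sessions.setdefault(user, []).append([ts])   # new session
        else:
            sessions[user][-1].append(ts)                # continue session
        last_seen[user] = ts
    return sessions

events = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
sessions = sessionize(events)
# u1: events at 0 and 600 share a session; 4000 starts a new one (gap > 1800)
```

The streaming version keeps last_seen as keyed state and flushes a session when the inactivity timer fires, rather than sorting a closed batch.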
Q13: Design a Ticket Booking System (Flash Sale / High Contention)
The core challenge: 100,000 users simultaneously trying to book 500 remaining concert tickets. Must prevent overselling. Must be fast. Must be fair.
Key design decisions:
- Inventory reservation: Two-phase hold — (1) Optimistically reserve a seat (decrement inventory counter using Redis atomic DECR). (2) User completes payment within 5 minutes. If payment fails or times out, release the reservation. DECR is atomic — no race condition on the counter.
- Virtual waiting queue: During flash sale peak, route users to a virtual queue instead of directly to booking. Issue queue tokens, serve in FIFO order. This turns a thundering herd into a controlled stream.
- Idempotency: Double submit protection — if user submits payment twice, ensure only one booking is created. Idempotency key = unique checkout session ID.
- Database: For seat-level booking (a specific seat in a stadium), row-level locking in PostgreSQL works fine at moderate scale. For high-contention general-admission inventory ("any seat in Section A"), Redis counters are faster.
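The reserve/release flow can be sketched as an atomic decrement-with-floor. The lock here stands in for Redis's single-threaded atomic DECR, and the 5-minute hold expiry is reduced to an explicit release call:

```python
import threading

class Inventory:
    """Atomic decrement-with-floor: the counter never goes below zero,
    so the system cannot oversell."""
    def __init__(self, count):
        self.count = count
        self._lock = threading.Lock()  # stands in for Redis's atomicity

    def reserve(self):
        with self._lock:
            if self.count <= 0:
                return False           # sold out
            self.count -= 1
            return True

    def release(self):
        # Payment failed or the hold timed out: seat goes back on sale.
        with self._lock:
            self.count += 1

tickets = Inventory(500)
results = [tickets.reserve() for _ in range(502)]  # 502 attempts, 500 seats
tickets.release()   # one hold expires unpaid
```

Exactly 500 reservations succeed no matter how the 502 attempts interleave; that is the whole point of making the check-and-decrement a single atomic step.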
Q14: Design a Search Autocomplete System
The core data structure: Trie (Prefix Tree)
- Each node represents a character. Traversing from root to node spells out a prefix.
- At each node, store top-K most frequent completions (precomputed during index build).
- On query "go", traverse to "g" → "o", return stored top-K completions = ["google", "google maps", "gmail", ...]
At Google scale:
- Trie is too large for a single machine. Shard by first 2–3 characters of prefix (all "go*" queries go to one shard).
- Precomputation: Batch job daily aggregates search logs, computes top-K completions per prefix, rebuilds the trie. Real-time trending terms (e.g., breaking news) need a separate hot-update mechanism.
- Personalization: Blend global top-K with user's personal search history using a personalization layer after trie lookup.
- Latency SLA: Autocomplete must respond in <100ms to feel instant. Trie lookups are O(prefix length) — very fast. Network latency to the trie service is the dominant cost.
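A toy trie with per-node top-K completions precomputed at build time, as described above (frequency handling is simplified: queries are inserted in descending-frequency order, so each node's list is already ranked):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top_k = []   # precomputed top-K completions for this prefix

class AutocompleteTrie:
    """Trie where every node stores its top-K completions, built from
    (query, frequency) pairs at index-build time."""
    def __init__(self, queries, k=3):
        self.root = TrieNode()
        # Insert highest-frequency queries first so each node's
        # top_k list fills up in ranked order.
        for query, _freq in sorted(queries, key=lambda q: -q[1]):
            node = self.root
            for ch in query:
                node = node.children.setdefault(ch, TrieNode())
                if len(node.top_k) < k:
                    node.top_k.append(query)

    def complete(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.top_k

trie = AutocompleteTrie([("google", 100), ("google maps", 60),
                         ("gmail", 80), ("go", 10)])
suggestions = trie.complete("go")
```

Serving is then an O(prefix-length) walk plus a constant-size list read, which is why the network hop, not the lookup, dominates latency.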
Q15: Design a Distributed Cache (like Memcached at Google Scale)
Why not just use Redis? At Google scale (billions of QPS), you need custom solutions. Facebook's Memcached paper is the canonical reference — worth reading.
Key design decisions:
- Consistent hashing for sharding: Client-side consistent hashing determines which cache node holds a given key. Adding/removing nodes only rebalances a fraction of keys.
- Thundering herd on cache miss: If a popular key expires, thousands of simultaneous requests miss and all query the DB. Solutions: (1) Probabilistic early expiry (randomly refresh entries shortly before the actual TTL), (2) Cache locking (only one request rebuilds; others wait), (3) Stale-while-revalidate (serve stale content while one background request refreshes).
- Write policy: Cache-aside (application manages cache) is standard. Write-through adds latency. Write-behind risks data inconsistency.
- Invalidation: On DB write, invalidate (delete) the corresponding cache key. On next read, cache miss → DB read → repopulate. Never write directly to cache on mutation — this causes a race condition between invalidation and repopulation.
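A consistent hashing ring with virtual nodes, sketched in Python (the vnode count and the SHA-256 choice are illustrative; the property that matters is that adding a node remaps only the keys near its ring positions):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: each key maps to the first
    node clockwise from its hash on the ring."""
    def __init__(self, nodes, vnodes=100):
        self.ring = []   # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node, vnodes)

    def _hash(self, key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_node(self, node, vnodes=100):
        for i in range(vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def get_node(self, key):
        h = self._hash(key)
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.get_node("user:12345")   # deterministic for a given ring
```

With plain modulo hashing, adding a node would remap nearly every key (a cache stampede); with the ring, only roughly 1/N of keys move.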
How to Prepare for Google System Design
- Read the Google Systems Papers: MapReduce, Bigtable, Spanner, Chubby, Borg, Google File System. These are publicly available — Google expects SDE-3 candidates to be familiar with them.
- Practice the structured approach: Requirements → Scale estimation → High-level design → Deep dive → Trade-offs. Time yourself — you have 60 minutes.
- Think in Google's vocabulary: GCS (not S3), Bigtable/Spanner (not DynamoDB/RDS), Pub/Sub (not Kafka), Dataflow (not Spark), Borg/GKE (not ECS/EKS). Mapping to Google's stack signals familiarity.
- Practice out loud: System design is a spoken exercise. Use MockExperts' AI mock interview platform to simulate full 60-minute Google system design rounds with interactive probing and feedback.
📋 Legal Disclaimer
Educational Purpose: This article is published solely for educational and informational purposes to help candidates prepare for technical interviews. It does not constitute professional career advice, legal advice, or recruitment guidance.
Nominative Fair Use of Trademarks: Company names, product names, and brand identifiers (including but not limited to Google, Meta, Amazon, Goldman Sachs, Bloomberg, Pramp, OpenAI, Anthropic, and others) are referenced solely to describe the subject matter of interview preparation. Such use is permitted under the nominative fair use doctrine and does not imply sponsorship, endorsement, affiliation, or certification by any of these organisations. All trademarks and registered trademarks are the property of their respective owners.
No Proprietary Question Reproduction: All interview questions, processes, and experiences described herein are based on community-reported patterns, publicly available candidate feedback, and general industry knowledge. MockExperts does not reproduce, distribute, or claim ownership of any proprietary assessment content, internal hiring rubrics, or confidential evaluation criteria belonging to any company.
No Official Affiliation: MockExperts is an independent AI-powered interview preparation platform. We are not officially affiliated with, partnered with, or approved by Google, Meta, Amazon, Goldman Sachs, Bloomberg, Pramp, or any other company mentioned in our content.