This repository contains notes, diagrams, and code snippets created while learning system design concepts. It's intended as a personal reference for revision and interview preparation. Topics include high-level design patterns, scalability principles, common system architectures, and more.
- Concepts
- Additional Concepts
- System Design Problems
- Resources
- Articles
- How to Use This Repo
- Contributing
- Glossary
- Key Differences
- Scalability: Basics
- Load Balancing
- Databases
- Consistency in Distributed Systems
- Caches
- Networks
- Data Replication and Migration
- Security in Distributed Systems
- Observability in Distributed Systems
- Distributed Consensus
- Rate Limiting
- Tradeoffs
- Email Service
- Tinder Design - Recommendation
- Whatsapp Design - Chat Application
- Google Docs - Collaborative Editor Design
- Uber - Cab Aggregator App
- Workflow Management and Recommendation
- Whatsapp Calling
- Design a Rate Limiter
- Design a Consistent Hashing
- Design a URL shortener
- Design a Key Value Store
- Design an Unique ID generator
- Design a Web Crawler
- Design a Notification System
- Design a News Feed System
A collection of materials referred to while learning:
- System Design Course by Gaurav Sen
- System Design Primer
- System Design Interview - Alex Xu (Part 1)
- Browse by topic directories or use the table of contents above.
- Diagrams are stored under
/diagramsand topic-specific folders. - Use markdown files as revision notes before interviews.
- Refer to example architectures for designing systems in projects.
This is a personal learning repo, but feel free to open issues or PRs if you spot something helpful to add.
- Latency
- Throughput
- CAP Theorem (Brewer's Theorem)
- ACID Properties
- Consistent Hashing
- Sharding
- FFMPEG
- AVI
- Homebrew
- HLS
- DASH
- PSTN (Public Switched Telephone Network)
- VoIP (Voice over Internet Protocol)
- SIP (Session Initiation Protocol)
- NTP (Network Time Protocol)
- Redis Lua Scripts
Latency refers to the time it takes for a request to travel from its point of origin to its destination and receive a response. It combines a number of delays - Response times, transmission, and processing time.
The rate at which a system, process, or network can move data or carry out operations in a particular period of time is referred to as throughput. Bits per second (bps), bytes per second, transactions per second, etc. are common units of measurement. It is computed by dividing the total number of operations or objects executed by the time taken.
In a distributed computer system, you can only support two of the following guarantees:
Consistency- Every read receives the most recent write or an error (All nodes see the same data at the same time.)Availability- Every request receives a response, without guarantee that it contains the most recent version of the information (The system is always responsive.)Partition Tolerance- The system continues to operate despite network failures or delays. (No data loss or crash due to communication break between nodes.)
Networks aren't reliable, so you'll need to support partition tolerance. A distributed system must be partition-tolerant (P). So, a practical tradeoff is between Consistency (C) and Availability (A).
- Waiting for a response from the partitioned node might result in a timeout error.
- CP is a good choice if your business needs require atomic reads and writes.
- Banking systems prefer CP
- Responses return the most readily available version of the data available on any node, which might not be the latest.
- Writes might take some time to propagate when the partition is resolved.
- social media feeds prefer AP.
- Example
- If there are 3 servers (n1, n2, n3) and a client wants to write a key-value pair (k1, v1). Lets say k1 is written to n3, but not propagated yet to n1 and n2.
- Now, if n3 goes down and a client wants to read k1.
- In
CPsystem, the client will not get a response because it cannot guarantee consistency. It will wait until n3 is back up. - In
APsystem, the client will get a response from n1 or n2, but it may not be the most recent value of k1.
- In
- For a key-value store, we can choose either
CPorAPbased on the use case. - Choosing between AP and CP depends on the specific requirements of the application.
- For example:
- A banking application would prefer
CPto ensure data consistency. - A social media application would prefer
APto ensure availability.
- A banking application would prefer
ACID is a set of properties that ensure reliable, consistent, and safe transactions in a database system.
| Property | Ensures... | Example | Use Case |
|---|---|---|---|
| Atomicity | All or none execution | Debit fails → entire txn rolled back | Bank transfers |
| Consistency | Valid state transitions | No violation of constraints | E-commerce inventory updates |
| Isolation | No interference between transactions | Prevents dirty reads/race conditions | Online booking systems |
| Durability | Changes survive system failures | Data saved after commit | Order confirmations in retail apps |
Consistent Hashing maps servers to the key space and assigns requests(mapped to relevant buckets, called load) to the next clockwise server. Servers can then store relevant request data in them while allowing the system flexibility and scalability.
-
Database sharding is a horizontal scaling technique used to split a large database into smaller, independent pieces called shards.
-
These shards are then distributed across multiple servers or nodes, each responsible for handling a specific subset of the data.
-
The sharding process involves several key components including:
- Sharding Key: The shard key is a unique identifier used to determine which shard a particular piece of data belongs to. It can be a single column or a combination of columns.
- Data Partitioning: Partitioning involves splitting the data into shards based on the shard key. Each shard represents a portion of the total data set. Common strategies to partition database are range-based, hash-based, and directory-based sharding.
- Shard Mapping: Creating a mapping of shard keys to shard locations.
- Shard Management: The shard manager oversees the distribution of data across shards, ensuring data consistency and integrity.
- Query Routing: The query router intercepts incoming queries and directs them to the appropriate shard(s) based on the shard key.
-
Sharding Strategies
- Hash-Based Sharding: Data is distributed using a hash function, which maps data to a specific shard.
- Range-Based Sharding: Data is distributed based on a range of values, such as dates or numbers.
- Geo-Based Sharding: Data is distributed based on geographic location, allowing for improved performance and reduced latency for users in different regions.
- Directory-Based Sharding: A lookup table is used to map data to specific shards.
FFmpeg is a powerful, free, and open-source multimedia framework used for decoding, encoding, transcoding, and streaming audio and video files.
AVI, or Audio Video Interleave, is a multimedia container format developed by Microsoft to store both audio and video data in a single file, allowing for synchronized playback.
Homebrew is a package manager for macOS (and Linux) that simplifies the installation, updating, and management of software on your system.
- HLS, or HTTP Live Streaming, is a widely used video streaming protocol developed by Apple that delivers audio and video content over the internet.
- It's known for its adaptability to changing network conditions, reliability, and compatibility with various devices and browsers.
- HLS works by breaking down video files into smaller segments, which are then downloaded and played sequentially by a video player.
- Dynamic Adaptive Streaming over HTTP (DASH) is a streaming technology that adapts video quality in real-time based on network conditions, delivering high-quality video over the internet using standard HTTP servers.
- It achieves this by breaking down video content into segments, each with varying bitrates, and allowing the client device to dynamically select the most appropriate segment for the current network bandwidth
- Traditional circuit-switched telephone network.
- Requires telephone lines and dedicated bandwidth per call.
- Cost depends on distance and duration.
- Generally reliable but expensive.
- Example: Landline phone systems.
- Transmits voice calls over the internet using packet-switching.
- Requires only internet connectivity (no phone lines).
- Cost-effective and often free.
- Scalable and flexible, but quality depends on internet speed.
- Example: Skype, Zoom, WhatsApp calls.
- Signaling protocol used to establish, manage, and terminate VoIP calls.
- Handles call setup, teardown, and modifications.
- Works with other protocols (e.g., RTP for media transmission).
- Used for voice, video, and messaging sessions over IP.
- Example: Used in VoIP softphones and enterprise telephony systems.
- Protocol for synchronizing clocks of computer systems over packet-switched, variable-latency data networks.
- Uses a hierarchical system of time sources, with primary servers connected to highly accurate reference clocks.
- Ensures accurate timekeeping for applications like logging, authentication, and scheduling.
- Example: Synchronizing server times in data centers.
- Lua = lightweight, high-performance, embeddable scripting language.
- Designed to be fast, portable, and simple.
- Often used in:
- Game engines (e.g., Roblox, World of Warcraft mods).
- Embedded systems.
- Databases (e.g., Redis) for scripting logic.
- In System Design, Redis supports Lua scripting to run custom logic atomically inside Redis.
- Instead of multiple round-trips (client ↔ Redis), you can:
- Upload Lua script.
- Execute on server in one step.
- Why Lua Scripts:
- Ensures atomicity: script executes as a single operation.
- Reduce network overhead: One call instead of multiple client-server requests.
- Example:
- Say you want to check if a key exists, and if not, set it.
- Without Lua:
- Client does GET key.
- If not exists → does SET key value.
- Problem: race condition (another client may set in between).
- With Lua:
if redis.call("EXISTS", KEYS[1]) == 0 then return redis.call("SET", KEYS[1], ARGV[1]) else return nil end
- Call from client
EVAL "if redis.call('EXISTS', KEYS[1]) == 0 then return redis.call('SET', KEYS[1], ARGV[1]) else return nil end" 1 myKey myValue