1. TiDB Basic Architecture
1.1 Articles on TiDB Architecture and Basic Principles:
2. Advanced Learning Materials
2.1 Kernel
The TiDB kernel includes all components of TiDB and TiKV. Contributors can improve TiDB’s performance, stability, usability, and more through kernel projects.
2.1.1 TiDB
Module Analysis
TiDB is the SQL layer of the cluster, responsible for client communication (protocol layer), syntax parsing (SQL Parser), query optimization (Optimizer), and executing query plans.
Protocol Layer
TiDB’s protocol is compatible with MySQL. For details on the MySQL protocol, refer to MySQL Protocol.
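As a concrete illustration of what the protocol layer deals with first: every MySQL packet begins with a 4-byte header, a 3-byte little-endian payload length followed by a 1-byte sequence number. The helper below is a standalone sketch, not TiDB's actual code.

```go
package main

import "fmt"

// parseHeader decodes the 4-byte MySQL packet header: bytes 0-2 are the
// payload length in little-endian order, byte 3 is the sequence number.
func parseHeader(b [4]byte) (length int, seq byte) {
	length = int(b[0]) | int(b[1])<<8 | int(b[2])<<16
	seq = b[3]
	return
}

func main() {
	// Header announcing a 5-byte payload with sequence number 0.
	length, seq := parseHeader([4]byte{0x05, 0x00, 0x00, 0x00})
	fmt.Println(length, seq) // 5 0
}
```

Payloads larger than 16 MB are split across multiple packets, which is why the length field is only 3 bytes.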
SQL Parser
TiDB’s Parser is divided into two parts: the Lexer, written manually in Golang, and the Parser, implemented using goyacc. For an in-depth understanding of the Parser, read TiDB Source Code Reading Series (5): Implementation of TiDB SQL Parser. For a deeper understanding of yacc syntax, read “flex & bison”.
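To give a feel for the hand-written Lexer style, here is a toy tokenizer for a tiny SQL subset. It is purely illustrative and far simpler than TiDB's real lexer, but shows the same character-by-character scanning approach.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// TokenKind classifies tokens in this toy lexer.
type TokenKind int

const (
	Keyword TokenKind = iota
	Ident
	Number
	Symbol
)

type Token struct {
	Kind TokenKind
	Lit  string
}

var keywords = map[string]bool{"SELECT": true, "FROM": true, "WHERE": true}

// Lex scans ASCII input one character at a time, grouping letters into
// keywords/identifiers and digits into numbers; everything else that is
// not whitespace becomes a single-character symbol.
func Lex(input string) []Token {
	var toks []Token
	i := 0
	for i < len(input) {
		c := rune(input[i])
		switch {
		case unicode.IsSpace(c):
			i++
		case unicode.IsLetter(c):
			j := i
			for j < len(input) && (unicode.IsLetter(rune(input[j])) || unicode.IsDigit(rune(input[j]))) {
				j++
			}
			word := input[i:j]
			if keywords[strings.ToUpper(word)] {
				toks = append(toks, Token{Keyword, strings.ToUpper(word)})
			} else {
				toks = append(toks, Token{Ident, word})
			}
			i = j
		case unicode.IsDigit(c):
			j := i
			for j < len(input) && unicode.IsDigit(rune(input[j])) {
				j++
			}
			toks = append(toks, Token{Number, input[i:j]})
			i = j
		default:
			toks = append(toks, Token{Symbol, string(c)})
			i++
		}
	}
	return toks
}

func main() {
	for _, t := range Lex("SELECT id FROM t WHERE id = 42") {
		fmt.Printf("%d:%s ", t.Kind, t.Lit)
	}
	fmt.Println()
}
```

The goyacc-generated Parser then consumes a token stream like this and builds the AST according to the grammar rules.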
Schema Management & DDL
TiDB implements Google F1’s online schema change algorithm. For TiDB’s implementation, refer to Asynchronous Schema Change Implementation in TiDB. For source code analysis, refer to TiDB Source Code Reading Series (17): DDL Source Code Analysis.
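The heart of the F1 algorithm is moving a schema object (say, a new index) through intermediate states so that any two adjacent schema versions coexisting in the cluster remain compatible. The sketch below models just the state progression; names are illustrative, loosely following the paper's delete-only/write-only terminology.

```go
package main

import "fmt"

// SchemaState is one step in an online schema change. The object
// advances one state per schema-version bump; because at most two
// adjacent versions are live at once, each single transition is safe.
type SchemaState int

const (
	StateNone        SchemaState = iota
	StateDeleteOnly              // new index sees deletes only
	StateWriteOnly               // new index sees writes, not yet readable
	StateWriteReorg              // backfill existing rows into the index
	StatePublic                  // index fully visible to readers
)

var stateNames = [...]string{"none", "delete only", "write only", "write reorganization", "public"}

func (s SchemaState) String() string { return stateNames[s] }

// Next advances one state; Public is terminal.
func (s SchemaState) Next() SchemaState {
	if s < StatePublic {
		return s + 1
	}
	return s
}

func main() {
	for s := StateNone; ; s = s.Next() {
		fmt.Println(s)
		if s == StatePublic {
			break
		}
	}
}
```

Dropping an index walks the same states in reverse, ending with the actual data deletion.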
Expressions
The article Refactoring Built-in Functions for TiDB describes how to rewrite or add built-in functions for TiDB using the new computation framework.
SQL Optimizer
Most of the SQL optimizer logic is in the plan package. Related modules include statistics and range calculation. Statistics are described in the next chapter. Range calculation is in the ranger package. Refer to the following articles:
Execution Engine
The executor consists of two parts: one on the TiDB side and the other on the TiKV (or Mock-TiKV) side. Mock-TiKV is mainly used for unit testing and partially implements TiKV logic.
This part processes data strictly according to the requirements of the physical operators and produces the results. The article MPP and SMP in TiDB introduces part of the executor’s architecture. For source code analysis, refer to:
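As background for those reads: the executor follows the Volcano iterator model, where physical operators form a tree and each call to Next pulls data from the child operator. TiDB's real executors are vectorized (one Next call fills a whole Chunk of rows); the sketch below keeps one row per call for clarity and uses illustrative names.

```go
package main

import "fmt"

// Executor is a Volcano-style operator: Next returns one row, or
// ok == false when the operator is exhausted.
type Executor interface {
	Next() (row int, ok bool)
}

// TableScan yields rows from an in-memory "table".
type TableScan struct {
	rows []int
	pos  int
}

func (t *TableScan) Next() (int, bool) {
	if t.pos >= len(t.rows) {
		return 0, false
	}
	r := t.rows[t.pos]
	t.pos++
	return r, true
}

// Selection filters its child's rows with a predicate, showing how
// physical operators compose into a tree.
type Selection struct {
	child Executor
	pred  func(int) bool
}

func (s *Selection) Next() (int, bool) {
	for {
		r, ok := s.child.Next()
		if !ok {
			return 0, false
		}
		if s.pred(r) {
			return r, true
		}
	}
}

func main() {
	// Plan tree: Selection(r > 3) -> TableScan{1, 5, 8, 3}
	plan := &Selection{
		child: &TableScan{rows: []int{1, 5, 8, 3}},
		pred:  func(r int) bool { return r > 3 },
	}
	for r, ok := plan.Next(); ok; r, ok = plan.Next() {
		fmt.Println(r) // prints 5, then 8
	}
}
```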
TiKV Client
The TiKV Client is the module in TiDB responsible for interacting with TiKV, including two-phase commit and Coprocessor interactions. There are several TiKV Client implementations; the Golang version is located in store/tikv.
Here are two simple examples of how to call the ticlient’s KV interface: benchkv, benchrawkv.
Another version is in Java, located in tikv-client-lib-java, mainly used for the TiSpark project.
Rust-client is in client-rust, currently not very feature-rich, but worth trying.
Source Code Analysis:
Distributed Transactions
TiDB supports distributed transactions based on the Percolator model, an optimized two-phase commit (2PC) algorithm. For TiDB’s implementation, refer to Transaction in TiDB. The original Percolator is an optimistic transaction protocol. In version 3.0, TiDB introduced pessimistic transactions as an experimental feature. For implementation details, refer to New Features in TiDB: Pessimistic Transactions.
For TiDB’s transaction logic, refer to TiDB Source Code Reading Series (19): tikv-client (Part 2).
TiDB Read/Write Code Main Process
https://asktug.com/t/topic/752793
Common Issues with TiDB (Local Deployment, Performance Tuning, etc.)
Testing Cluster Performance
How to Test TiDB with Sysbench
How to Perform TPC-C Testing on TiDB
TiDB/TiKV Performance Tuning
Understanding TiDB Execution Plans introduces how to use the EXPLAIN statement to understand how TiDB executes a query.
Overview of SQL Optimization Process introduces several optimizations that TiDB can use to improve query performance.
Controlling Execution Plans introduces how to control execution plan generation, which is useful when the plan TiDB chooses is suboptimal.
TiDB Memory Tuning
TiDB Performance Tuning (Video)
TiKV Thread Pool Performance Tuning
Tuning Tools
To identify TiKV’s performance bottlenecks, we first use profiling tools. Here are two commonly used profiling tools for reference:
2.1.2 TiKV
TiKV is the underlying storage for TiDB. Here is a series of articles that deeply introduce TiKV principles: Deep Dive Into TiKV and TiKV Source Code Analysis Series.
TiKV is internally divided into multiple layers, each with its own functions, arranged from bottom to top:
- RocksDB
- Raft
- Raft KV
- MVCC
- TXN KV
- Coprocessor
RocksDB is a single-node storage engine that stores all data in TiKV and provides a single-node storage KV API. Details.
Raft is a consensus algorithm representing the consistency layer in TiKV, ensuring state consistency among TiKV nodes. It is the cornerstone of TiDB’s high availability. Details.
RaftKV combines RocksDB and Raft, providing a distributed, strongly consistent KV API.
MVCC provides the multi-version concurrency control API and the transaction API. It implements multi-versioning and transactions by encoding a timestamp into each key.
TXN KV combines RaftKV and MVCC, providing distributed transactions and multi-version concurrency control. TiDB calls its API.
Coprocessor handles some operators pushed down by TiDB, performing part of the computation logic closer to the data. Coprocessor is above the RaftKV and MVCC layers. TiDB converts queries into a DAG containing pushed-down expressions, and the Coprocessor computes data in TiKV based on these expressions, returning the results to TiDB.
Placement Driver (PD)
PD is a logical single point in the cluster, similar to master servers or meta servers in many systems. PD’s internal structure is a composite of various functions. For a brief introduction, refer to “Best Practices for PD Scheduling Strategies”. For source code-based introductions, refer to TiKV Source Code Analysis Series - Placement Driver and TiKV Source Code Analysis - PD Scheduler.
embed etcd
etcd is a distributed KV database based on Raft, similar to a simplified version of TiKV with only one Raft group and limited data storage.
PD uses etcd’s embed feature to embed etcd as a library within the same process, binding the HTTP and gRPC services required by PD to the ports listened to by etcd.
Multiple PDs use etcd’s interface to elect a leader to provide services. When the leader fails, other PDs elect a new leader.
Meta Management and Routing
PD manages metadata, including global configurations (clusterID, replica count, etc.), TiKV node registration and deregistration, and region creation.
When TiDB needs to access a region’s data, it first queries PD for the region’s status and location. TiKV instances synchronize metadata to PD via heartbeats. PD caches the metadata locally and organizes it into a structure for efficient routing services.
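A routing cache of this kind boils down to a binary search over regions sorted by start key, so most requests never need to ask PD at all. A minimal sketch with hypothetical field names:

```go
package main

import (
	"fmt"
	"sort"
)

// Region describes the key range [Start, End) a region covers and the
// store holding its leader — a simplified form of the metadata PD
// returns. Field names here are illustrative.
type Region struct {
	Start, End string // End == "" means unbounded on the right
	Leader     string
}

// locate finds the region owning key by binary search over regions
// sorted by Start (the first region must start at "").
func locate(regions []Region, key string) Region {
	i := sort.Search(len(regions), func(i int) bool {
		return regions[i].Start > key
	})
	return regions[i-1]
}

func main() {
	regions := []Region{
		{Start: "", End: "g", Leader: "tikv-1"},
		{Start: "g", End: "p", Leader: "tikv-2"},
		{Start: "p", End: "", Leader: "tikv-3"},
	}
	fmt.Println(locate(regions, "m").Leader) // tikv-2
}
```

When a region has split or its leader has moved, the cached entry goes stale; the real client detects this from the error returned by TiKV and refreshes the entry from PD.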
Timestamp Allocation
The distributed transaction model requires globally ordered timestamps. The PD leader is responsible for allocating timestamps (TS), persisting allocation progress to etcd so that timestamps remain monotonically increasing even across leader switches.
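These timestamps are hybrid: a physical part (wall-clock milliseconds) composed with an 18-bit logical counter, so many timestamps can be handed out within one millisecond. The sketch below shows only the composition; it omits the etcd persistence and the handling of counter exhaustion that the real allocator needs.

```go
package main

import (
	"fmt"
	"time"
)

// logicalBits is the width of the logical counter packed into the low
// bits of each timestamp.
const logicalBits = 18

// allocator hands out monotonically increasing hybrid timestamps.
type allocator struct {
	physical int64 // milliseconds
	logical  int64
}

// next returns physical<<18 | logical, bumping the logical counter and
// advancing the physical part whenever the wall clock has moved on.
func (a *allocator) next() int64 {
	now := time.Now().UnixMilli()
	if now > a.physical {
		a.physical, a.logical = now, 0
	}
	a.logical++
	return a.physical<<logicalBits | a.logical
}

func main() {
	a := &allocator{}
	t1, t2 := a.next(), a.next()
	fmt.Println(t2 > t1) // true
}
```

Persisting a window of pre-allocated physical time to etcd is what lets a newly elected leader resume allocation above every timestamp the old leader could have issued.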
Scheduling
Scheduling mainly involves two aspects: replica management and load balancing. Replica management maintains the desired number of replicas and satisfies constraints like isolating multiple replicas or distributing them to specific namespaces.
Load balancing adjusts region leader or peer positions to balance the load, with various strategies for different business scenarios.
Consistency Protocol (Raft)
“Consensus” is a core problem in distributed systems. Since the Paxos algorithm was proposed in 1990, message-passing consensus protocols have become mainstream, and Raft is one of them.

In a Raft cluster, each node is in one of three states: Leader, Follower, or Candidate. A node starts in the Follower state, and a new Leader is chosen through Leader Election. Raft uses the concept of a term to avoid split-brain scenarios, ensuring at most one Leader per term. Once a Leader is elected, all read and write requests go through it.

The write path is called Log Replication: the Leader persists the client’s write to its own log and replicates the log to the other replicas (Followers). Only after a majority of replicas confirm does the Leader acknowledge the write to the client. Raft then applies the log entry to its state machine, a step called Apply. The same mechanism handles changes to the Raft cluster configuration, such as adding or removing nodes. Classic Raft serves read requests the same way as writes, though optimizations such as read index and read lease are possible.
TiKV uses the Raft protocol to encapsulate a KV engine on top of RocksDB. This engine synchronizes writes to multiple replicas, ensuring data consistency and availability even if some nodes fail.
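The majority rule at the heart of Log Replication can be sketched as: sort the replicas' match indexes and take the median, which is the highest log index already present on a quorum. Illustrative only, not TiKV's code.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a majority
// of nodes — the condition a Raft leader checks before acknowledging a
// write. matchIndex[i] is how far node i's log matches the leader's
// (the leader's own log included).
func commitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// With n nodes, the entry at position (n-1)/2 (ascending) is
	// present on at least ceil((n+1)/2) nodes, i.e. a majority.
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// 5 nodes: indexes 7 and 9 are on only two nodes; 5 is on three,
	// a majority, so entries up to index 5 are committed.
	fmt.Println(commitIndex([]int{5, 3, 7, 3, 9})) // 5
}
```

(A full implementation must additionally check that the entry at that index belongs to the leader's current term before committing it.)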
Storage Engine (LSM Tree & RocksDB)
LSM tree stands for Log-Structured Merge tree, a data structure optimized for fast writes. All writes, including inserts, updates, and deletes, are append-only, trading off some read performance. This trade-off is based on the fact that sequential writes on mechanical disks are much faster than random writes. On SSDs, the difference between sequential and random writes is less significant, so the advantage is less pronounced.
An LSM tree typically consists of a write-ahead log (WAL), memtable, and SST files. The WAL ensures data is not lost, with writes first appended to the WAL. After writing to the WAL, the data is written to the memtable, usually implemented as a skip list for cache optimization and lock-free concurrent modifications. Once the memtable reaches a certain size, its contents are flushed to disk as SSTables. SSTables are compacted in the background to remove obsolete data.
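The write path described above (append to the WAL, insert into the memtable, flush to a sorted immutable table) can be sketched with maps and slices standing in for the skip list and on-disk SSTables; everything here is an in-memory toy, not a real engine.

```go
package main

import (
	"fmt"
	"sort"
)

// lsm is a toy LSM write path: every Put appends to the WAL first for
// durability, then updates the memtable; a full memtable is flushed as
// one sorted, immutable "SSTable".
type lsm struct {
	wal       []string
	memtable  map[string]string
	sstables  [][2][]string // each flushed table: sorted keys, values
	threshold int           // memtable size that triggers a flush
}

func (db *lsm) Put(k, v string) {
	db.wal = append(db.wal, k+"="+v) // durability first
	db.memtable[k] = v
	if len(db.memtable) >= db.threshold {
		db.flush()
	}
}

// flush sorts the memtable into an immutable table and resets the WAL,
// which is safe once the data lives in an SSTable.
func (db *lsm) flush() {
	keys := make([]string, 0, len(db.memtable))
	for k := range db.memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	vals := make([]string, len(keys))
	for i, k := range keys {
		vals[i] = db.memtable[k]
	}
	db.sstables = append(db.sstables, [2][]string{keys, vals})
	db.memtable = map[string]string{}
	db.wal = nil
}

func main() {
	db := &lsm{memtable: map[string]string{}, threshold: 2}
	db.Put("b", "2")
	db.Put("a", "1") // memtable reaches threshold -> flushed, sorted
	db.Put("c", "3") // stays in the fresh memtable
	fmt.Println(len(db.sstables), db.sstables[0][0]) // 1 [a b]
}
```

Background compaction, the part this sketch omits, merges such tables to discard obsolete versions and keep reads bounded.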
LevelDB is a typical LSM tree implementation open-sourced by Google. RocksDB is a high-performance single-node KV store developed by Facebook based on LevelDB, adding advanced features like column families, delete range, multi-threaded compaction, prefix seek, and user-defined properties, significantly improving functionality and performance.
The RocksDB (LevelDB) architecture is shown below; when the memtable is full, it is converted into an immutable memtable waiting to be dumped to disk.
RocksDB (LevelDB) Architecture
TiKV uses RocksDB for single-node persistent storage. rust-rocksdb is a Rust wrapper for RocksDB, allowing TiKV to operate RocksDB.
Since RocksDB is an optimized version of LevelDB, most LevelDB features are retained. Reading LevelDB-related materials helps understand RocksDB.
Reference:
LevelDB is a KV storage engine open-sourced by Google’s legendary engineers Jeff Dean and Sanjay Ghemawat. Its design and code are exquisite and elegant, worth savoring. The following blogs introduce LevelDB’s design and code details, covering design ideas, overall structure, read/write processes, and compaction processes for a comprehensive understanding of LevelDB.
https://draveness.me/bigtable-leveldb
LSM Tree has write amplification and read amplification issues. On SSDs, reducing write amplification improves write efficiency. Several papers optimize this:
Network Communication (gRPC)
[gRPC](https