Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: TiDB 学习资料大全 (a comprehensive collection of TiDB learning materials)
1. TiDB Basic Architecture
1.1 Articles on TiDB Architecture and Basic Principles:
- TiDB Architecture
- Storage
- Computing
- Scheduling
- TiDB Source Code Reading Series (2): Introduction to TiDB Source Code
- TiDB Source Code Reading Series (3): The Life of an SQL
2. Advanced Learning Materials
2.1 Kernel
The TiDB kernel covers all components of TiDB and TiKV. Participants in this area can work on improving TiDB’s performance, stability, and usability.
2.1.1 TiDB
Module Analysis
TiDB is the SQL layer of the cluster, responsible for client communication (protocol layer), syntax parsing (SQL Parser), query optimization (Optimizer), and executing query plans.
Protocol Layer
TiDB’s protocol is compatible with MySQL. For details, refer to MySQL Protocol.
SQL Parser
TiDB’s Parser is divided into two parts: the Lexer, hand-written in Go, and the Parser, generated with goyacc. For implementation details, read TiDB Source Code Reading Series (5): TiDB SQL Parser Implementation. For a deeper understanding of yacc grammars, read flex & bison.
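For a hands-on feel for the Lexer/Parser, the sketch below parses a single statement with TiDB’s standalone parser package. It is a minimal sketch: the import paths (the parser has lived at github.com/pingcap/parser and later under github.com/pingcap/tidb/parser) and the blank test_driver import vary between versions, so treat them as assumptions and check the package README for your version.

```go
package main

import (
	"fmt"

	"github.com/pingcap/tidb/parser"
	_ "github.com/pingcap/tidb/parser/test_driver" // value-expression driver required by the parser
)

func main() {
	p := parser.New()
	// Parse returns the AST nodes for the statements in the input SQL.
	stmts, warns, err := p.Parse("SELECT a, b FROM t WHERE a > 1", "", "")
	if err != nil {
		panic(err)
	}
	for _, w := range warns {
		fmt.Println("warning:", w)
	}
	// Each element is an ast.StmtNode; print the concrete type of the first one.
	fmt.Printf("parsed %d statement(s), first is %T\n", len(stmts), stmts[0])
}
```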
Schema Management & DDL
TiDB implements Google F1’s online schema change algorithm. For TiDB’s implementation, refer to Asynchronous Schema Change Implementation in TiDB. For source code analysis, see TiDB Source Code Reading Series (17): DDL Source Code Analysis.
Expressions
The article Refactoring Built-in Functions for TiDB describes how to rewrite or add built-in functions under TiDB’s new expression evaluation framework.
SQL Optimizer
Most of the SQL optimizer logic is in the plan package. Related modules include statistics and range calculation. Statistics are described in the next chapter. Range calculation is in the ranger package. Refer to the following articles:
- TiDB Source Code Reading Series (6): Overview of Select Statements
- TiDB Source Code Reading Series (7): Rule-Based Optimization
- TiDB Source Code Reading Series (21): Rule-Based Optimization II
- TiDB Source Code Reading Series (8): Cost-Based Optimization
- TiDB Source Code Reading Series (12): Statistics (Part 1)
- TiDB Source Code Reading Series (13): Introduction to Index Range Calculation
- TiDB Source Code Reading Series (14): Statistics (Part 2)
Execution Engine
The executor consists of two parts: one on the TiDB side and the other on the TiKV (or Mock-TiKV) side. Mock-TiKV is mainly used for unit testing and partially implements TiKV logic.
This part processes data strictly according to the requirements of the physical operators and produces the final results. The article MPP and SMP in TiDB introduces some of the executor’s architecture; a simplified sketch of the execution interface follows the article list below. For source code analysis, refer to:
- TiDB Source Code Reading Series (10): Introduction to Chunk and Execution Framework
- TiDB Source Code Reading Series (9): Hash Join
- TiDB Source Code Reading Series (11): Index Lookup Join
- TiDB Source Code Reading Series (15): Sort Merge Join
- TiDB Source Code Reading Series (22): Hash Aggregation
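TiDB’s executors pull data in batches (Chunks) through a volcano-style interface. The sketch below is a self-contained, simplified imitation of that shape rather than TiDB’s actual types; the Chunk, Executor, and tableScan names here are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// Chunk is a simplified stand-in for TiDB's columnar batch (util/chunk).
type Chunk struct {
	Rows [][]interface{}
}

// Executor mirrors the shape of a volcano-style, batch-at-a-time executor:
// Open prepares resources, Next fills one Chunk, Close releases resources.
type Executor interface {
	Open(ctx context.Context) error
	Next(ctx context.Context, chk *Chunk) error // an empty chunk signals exhaustion
	Close() error
}

// tableScan is a toy leaf operator producing rows from an in-memory slice.
type tableScan struct {
	data [][]interface{}
	pos  int
}

func (t *tableScan) Open(ctx context.Context) error { t.pos = 0; return nil }
func (t *tableScan) Close() error                   { return nil }
func (t *tableScan) Next(ctx context.Context, chk *Chunk) error {
	chk.Rows = chk.Rows[:0]
	for batch := 0; batch < 2 && t.pos < len(t.data); batch++ { // batch size 2 for the demo
		chk.Rows = append(chk.Rows, t.data[t.pos])
		t.pos++
	}
	return nil
}

func main() {
	var e Executor = &tableScan{data: [][]interface{}{{1, "a"}, {2, "b"}, {3, "c"}}}
	ctx := context.Background()
	_ = e.Open(ctx)
	defer e.Close()
	for {
		chk := &Chunk{}
		if err := e.Next(ctx, chk); err != nil || len(chk.Rows) == 0 {
			break
		}
		fmt.Println(chk.Rows)
	}
}
```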
TiKV Client
The TiKV Client is the module in TiDB responsible for interacting with TiKV, including two-phase commit and Coprocessor interactions. Currently, there are two language versions of the TiKV Client: Go, located at store/tikv, and Java, located at tikv-client-lib-java, which is mainly used by the TiSpark project. A Rust client is available at client-rust, though its functionality is still limited.
Two simple examples of calling the TiKV Client’s KV interface are benchkv and benchrawkv.
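In the same spirit as benchrawkv, the sketch below puts and gets one key through the raw KV API, assuming the tikv/client-go v2 rawkv package and a PD endpoint at 127.0.0.1:2379. The constructor name and options have changed between client-go versions, so treat the exact signatures as assumptions and follow the client-go documentation.

```go
package main

import (
	"context"
	"fmt"

	"github.com/tikv/client-go/v2/rawkv"
)

func main() {
	ctx := context.Background()
	// Assumes a local PD at 127.0.0.1:2379; the constructor name and its
	// options differ between client-go versions, check the client-go README.
	cli, err := rawkv.NewClientWithOpts(ctx, []string{"127.0.0.1:2379"})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	key, val := []byte("hello"), []byte("tikv")
	if err := cli.Put(ctx, key, val); err != nil {
		panic(err)
	}
	got, err := cli.Get(ctx, key)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s => %s\n", key, got)
	_ = cli.Delete(ctx, key)
}
```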
Source Code Analysis:
- TiDB Source Code Reading Series (18): tikv-client (Part 1)
- TiDB Source Code Reading Series (19): tikv-client (Part 2)
Distributed Transactions
TiDB supports distributed transactions based on the Percolator model, an optimized two-phase commit (2PC) protocol. For TiDB’s implementation, refer to Transaction in TiDB. The original Percolator is an optimistic transaction algorithm; in version 3.0, TiDB introduced pessimistic transactions as an experimental feature. For implementation details, see New Features in TiDB: Pessimistic Transactions.
For TiDB’s transaction logic, refer to TiDB Source Code Reading Series (19): tikv-client (Part 2).
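From the client side, the transaction mode can be chosen per transaction through TiDB’s MySQL-compatible SQL layer. The sketch below pins one connection from Go’s database/sql pool and runs a pessimistic transaction explicitly; the DSN, the accounts table, and the default SQL port 4000 are assumptions made for this example.

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// TiDB speaks the MySQL protocol; 4000 is its default SQL port.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()
	// Pin a single connection so all statements share one session.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// BEGIN PESSIMISTIC starts a pessimistic transaction explicitly;
	// a plain BEGIN follows the tidb_txn_mode session variable.
	if _, err := conn.ExecContext(ctx, "BEGIN PESSIMISTIC"); err != nil {
		log.Fatal(err)
	}
	if _, err := conn.ExecContext(ctx,
		"UPDATE accounts SET balance = balance - 10 WHERE id = ?", 1); err != nil {
		log.Fatal(err)
	}
	if _, err := conn.ExecContext(ctx, "COMMIT"); err != nil {
		log.Fatal(err)
	}
}
```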
TiDB Read/Write Code Main Process
Common Issues with TiDB (Local Deployment, Performance Tuning, etc.)
Testing Cluster Performance
How to Test TiDB with Sysbench
How to Perform TPC-C Testing on TiDB
TiDB/TiKV Performance Tuning
Understanding TiDB Execution Plans explains how to use the EXPLAIN statement to understand how TiDB executes a query.
Overview of SQL Optimization Processes introduces several optimizations that TiDB can use to improve query performance.
Controlling Execution Plans explains how to influence plan generation, for example with optimizer hints, when the plan TiDB chooses is not optimal.
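Because TiDB is MySQL-compatible, EXPLAIN output can be inspected from any MySQL client or driver. The sketch below runs EXPLAIN, with a hypothetical table t, index idx_a, and a USE_INDEX hint, over Go’s database/sql and prints whatever columns the server returns, since the plan columns differ across TiDB versions.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// EXPLAIN shows the plan without running the query; EXPLAIN ANALYZE
	// runs it and adds runtime statistics. Table t, index idx_a, and the
	// hint below are placeholders for this sketch.
	rows, err := db.Query("EXPLAIN SELECT /*+ USE_INDEX(t, idx_a) */ * FROM t WHERE a > 100")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, _ := rows.Columns()
	fmt.Println(cols)
	for rows.Next() {
		vals := make([]sql.RawBytes, len(cols))
		ptrs := make([]interface{}, len(cols))
		for i := range vals {
			ptrs[i] = &vals[i]
		}
		if err := rows.Scan(ptrs...); err != nil {
			log.Fatal(err)
		}
		for _, v := range vals {
			fmt.Printf("%s\t", v)
		}
		fmt.Println()
	}
}
```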
TiDB Performance Tuning (Video)
TiKV Thread Pool Performance Tuning
Tuning Tools
To locate TiKV’s performance bottlenecks, start with a profiler. Two commonly used profiling tools are listed below for reference:
- TiKV Flame Graph Tool
- VTune Hotspots Analysis Guide
2.1.2 TiKV
TiKV is the underlying storage for TiDB. Here is a series of articles that deeply introduce TiKV principles: Deep Dive Into TiKV and TiKV Source Code Analysis Series.
TiKV can be divided into multiple layers, each with its own functions, arranged from bottom to top:
- RocksDB
- Raft
- Raft KV
- MVCC
- TXN KV
- Coprocessor
RocksDB is a single-node storage engine that stores all of TiKV’s data and provides a single-node KV API. Details.
Raft is a consensus algorithm that forms TiKV’s consistency layer, keeping state consistent across TiKV nodes. It replicates data between TiKV replicas and is the cornerstone of TiDB’s high availability. Details.
RaftKV combines RocksDB and Raft to provide a distributed, strongly consistent basic KV API.
MVCC provides a multi-version concurrency control API and a transaction API. The MVCC layer achieves multi-versioning and transactions by encoding a timestamp into each key (a conceptual sketch of this encoding follows the Coprocessor description below).
TXN KV combines RaftKV and MVCC to provide distributed transactions and multi-version concurrency control; this is the API that TiDB calls.
Coprocessor handles some operators pushed down by TiDB, taking on part of the computing logic closer to the data. Coprocessor is above the RaftKV and MVCC layers. TiDB converts queries into a DAG, which includes pushed-down expressions. The Coprocessor calculates data in TiKV based on the expressions and returns the results to TiDB.
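To make the timestamp-in-key idea concrete, the sketch below appends a bit-inverted, big-endian commit timestamp to the user key, so that newer versions of the same key sort first under plain byte order. This is a conceptual illustration, not TiKV’s exact on-disk encoding (which also memcomparably encodes the user key itself).

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
)

// encodeMVCCKey appends a commit timestamp to the user key so that, under
// byte-wise ascending order, newer versions of the same key sort first.
func encodeMVCCKey(userKey []byte, ts uint64) []byte {
	buf := make([]byte, 0, len(userKey)+8)
	buf = append(buf, userKey...)
	var tsBuf [8]byte
	binary.BigEndian.PutUint64(tsBuf[:], ^ts) // bit-inverted => descending by ts
	return append(buf, tsBuf[:]...)
}

func main() {
	keys := [][]byte{
		encodeMVCCKey([]byte("a"), 100),
		encodeMVCCKey([]byte("a"), 105),
		encodeMVCCKey([]byte("b"), 90),
	}
	sort.Slice(keys, func(i, j int) bool { return bytes.Compare(keys[i], keys[j]) < 0 })
	for _, k := range keys {
		userKey := k[:len(k)-8]
		ts := ^binary.BigEndian.Uint64(k[len(k)-8:])
		fmt.Printf("key=%s ts=%d\n", userKey, ts) // a@105, a@100, b@90
	}
}
```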
Placement Driver (PD)
PD is a logical single point in the cluster, similar to master servers or meta servers in many systems. PD’s internal structure is a composite of various functions. For a brief introduction, refer to Best Practices for PD Scheduling Strategies. For source code-based introductions, see TiKV Source Code Analysis Series - Placement Driver and TiKV Source Code Analysis - PD Scheduler.
embed etcd
etcd is a distributed KV database based on Raft; it can be thought of as a simplified TiKV with a single Raft group that stores only a small amount of data.
PD uses etcd’s embed feature to run etcd as a library inside the same process and binds PD’s HTTP and gRPC services to the ports that etcd listens on (a minimal standalone sketch of this pattern follows below).
Multiple PDs use etcd’s interface to elect a leader to provide services. When the leader fails, other PDs elect a new leader to continue providing services.
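A minimal sketch of embedding etcd in-process is shown below, using the etcd embed package (the import path is go.etcd.io/etcd/server/v3/embed for etcd 3.5+; older releases expose go.etcd.io/etcd/embed). The data directory and timeout are placeholders; PD additionally mounts its own handlers on the same listeners and uses etcd’s election primitives to elect the PD leader.

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

func main() {
	// PD embeds etcd in the same process via the embed package; this
	// standalone sketch shows the same pattern with placeholder settings.
	cfg := embed.NewConfig()
	cfg.Dir = "default.etcd" // data directory (placeholder)

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()

	select {
	case <-e.Server.ReadyNotify():
		log.Println("embedded etcd is ready")
	case <-time.After(30 * time.Second):
		log.Fatal("embedded etcd took too long to start")
	}
	// PD goes on to register its own HTTP/gRPC handlers on the same
	// listeners and uses etcd's election primitives for PD leader election.
}
```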
Meta Management and Routing
PD manages metadata, including global configurations (clusterID, replica count, etc.), TiKV node registration and deregistration, and the creation of Region metadata.
When TiDB needs to access a region’s data, it first queries the region’s status and location from PD. TiKV instances synchronize metadata to PD through heartbeats. PD caches the metadata locally and organizes it into a structure for efficient querying to provide routing services.
Timestamp Allocation
The distributed transaction model requires globally ordered timestamps. The PD leader is responsible for allocating timestamps (TS) and, with etcd’s help, guarantees that they remain monotonically increasing even across leader switches.
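The timestamps PD hands out (the TSO) are commonly described as a 64-bit value packing a physical part in milliseconds with a logical counter in the low bits; the 18-bit logical width used below matches current implementations but should be treated as an assumption of this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// A TSO-style timestamp: high bits carry a physical time in milliseconds,
// low bits carry a logical counter that disambiguates timestamps allocated
// within the same millisecond.
const logicalBits = 18

func composeTS(physicalMs int64, logical int64) uint64 {
	return uint64(physicalMs)<<logicalBits | uint64(logical)
}

func extractTS(ts uint64) (physicalMs int64, logical int64) {
	return int64(ts >> logicalBits), int64(ts & ((1 << logicalBits) - 1))
}

func main() {
	now := time.Now().UnixMilli()
	ts := composeTS(now, 42)
	p, l := extractTS(ts)
	fmt.Printf("ts=%d physical=%s logical=%d\n", ts, time.UnixMilli(p), l)
}
```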
Scheduling
Scheduling mainly involves two aspects: replica management (replica placement) and load balancing (load rebalance). Replica management maintains the desired number of replicas and satisfies constraints like isolating multiple replicas or distributing them to specific namespaces. Load balancing adjusts region leader or peer positions to balance the load, with various strategies for different business scenarios.
Consistency Protocol (Raft)
“Consensus” is a core problem in distributed systems. Since the Paxos algorithm was proposed in the 1990s, message-passing-based consensus protocols have become mainstream, and Raft is one of them.
In a Raft cluster, each node is in one of three states: Leader, Follower, or Candidate. A node starts as a Follower, and the cluster elects a new Leader through Leader Election. Raft uses the concept of a term to avoid split brain, guaranteeing at most one Leader per term.
Once a Leader is elected, all read and write requests go through it. The write path is called Log Replication: the Leader persists the client’s write to its own log, replicates the log to the other replicas (Followers), and confirms the write to the client only after a majority has acknowledged it. Raft then applies the log entry to its state machine, a step called Apply. The same mechanism is used to change the Raft cluster configuration, for example adding or removing nodes.
For read requests, the classic Raft implementation handles them the same way as writes, but optimizations such as read index and read lease can be applied.
TiKV uses the Raft protocol to encapsulate a KV engine on top of RocksDB. This engine synchronizes writes to multiple replicas, ensuring data consistency and availability even if some nodes fail.
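The term rule is the part of Raft that prevents split brain: a node grants at most one vote per term and steps down when it sees a higher term. The sketch below shows only that bookkeeping; the log-up-to-date check, RPC plumbing, and timers that a real implementation (such as the raft-rs library TiKV uses) needs are deliberately omitted.

```go
package main

import "fmt"

// Raft node states.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// node is a drastically simplified sketch of Raft's term/vote bookkeeping:
// a node grants at most one vote per term and steps down whenever it sees
// a higher term.
type node struct {
	state    State
	term     uint64
	votedFor string // "" means no vote granted in the current term
}

// handleRequestVote decides whether to grant a vote to a candidate.
func (n *node) handleRequestVote(candidate string, candidateTerm uint64) bool {
	if candidateTerm < n.term {
		return false // stale candidate
	}
	if candidateTerm > n.term {
		// Newer term observed: step down and reset the vote.
		n.term = candidateTerm
		n.state = Follower
		n.votedFor = ""
	}
	if n.votedFor == "" || n.votedFor == candidate {
		n.votedFor = candidate
		return true
	}
	return false
}

func main() {
	n := &node{state: Follower, term: 3}
	fmt.Println(n.handleRequestVote("A", 4)) // true: first vote in term 4
	fmt.Println(n.handleRequestVote("B", 4)) // false: already voted in term 4
}
```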
Storage Engine (LSM Tree & RocksDB)
LSM tree stands for Log-Structured Merge tree, a data structure optimized for fast writes, including inserts, updates, and deletes. This is achieved at the cost of some read performance. The trade-off is based on the fact that sequential writes on mechanical disks are much faster than random writes. By making all writes sequential, high write efficiency can be achieved on mechanical disks. On SSDs, the difference between sequential and random writes is less significant, so the advantage is less pronounced.
An LSM tree typically consists of three parts: the write-ahead log (WAL), the memtable, and SST files. The WAL ensures that written data is not lost: a write first appends its content to the WAL. The write then goes into the memtable, usually implemented as a skip list, which keeps keys ordered and supports concurrent, even lock-free, inserts. Once the memtable is full, its contents are flushed to disk as SSTables. SST files are compacted in the background to remove obsolete data.
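The sketch below is a toy Go version of that write path: WAL append first, then memtable insert, then a flush to a sorted file once the memtable passes a size threshold. The file names, the map-based memtable, and the text record format are all simplifications; real engines use skip lists, immutable memtables, manifest files, and background compaction.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// lsm is a toy LSM write path: WAL first, then memtable, then flush.
type lsm struct {
	wal      *os.File
	memtable map[string]string
	limit    int
	sstSeq   int
}

func newLSM(walPath string, limit int) (*lsm, error) {
	f, err := os.OpenFile(walPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &lsm{wal: f, memtable: map[string]string{}, limit: limit}, nil
}

func (l *lsm) Put(key, value string) error {
	// 1. Durability first: append the record to the write-ahead log.
	if _, err := fmt.Fprintf(l.wal, "%s=%s\n", key, value); err != nil {
		return err
	}
	// 2. Then apply it to the in-memory memtable.
	l.memtable[key] = value
	// 3. Flush to an SST-like sorted file when the memtable is full.
	if len(l.memtable) >= l.limit {
		return l.flush()
	}
	return nil
}

func (l *lsm) flush() error {
	keys := make([]string, 0, len(l.memtable))
	for k := range l.memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys) // SST files store keys in sorted order
	f, err := os.Create(fmt.Sprintf("sst-%03d.txt", l.sstSeq))
	if err != nil {
		return err
	}
	defer f.Close()
	for _, k := range keys {
		fmt.Fprintf(f, "%s=%s\n", k, l.memtable[k])
	}
	l.sstSeq++
	l.memtable = map[string]string{}
	return nil
}

func main() {
	db, err := newLSM("wal.log", 2)
	if err != nil {
		panic(err)
	}
	_ = db.Put("b", "2")
	_ = db.Put("a", "1") // triggers a flush of {a, b}
	_ = db.Put("c", "3")
}
```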
LevelDB is a typical LSM tree implementation open-sourced by Google. RocksDB is a high-performance single-node KV store developed by Facebook based on LevelDB. RocksDB adds advanced features like column families, delete range, multi-threaded compaction, prefix seek, and user-defined properties, significantly improving functionality and performance.
In the RocksDB (LevelDB) architecture, a memtable that fills up becomes an immutable memtable and waits to be flushed to disk.
[Figure: RocksDB (LevelDB) Architecture]
TiKV uses RocksDB for single-node persistent storage. rust-rocksdb is a Rust wrapper around RocksDB, through which TiKV operates RocksDB.
Since RocksDB is an optimized version of LevelDB, most LevelDB features are retained. Reading LevelDB-related materials is helpful for understanding RocksDB.
Reference:
LevelDB is a KV storage engine open-sourced by Google’s legendary engineers Jeff Dean and Sanjay Ghemawat; its design and code are exquisite, elegant, and well worth studying. The following blog posts introduce LevelDB’s design philosophy, overall structure, read/write paths, and compaction process, giving an overall picture of LevelDB.