TiDB Study Registration

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB学习报到

| username: TiDBer_dEcpBgBs

How can I quickly get started with learning TiDB?

| username: Billmay表妹 | Original post link

1. TiDB Basic Architecture

1.1 Articles on TiDB Architecture and Basic Principles:

2. Advanced Learning Materials

2.1 Kernel

The TiDB kernel covers all components of TiDB and TiKV. Contributors can work on improving TiDB’s performance, stability, usability, and so on in this area.

2.1.1 TiDB

Module Analysis

TiDB is the SQL layer of the cluster, responsible for client communication (protocol layer), syntax parsing (SQL Parser), query optimization (Optimizer), and executing query plans.

Protocol Layer

TiDB’s protocol is compatible with MySQL. For details on the MySQL protocol, refer to MySQL Protocol.
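
Because the protocol is MySQL-compatible, any MySQL client or driver can talk to TiDB. A minimal sketch in Go, assuming a local TiDB instance on the default port 4000 with user `root` and no password (adjust the DSN for your deployment):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // standard MySQL driver; TiDB speaks the MySQL wire protocol
)

func main() {
	// Assumed DSN: host, port, user, and database are placeholders for illustration.
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var version string
	if err := db.QueryRow("SELECT version()").Scan(&version); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected, server version:", version)
}
```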

SQL Parser

TiDB’s Parser is divided into two parts: the Lexer, written manually in Golang, and the Parser, implemented using goyacc. For implementation details, read TiDB Source Code Reading Series (5): TiDB SQL Parser Implementation. For a deeper understanding of yacc syntax, read “flex & bison”.
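
A quick way to see the Lexer and the goyacc-generated Parser in action is to use the parser package standalone. A minimal sketch, assuming the import paths used by recent TiDB versions, where the parser lives under github.com/pingcap/tidb/parser (older releases ship it as the standalone github.com/pingcap/parser):

```go
package main

import (
	"fmt"
	"log"

	"github.com/pingcap/tidb/parser"
	_ "github.com/pingcap/tidb/parser/test_driver" // value-expression implementation needed for standalone use
)

func main() {
	p := parser.New()
	// Parse one or more statements; charset and collation use defaults when empty.
	stmts, _, err := p.Parse("SELECT a, b FROM t WHERE a > 1", "", "")
	if err != nil {
		log.Fatal(err)
	}
	for _, stmt := range stmts {
		fmt.Printf("parsed AST node: %T\n", stmt) // e.g. *ast.SelectStmt
	}
}
```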

Schema Management & DDL

TiDB implements Google F1’s online schema change algorithm. For TiDB’s implementation, refer to TiDB’s Asynchronous Schema Change Implementation. For source code analysis, see TiDB Source Code Reading Series (17): DDL Source Code Analysis.

Expressions

The article Refactoring Built-in Functions for TiDB describes how to rewrite or add a built-in function for TiDB under the new computation framework.

SQL Optimizer

Most of the SQL optimizer logic is in the plan package, along with statistics and range calculation modules. Statistics are described in the next chapter. Range calculations are in the ranger package. Refer to the following articles:

Execution Engine

The executor consists of two parts: one on the TiDB side and the other on the TiKV (or Mock-TiKV) side. Mock-TiKV is mainly used for unit testing and partially implements TiKV logic.

This component processes data strictly according to the semantics of each physical operator and produces the final results. The article MPP and SMP in TiDB introduces some of the executor’s architecture. For source code analysis, refer to:
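
TiDB’s executors follow a Volcano-style iterator model: each physical operator pulls data from its children via a Next call. The sketch below is a deliberately simplified, hypothetical illustration of that pattern (the real interface is batch-oriented, working on chunk.Chunk and a context), not TiDB’s actual code:

```go
package main

import "fmt"

// Row is a simplified stand-in for TiDB's chunk-based row batches.
type Row []interface{}

// Executor is a hypothetical Volcano-style operator interface.
// Next returns the next row, or nil when the operator is exhausted.
type Executor interface {
	Next() (Row, error)
}

// tableScan is a toy leaf operator emitting rows from an in-memory slice.
type tableScan struct {
	rows []Row
	pos  int
}

func (t *tableScan) Next() (Row, error) {
	if t.pos >= len(t.rows) {
		return nil, nil
	}
	r := t.rows[t.pos]
	t.pos++
	return r, nil
}

// selection is a toy filter operator that pulls from a child and keeps matching rows.
type selection struct {
	child  Executor
	filter func(Row) bool
}

func (s *selection) Next() (Row, error) {
	for {
		r, err := s.child.Next()
		if err != nil || r == nil {
			return r, err
		}
		if s.filter(r) {
			return r, nil
		}
	}
}

func main() {
	scan := &tableScan{rows: []Row{{1, "a"}, {2, "b"}, {3, "c"}}}
	sel := &selection{child: scan, filter: func(r Row) bool { return r[0].(int) >= 2 }}
	for {
		r, _ := sel.Next()
		if r == nil {
			break
		}
		fmt.Println(r) // [2 b], then [3 c]
	}
}
```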

TiKV Client

The TiKV Client is the module in TiDB responsible for interacting with TiKV, including two-phase commit and Coprocessor interactions. Currently, there are two language versions of the TiKV Client: Golang, located at store/tikv, and Java, located at tikv-client-lib-java, mainly used for the TiSpark project. The Rust-client is available at client-rust, though its functionality is still limited.

For source code analysis, refer to:

Distributed Transactions

TiDB supports distributed transactions, based on the Percolator model, an optimized 2PC algorithm. For TiDB’s implementation, refer to Transaction in TiDB. The original Percolator is an optimistic transaction algorithm. In version 3.0, TiDB introduced pessimistic transactions (experimental). For implementation details, refer to New Features in TiDB: Pessimistic Transactions.
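
The optimistic Percolator flow can be summarized as: prewrite every key at a start timestamp, with the first key acting as the primary lock; then, after obtaining a commit timestamp, commit the primary key. Once the primary commit succeeds the transaction is effectively committed, and the secondary keys can be committed asynchronously. A heavily simplified, hypothetical sketch of that control flow (the types and function names here are illustrative, not TiDB’s real API):

```go
package main

import (
	"errors"
	"fmt"
)

// Mutation is a hypothetical write belonging to one transaction.
type Mutation struct {
	Key, Value []byte
}

// kvStore abstracts the prewrite/commit requests a client would send to TiKV.
type kvStore interface {
	Prewrite(m Mutation, primary []byte, startTS uint64) error
	Commit(key []byte, startTS, commitTS uint64) error
}

// twoPhaseCommit illustrates the optimistic Percolator flow.
func twoPhaseCommit(store kvStore, muts []Mutation, startTS, commitTS uint64) error {
	if len(muts) == 0 {
		return nil
	}
	primary := muts[0].Key

	// Phase 1: prewrite every key at startTS; a conflict or existing lock aborts.
	for _, m := range muts {
		if err := store.Prewrite(m, primary, startTS); err != nil {
			return errors.New("prewrite failed, transaction aborted: " + err.Error())
		}
	}

	// Phase 2: committing the primary key is the commit point of the transaction.
	if err := store.Commit(primary, startTS, commitTS); err != nil {
		return err
	}
	// Secondary keys can be committed asynchronously; a reader that still sees
	// their locks resolves them by checking the primary key's status.
	for _, m := range muts[1:] {
		_ = store.Commit(m.Key, startTS, commitTS)
	}
	return nil
}

// memStore is a toy in-memory stand-in for TiKV, just enough to run the sketch.
type memStore struct {
	locks map[string]uint64
	data  map[string]string
}

func (s *memStore) Prewrite(m Mutation, primary []byte, startTS uint64) error {
	k := string(m.Key)
	if _, locked := s.locks[k]; locked {
		return errors.New("key locked by another transaction: " + k)
	}
	s.locks[k] = startTS
	return nil
}

func (s *memStore) Commit(key []byte, startTS, commitTS uint64) error {
	k := string(key)
	delete(s.locks, k)
	s.data[k] = fmt.Sprintf("committed@%d", commitTS)
	return nil
}

func main() {
	store := &memStore{locks: map[string]uint64{}, data: map[string]string{}}
	muts := []Mutation{{Key: []byte("a")}, {Key: []byte("b")}}
	if err := twoPhaseCommit(store, muts, 100, 101); err != nil {
		panic(err)
	}
	fmt.Println(store.data) // map[a:committed@101 b:committed@101]
}
```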

For TiDB’s transaction logic, refer to TiDB Source Code Reading Series (19): tikv-client (Part 2).

Main Read/Write Code Flow in TiDB

https://asktug.com/t/topic/752793

Common Issues with TiDB (Local Deployment, Performance Tuning, etc.)

Testing Cluster Performance

How to Test TiDB with Sysbench

How to Perform TPC-C Testing on TiDB

TiDB/TiKV Performance Tuning

Understanding TiDB Execution Plans introduces how to use the EXPLAIN statement to understand how TiDB executes a query.

Overview of SQL Optimization Processes introduces several optimizations that TiDB can use to improve query performance.

Controlling Execution Plans introduces how to influence execution plan generation. When the plan TiDB chooses is not optimal, these methods can be used to steer it toward a better one.

TiDB Memory Tuning

TiDB Performance Tuning (video)

TiKV Thread Pool Performance Tuning

Tuning Tools

To identify TiKV’s performance bottlenecks, we first need to use profiling tools. Here are two commonly used profiling tools for reference:

2.1.2 TiKV

TiKV is the underlying storage for TiDB. There is a series of articles for an in-depth introduction to TiKV principles: Deep Dive Into TiKV and TiKV Source Code Analysis Series.

TiKV can be divided into multiple layers, each with its own function, arranged from bottom to top:

  • RocksDB
  • Raft
  • RaftKV
  • MVCC
  • TXN KV
  • Coprocessor

RocksDB is a single-machine storage engine that stores all data in TiKV and provides a single-machine storage KV API. Details.

Raft is a consensus algorithm that represents the consistency layer in TiKV, ensuring state consistency among TiKV instances. It is responsible for copying data between TiKV replicas and is the cornerstone of TiDB’s high availability. Details.

RaftKV combines RocksDB and Raft to provide a distributed, strongly consistent KV API.

MVCC provides multi-version concurrency control API and transaction API. The MVCC layer implements multi-version and transactions by encoding keys with timestamps.
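
A simplified illustration of the timestamp-in-key idea is sketched below; the layout is assumed for illustration only (TiKV’s real encoding is memcomparable and differs in detail), but it shows why appending an inverted timestamp makes the newest version of a key sort first:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// encodeMVCCKey appends an inverted commit timestamp to the user key so that,
// when keys are sorted ascending, newer versions of the same key come first.
func encodeMVCCKey(userKey []byte, ts uint64) []byte {
	buf := make([]byte, 0, len(userKey)+8)
	buf = append(buf, userKey...)
	var tsBuf [8]byte
	binary.BigEndian.PutUint64(tsBuf[:], ^ts) // bitwise NOT inverts the sort order
	return append(buf, tsBuf[:]...)
}

func main() {
	keys := [][]byte{
		encodeMVCCKey([]byte("k1"), 100),
		encodeMVCCKey([]byte("k1"), 105),
		encodeMVCCKey([]byte("k1"), 90),
	}
	sort.Slice(keys, func(i, j int) bool { return string(keys[i]) < string(keys[j]) })
	for _, k := range keys {
		ts := ^binary.BigEndian.Uint64(k[len(k)-8:])
		fmt.Printf("key=%s ts=%d\n", k[:len(k)-8], ts)
	}
	// Output order: ts=105, ts=100, ts=90, so the newest version is found first.
}
```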

TXN KV combines RaftKV and MVCC to provide distributed transactions and multi-version concurrency control. TiDB calls its API.

The Coprocessor handles some operators pushed down by TiDB, performing part of the computation logic closer to the data. The Coprocessor is above the RaftKV and MVCC layers. TiDB converts queries into a DAG, which includes pushed-down expressions. The Coprocessor calculates data in TiKV based on these expressions and returns the results to TiDB.

Placement Driver (PD)

PD is a logical single point in the cluster, similar to the master server or meta server in many other systems. Internally, PD combines a number of different functions. For a brief introduction, refer to “Best Practices for PD Scheduling Strategies”. For source code-based introductions, see TiKV Source Code Analysis Series - Placement Driver and TiKV Source Code Analysis - PD Scheduler.

embed etcd

etcd is a distributed KV database based on Raft, which can be considered a simplified version of TiKV with only one Raft group and limited data storage.

PD uses etcd’s embed feature to embed etcd as a library within the same process and bind the HTTP and gRPC services required by PD to the ports listened to by etcd.

Multiple PDs use etcd’s interface to elect a leader to provide services. When the leader fails, other PDs elect a new leader to continue providing services.
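
The election mechanism PD relies on can be reproduced with etcd’s own client library. A minimal sketch of campaigning for leadership via etcd, assuming an etcd endpoint at 127.0.0.1:2379 (this shows the general pattern, not PD’s actual code):

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session is a keep-alive lease; if this process dies, its leadership lapses.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// All candidates campaign on the same key prefix; Campaign blocks until
	// this node becomes leader (or the context is cancelled).
	election := concurrency.NewElection(session, "/my-service/leader")
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("became leader, serving requests until the session expires")
}
```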

Meta Management and Routing

PD manages metadata, including global configurations (clusterID, replica count, etc.), TiKV node registration and deregistration, and region creation.

When TiDB needs to access data in a region, it first queries the region’s status and location from PD. TiKV instances synchronize metadata to PD through heartbeats. PD caches the metadata locally and organizes it in memory for efficient routing services.

Timestamp Allocation

The distributed transaction model requires globally ordered timestamps. The PD leader allocates timestamps (TS) and persists its allocation progress in etcd, so that timestamps remain monotonically increasing even across leader switches.
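
The allocated timestamp (TSO) is a 64-bit value composed of a physical part (wall-clock milliseconds) and a logical counter; the layout described in TiDB materials is the physical time shifted left by 18 bits plus the logical part. A small sketch of composing and decomposing such a timestamp (the helper names are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

const logicalBits = 18 // lower 18 bits hold the logical counter

// composeTS builds a TSO from a physical millisecond timestamp and a logical counter.
func composeTS(physicalMs int64, logical int64) uint64 {
	return uint64(physicalMs)<<logicalBits | uint64(logical)
}

// extractPhysical and extractLogical split a TSO back into its two parts.
func extractPhysical(ts uint64) int64 { return int64(ts >> logicalBits) }
func extractLogical(ts uint64) int64  { return int64(ts & ((1 << logicalBits) - 1)) }

func main() {
	now := time.Now().UnixMilli()
	ts := composeTS(now, 42)
	fmt.Printf("tso=%d physical=%v logical=%d\n",
		ts, time.UnixMilli(extractPhysical(ts)), extractLogical(ts))
}
```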

Scheduling

Scheduling mainly involves two aspects: replica management (replica placement) and load balancing (load rebalance). Replica management maintains the desired number of replicas and meets constraints such as isolating multiple replicas or distributing them to specific namespaces.

Load balancing adjusts the positions of region leaders or peers to balance the load. Various strategies are implemented to adapt to different business scenarios.

Consistency Protocol (Raft)

“Consistency” is a core issue in distributed systems. Since the proposal of the Paxos algorithm in 1990, consensus protocols based on message passing have become mainstream, and Raft is one of them.

Each node in a Raft cluster is in one of three states: Leader, Follower, or Candidate. When a node starts, it enters the Follower state, and a Leader is chosen through an election process. Raft uses the concept of terms to avoid split-brain scenarios, ensuring that there is at most one Leader per term. After a Leader is elected, all read and write requests to the Raft cluster go through the Leader.

The write path is called Log Replication: the Leader persists the client’s write request to its own log and replicates the log to the other replicas (Followers). Only after receiving acknowledgments from a majority does the Leader confirm the write to the client. The log changes are then applied to the state machine, a step called Apply. The same mechanism is also used to modify the Raft cluster configuration, such as adding or removing nodes.

For read requests, the classic Raft implementation handles them the same way as write requests, but optimizations such as read index and read lease can be applied.
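
As a compact illustration of the state and term rules described above, the sketch below models the three node states and two transitions every Raft implementation must enforce: stepping down to Follower whenever a higher term is observed, and becoming Leader only after winning a majority in the current term. This is an illustrative toy, not TiKV’s raft-rs:

```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// node carries the minimal Raft election state: the current term and role.
type node struct {
	term  uint64
	state State
}

// observeTerm enforces the rule "any message with a higher term forces a step-down".
func (n *node) observeTerm(msgTerm uint64) {
	if msgTerm > n.term {
		n.term = msgTerm
		n.state = Follower
	}
}

// startElection models the Follower -> Candidate transition on election timeout.
func (n *node) startElection() { n.term++; n.state = Candidate }

// winElection models Candidate -> Leader after collecting votes from a majority.
func (n *node) winElection(votes, clusterSize int) {
	if n.state == Candidate && votes*2 > clusterSize {
		n.state = Leader
	}
}

func main() {
	n := &node{}
	n.startElection()   // election timeout: become candidate in term 1
	n.winElection(2, 3) // 2 of 3 votes: become leader
	fmt.Println(n.term, n.state == Leader) // 1 true
	n.observeTerm(5) // a higher term appears: step down
	fmt.Println(n.term, n.state == Follower) // 5 true
}
```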

TiKV uses the Raft protocol to encapsulate a KV engine on top of RocksDB. This engine synchronizes writes to multiple replicas, ensuring data consistency and availability even if some nodes fail.

Storage Engine (LSM Tree & RocksDB)

LSM tree stands for Log-Structured Merge tree, a data structure optimized for fast writes. All writes, including inserts, updates, and deletes, are append-only, sacrificing some read performance. This trade-off is based on the fact that sequential writes on mechanical disks are much faster than random writes. On SSDs, the difference between sequential and random writes is less significant, so the advantage is not as pronounced.

An LSM tree typically consists of a write-ahead log (WAL), memtable, and SST files. The WAL ensures that written data is not lost, with write operations first appending the content to the WAL. After writing to the WAL, the data is written to the memtable, usually implemented as a skip list for cache optimization and lock-free concurrent modifications. Once the memtable reaches a certain size, its contents are flushed to disk as SSTable files. SST files are compacted in the background to remove obsolete data.
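
A toy sketch of that write path (append to the WAL, insert into the memtable, flush to an “SST” when the memtable grows past a threshold). Real engines add sorted on-disk formats, compaction, and crash recovery, all omitted here:

```go
package main

import "fmt"

// lsm is a toy LSM tree: a WAL log, an in-memory memtable, and flushed "SSTs".
type lsm struct {
	wal       []string            // append-only log of operations
	memtable  map[string]string   // in-memory structure (a plain map for simplicity)
	ssts      []map[string]string // flushed, immutable tables (newest last)
	threshold int
}

func (l *lsm) Put(k, v string) {
	// 1. Append to the WAL first so the write survives a crash.
	l.wal = append(l.wal, "PUT "+k+"="+v)
	// 2. Apply the write to the memtable.
	l.memtable[k] = v
	// 3. Flush when the memtable is full.
	if len(l.memtable) >= l.threshold {
		l.flush()
	}
}

// flush turns the current memtable into an immutable "SST" and starts a new one.
func (l *lsm) flush() {
	l.ssts = append(l.ssts, l.memtable)
	l.memtable = map[string]string{}
}

// Get checks the memtable first, then the SSTs from newest to oldest.
func (l *lsm) Get(k string) (string, bool) {
	if v, ok := l.memtable[k]; ok {
		return v, true
	}
	for i := len(l.ssts) - 1; i >= 0; i-- {
		if v, ok := l.ssts[i][k]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	db := &lsm{memtable: map[string]string{}, threshold: 2}
	db.Put("a", "1")
	db.Put("b", "2") // triggers a flush
	db.Put("a", "3") // newer version shadows the flushed one
	v, _ := db.Get("a")
	fmt.Println(v) // 3
}
```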

LevelDB is a typical LSM tree implementation open-sourced by Google. RocksDB is a high-performance single-machine KV store developed by Facebook based on LevelDB. RocksDB adds advanced features like column families, delete range, multi-threaded compaction, prefix seek, and user-defined properties, significantly improving functionality and performance.

RocksDB (LevelDB) architecture is shown below, where the memtable is converted to an immutable memtable and dumped to disk as SSTable files when full.

RocksDB (LevelDB) Architecture

TiKV uses RocksDB for single-machine persistent storage. rust-rocksdb is a Rust wrapper for RocksDB, allowing TiKV to operate RocksDB.

Since RocksDB is an optimized version of LevelDB, most LevelDB features are retained. Reading LevelDB-related materials is helpful for understanding RocksDB.

Reference:

LevelDB is a KV storage engine open-sourced by Google’s legendary engineers Jeff Dean and Sanjay Ghemawat. Its design and code are exquisite and elegant, worth savoring. This blog series introduces LevelDB’s design and code details, covering design ideas, overall structure, read/write processes, and compression processes for a comprehensive understanding of LevelDB.

https://draveness.me/bigtable-leveldb

LSM Tree has write amplification and read amplification issues. On SSDs, reducing write amplification can improve write efficiency. Several papers optimize this:

Network Communication (gRPC)

gRPC is a high-performance, open-source RPC framework built on HTTP/2 and Protocol Buffers. TiKV uses gRPC for network communication with TiDB, PD, and other TiKV nodes.
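
A minimal sketch of opening a gRPC client connection from Go; the address is assumed to be a TiKV instance on its default port 20160, and issuing real requests additionally requires the kvproto-generated stubs, which are omitted here:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// grpc.Dial is lazy: the connection is established on the first RPC.
	conn, err := grpc.Dial("127.0.0.1:20160",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// With the kvproto-generated stubs, one would now create a service client
	// on top of conn and send KV, Coprocessor, or Raft requests.
	log.Println("gRPC channel created:", conn.Target())
}
```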

| username: tidb菜鸟一只 | Original post link

Read the official documentation more, watch more videos, visit forums more often, and practice more. That’s it.