Questions and Confusion Encountered While Studying Database Source Code Recently, Seeking Advice from Experienced Developers

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 最近学习数据库源码 遇到的疑惑,希望过来人帮忙解答

| username: 程序员小王

Over the past two years, whenever I have wanted to read database source code:

Question 1: With only 4-5 hours of free time a week, should I read TiDB, OB, or Tegine? I keep hesitating because my goal is not clear.

  1. I can’t settle down to read the code at all. With no documentation and no prior understanding, I am even more reluctant to start; there is so much that I don’t know where to begin, and in the end I read nothing.

This has dragged on for several months, and I regret it whenever I think about it.
Question 2: The key issue is that my company doesn’t use these databases, so I haven’t planned any in-depth study and have only browsed around.

From the number of documents and PR submissions: TiDB > Tegine > OB

| username: ealam_小羽 | Original post link

First, watch some of the course videos to understand the basic design and concepts; after that, it may be easier to read the source code.

| username: ohammer | Original post link

First, pay homage to the expert.

| username: 托马斯滑板鞋 | Original post link

Here are two good recommendations in English:
MIT 6.830
CMU 15-445
P.S.: For the second one, you can search for other people’s study notes online to help you understand.

| username: 程序员小王 | Original post link

What I understand is: that alone doesn’t seem to help much. You can find answers to your questions by reading the source code (and watching the videos), but you still have to work out the specific details on your own.

| username: Billmay表妹 | Original post link

1. TiDB Basic Architecture

1.1 Articles on TiDB Architecture and Basic Principles:

2. Advanced Learning Materials

2.1 Kernel

The TiDB kernel covers all components of TiDB and TiKV. Contributors to this project can work on improving TiDB’s performance, stability, usability, and so on.

2.1.1 TiDB

Module Analysis

TiDB is the SQL layer of the cluster, responsible for client communication (protocol layer), syntax parsing (SQL Parser), query optimization (Optimizer), and executing query plans.

Protocol Layer

TiDB’s protocol is compatible with MySQL. For details on the MySQL protocol, refer to MySQL Protocol.
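
A quick way to see this compatibility in practice is to connect with the standard Go MySQL driver. The sketch below assumes a local TiDB instance listening on its default port 4000; the DSN and database name are illustrative.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL wire protocol
)

func main() {
	// Assumes a local TiDB server on the default port 4000; adjust the DSN as needed.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var version string
	if err := db.QueryRow("SELECT version()").Scan(&version); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected, server version:", version)
}
```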

SQL Parser

TiDB’s Parser is divided into two parts: the Lexer, written manually in Golang, and the Parser, implemented using goyacc. For an in-depth understanding of the Parser, read TiDB Source Code Reading Series (5): Implementation of TiDB SQL Parser. For a deeper understanding of yacc syntax, read “flex & bison”.
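
As a rough illustration, the sketch below parses a statement with the standalone parser package. The module path and exact API have moved between TiDB versions (e.g. `github.com/pingcap/parser` vs. the in-tree `parser` package), so treat the import paths here as assumptions.

```go
package main

import (
	"fmt"
	"log"

	"github.com/pingcap/parser"
	_ "github.com/pingcap/parser/test_driver" // provides the value-expression implementation the parser needs
)

func main() {
	p := parser.New()
	// Parse takes the SQL text plus charset/collation (empty = defaults) and
	// returns the statement AST nodes and any warnings.
	stmtNodes, warns, err := p.Parse("SELECT a, b FROM t WHERE a > 1", "", "")
	if err != nil {
		log.Fatal(err)
	}
	for _, w := range warns {
		log.Println("warning:", w)
	}
	for _, node := range stmtNodes {
		fmt.Printf("parsed statement node: %T\n", node) // e.g. *ast.SelectStmt
	}
}
```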

Schema Management & DDL

TiDB implements Google F1’s online schema change algorithm. For TiDB’s implementation, refer to Asynchronous Schema Change Implementation in TiDB. For source code analysis, refer to TiDB Source Code Reading Series (17): DDL Source Code Analysis.
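
For intuition, the F1 algorithm moves a schema object through a sequence of intermediate states so that any two schema versions active in the cluster at the same time remain compatible. Below is a purely illustrative sketch of those states, not TiDB’s actual DDL code; the constant names are assumptions.

```go
package main

import "fmt"

// SchemaState mirrors the intermediate states used by F1-style online schema change.
// Names are illustrative; TiDB's real DDL code defines its own constants.
type SchemaState int

const (
	StateNone       SchemaState = iota // object does not exist yet
	StateDeleteOnly                    // deletes maintain the object, but it is invisible to reads/inserts
	StateWriteOnly                     // writes maintain the object, but reads still cannot use it
	StateWriteReorg                    // backfill existing data (e.g. filling a new index)
	StatePublic                        // fully visible to reads and writes
)

func (s SchemaState) String() string {
	return [...]string{"none", "delete-only", "write-only", "write-reorganization", "public"}[s]
}

func main() {
	// A DDL job advances one state at a time; each step waits until every TiDB
	// server has loaded the new schema version before moving on.
	for s := StateNone; s <= StatePublic; s++ {
		fmt.Println("schema state:", s)
	}
}
```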

Expressions

This article Refactoring Built-in Functions for TiDB describes how to rewrite or add built-in functions for TiDB using the new computing framework.

SQL Optimizer

Most of the SQL optimizer logic is in the plan package. Related modules include statistics and range calculation. Statistics are described in the next chapter. Range calculation is in the ranger package. Refer to the following articles:

Execution Engine

The executor consists of two parts: one on the TiDB side and the other on the TiKV (or Mock-TiKV) side. Mock-TiKV is mainly used for unit testing and partially implements TiKV logic.

This part strictly processes data according to the requirements of physical operators to produce results. The article MPP and SMP in TiDB introduces some of the executor’s architecture. For source code analysis, refer to:

TiKV Client

The TiKV Client is the module in TiDB responsible for interacting with TiKV, including two-phase commit and Coprocessor interactions. Currently, there are two versions of the TiKV Client: one in Golang, located in store/tikv.

Here are two simple examples of how to call the TiKV client’s KV interface: benchkv, benchrawkv.

Another version is in Java, located in tikv-client-lib-java, mainly used for the TiSpark project.

Rust-client is in client-rust, currently not very feature-rich, but worth trying.

Source Code Analysis:
Distributed Transactions

TiDB supports distributed transactions, based on the Percolator model, an optimized 2PC algorithm. For TiDB’s implementation, refer to Transaction in TiDB. The original Percolator is an optimistic transaction algorithm. In version 3.0, TiDB introduced the pessimistic transaction (experimental) feature. For implementation details, refer to New Features in TiDB: Pessimistic Transactions.

For TiDB’s transaction logic, refer to TiDB Source Code Reading Series (19): tikv-client (Part 2).
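
To make the two-phase flow concrete, here is a heavily simplified, hypothetical sketch of a Percolator-style commit: prewrite every key with a lock pointing at a chosen primary key, then commit the primary at a new commit timestamp, after which the secondaries can be committed asynchronously. The function names and types are invented for illustration and do not match the actual tikv-client code.

```go
package main

import "fmt"

// Mutation is one key/value write buffered in a transaction (illustrative only).
type Mutation struct {
	Key, Value string
}

// prewrite and commit stand in for RPCs to TiKV; here they just print what would happen.
func prewrite(m Mutation, primary string, startTS uint64) error {
	fmt.Printf("prewrite %q (lock -> primary %q) @ start_ts %d\n", m.Key, primary, startTS)
	return nil // a real implementation fails on write conflicts or existing locks
}

func commit(key string, startTS, commitTS uint64) {
	fmt.Printf("commit %q: replace lock with write record @ commit_ts %d (start_ts %d)\n", key, commitTS, startTS)
}

func twoPhaseCommit(muts []Mutation, startTS, commitTS uint64) error {
	primary := muts[0].Key // the primary key's lock decides the whole transaction's fate

	// Phase 1: prewrite all keys; any failure aborts the transaction.
	for _, m := range muts {
		if err := prewrite(m, primary, startTS); err != nil {
			return err
		}
	}

	// Phase 2: committing the primary is the atomic commit point;
	// secondaries can then be committed asynchronously.
	commit(primary, startTS, commitTS)
	for _, m := range muts[1:] {
		commit(m.Key, startTS, commitTS)
	}
	return nil
}

func main() {
	muts := []Mutation{{"k1", "v1"}, {"k2", "v2"}}
	_ = twoPhaseCommit(muts, 100, 110) // in TiDB, both timestamps come from PD's TSO
}
```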

TiDB Read/Write Code Main Process

https://asktug.com/t/topic/752793

Common Issues with TiDB (Local Deployment, Performance Tuning, etc.)
Testing Cluster Performance

How to Test TiDB with Sysbench

How to Perform TPC-C Testing on TiDB

TiDB/TiKV Performance Tuning

Understanding TiDB Execution Plans introduces how to use the EXPLAIN statement to understand how TiDB executes a query.

Overview of SQL Optimization Process introduces several optimizations that TiDB can use to improve query performance.

Controlling Execution Plans introduces how to control the generation of execution plans. When TiDB’s execution plan is not optimal, it is recommended to control the execution plan.

TiDB Memory Tuning

TiDB Performance Tuning (Video)

TiKV Thread Pool Performance Tuning

Tuning Tools

To identify TiKV’s performance bottlenecks, we first use profiling tools. Here are two commonly used profiling tools for reference:

2.1.2 TiKV

TiKV is the underlying storage for TiDB. Here is a series of articles that deeply introduce TiKV principles: Deep Dive Into TiKV and TiKV Source Code Analysis Series.

TiKV is internally divided into multiple layers, each with its own functions, arranged from bottom to top:

  • RocksDB
  • Raft
  • Raft KV
  • MVCC
  • TXN KV
  • Coprocessor

RocksDB is a single-node storage engine that stores all data in TiKV and provides a single-node storage KV API. Details.

Raft is a consensus algorithm representing the consistency layer in TiKV, ensuring state consistency among TiKV nodes. It is the cornerstone of TiDB’s high availability. Details.

RaftKV combines RocksDB and Raft, providing a distributed, strongly consistent KV API.

MVCC provides multi-version concurrency control API and transaction API. It achieves multi-version and transactions by encoding keys with timestamps.

TXN KV combines RaftKV and MVCC, providing distributed transactions and multi-version concurrency control. TiDB calls its API.

Coprocessor handles some operators pushed down by TiDB, performing part of the computation logic closer to the data. Coprocessor is above the RaftKV and MVCC layers. TiDB converts queries into a DAG containing pushed-down expressions, and the Coprocessor computes data in TiKV based on these expressions, returning the results to TiDB.
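
To make the MVCC layer’s timestamp-based key encoding more concrete, the sketch below appends a timestamp to each user key so that versions of the same key sort newest-first. This is a simplified illustration only; the real TiKV encoding (memcomparable keys, separate lock/write/default column families) is more involved.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// encodeMVCCKey appends the bitwise-inverted timestamp to the user key so that,
// for one key, larger (newer) timestamps sort first in ascending key order.
func encodeMVCCKey(userKey []byte, ts uint64) []byte {
	buf := make([]byte, len(userKey)+8)
	copy(buf, userKey)
	binary.BigEndian.PutUint64(buf[len(userKey):], ^ts)
	return buf
}

func main() {
	keys := [][]byte{
		encodeMVCCKey([]byte("a"), 100),
		encodeMVCCKey([]byte("a"), 105),
		encodeMVCCKey([]byte("b"), 90),
	}
	sort.Slice(keys, func(i, j int) bool {
		return string(keys[i]) < string(keys[j])
	})
	for _, k := range keys {
		userKey := k[:len(k)-8]
		ts := ^binary.BigEndian.Uint64(k[len(k)-8:])
		fmt.Printf("user key %q, version ts %d\n", userKey, ts)
	}
	// A snapshot read at read_ts seeks to encodeMVCCKey(key, read_ts) and takes
	// the first version whose timestamp is <= read_ts.
}
```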

Placement Driver (PD)

PD is a logical single point in the cluster, similar to master servers or meta servers in many systems. PD’s internal structure is a composite of various functions. For a brief introduction, refer to “Best Practices for PD Scheduling Strategies”. For source code-based introductions, refer to TiKV Source Code Analysis Series - Placement Driver and TiKV Source Code Analysis - PD Scheduler.

embed etcd

etcd is a distributed KV database based on Raft, similar to a simplified version of TiKV with only one Raft group and limited data storage.

PD uses etcd’s embed feature to embed etcd as a library within the same process, binding the HTTP and gRPC services required by PD to the ports listened to by etcd.

Multiple PDs use etcd’s interface to elect a leader to provide services. When the leader fails, other PDs elect a new leader.
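
A minimal sketch of embedding etcd in-process, roughly the way PD uses the embed feature, is shown below. The import path and configuration details vary across etcd releases, so treat them as assumptions.

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

func main() {
	// Start a single etcd member inside this process, as PD does via etcd's embed feature.
	cfg := embed.NewConfig()
	cfg.Dir = "default.etcd" // data directory for the embedded member

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()

	select {
	case <-e.Server.ReadyNotify():
		log.Println("embedded etcd is ready; PD would now expose its own HTTP/gRPC services on the same ports")
	case <-time.After(60 * time.Second):
		e.Server.Stop() // trigger a shutdown if startup takes too long
		log.Fatal("embedded etcd took too long to start")
	}
	// In PD, leader election among multiple PD servers is built on top of this embedded etcd.
}
```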

Meta Management and Routing

PD manages metadata, including global configurations (clusterID, replica count, etc.), TiKV node registration and deregistration, and region creation.

When TiDB needs to access a region’s data, it first queries PD for the region’s status and location. TiKV instances synchronize metadata to PD via heartbeats. PD caches the metadata locally and organizes it into a structure for efficient routing services.

Timestamp Allocation

The distributed transaction model requires globally ordered timestamps. The PD leader is responsible for allocating TS and, with etcd’s help, guarantees that timestamps keep increasing monotonically even across leader switches.
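
The timestamps PD hands out are commonly described as a physical part (wall-clock milliseconds) plus a logical counter packed into one integer. The sketch below shows that packing scheme; the 18-bit logical field matches how TiDB’s TSO is usually documented, but treat the details as illustrative rather than PD’s exact code.

```go
package main

import (
	"fmt"
	"time"
)

const logicalBits = 18 // low bits reserved for the logical counter

// composeTS packs physical milliseconds and a logical counter into one TSO value.
func composeTS(physicalMillis, logical int64) uint64 {
	return uint64(physicalMillis)<<logicalBits | uint64(logical)
}

// extractTS splits a TSO back into its physical and logical parts.
func extractTS(ts uint64) (physicalMillis, logical int64) {
	return int64(ts >> logicalBits), int64(ts & ((1 << logicalBits) - 1))
}

func main() {
	now := time.Now().UnixMilli()
	// Within one millisecond the leader can serve up to 2^18 distinct timestamps by
	// bumping the logical counter; it also pre-allocates a time window in etcd so
	// timestamps stay monotonic across leader switches.
	ts := composeTS(now, 42)
	p, l := extractTS(ts)
	fmt.Printf("ts=%d physical=%d logical=%d\n", ts, p, l)
}
```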

Scheduling

Scheduling mainly involves two aspects: replica management and load balancing. Replica management maintains the desired number of replicas and satisfies constraints like isolating multiple replicas or distributing them to specific namespaces.

Load balancing adjusts region leader or peer positions to balance the load, with various strategies for different business scenarios.

Consistency Protocol (Raft)

“Consistency” is a core issue in distributed systems. Since the Paxos algorithm was proposed in 1990, message-passing consistency protocols have become mainstream, and Raft is one of them. In a Raft cluster, each node is in one of three states: Leader, Follower, or Candidate. A node starts in the Follower state, and a new Leader is chosen through Leader Election. Raft uses the notion of a term to avoid split-brain scenarios, ensuring there is only one Leader per term.

Once a Leader is elected, all read and write requests go through it. The write path is called Log Replication: the Leader persists the client’s write to its own log and replicates the log to the other replicas (Followers), and it confirms the write to the client only after receiving confirmation from the majority. Raft then applies the log entries to its state machine, a step called Apply. The same mechanism also handles changes to the Raft cluster configuration, such as adding or removing nodes.

For read requests, classic Raft treats them the same as writes, but optimizations such as read index and read lease are possible.
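
As a tiny illustration of how terms prevent split brain, the sketch below shows the core of a RequestVote handler: a node rejects stale terms, steps down when it sees a newer term, and grants at most one vote per term. This is a teaching sketch, not TiKV’s raft-rs implementation, and it omits the log up-to-date check.

```go
package main

import "fmt"

type node struct {
	currentTerm uint64
	votedFor    string // empty means no vote cast in currentTerm
}

type voteRequest struct {
	term      uint64
	candidate string
}

// handleVote applies Raft's per-term voting rule (log up-to-date checks omitted).
func (n *node) handleVote(req voteRequest) bool {
	if req.term < n.currentTerm {
		return false // stale candidate, reject
	}
	if req.term > n.currentTerm {
		// Newer term: adopt it and forget any vote from the old term.
		n.currentTerm = req.term
		n.votedFor = ""
	}
	if n.votedFor == "" || n.votedFor == req.candidate {
		n.votedFor = req.candidate // at most one vote per term
		return true
	}
	return false
}

func main() {
	n := &node{currentTerm: 3}
	fmt.Println(n.handleVote(voteRequest{term: 4, candidate: "A"})) // true: first vote in term 4
	fmt.Println(n.handleVote(voteRequest{term: 4, candidate: "B"})) // false: already voted for A
	fmt.Println(n.handleVote(voteRequest{term: 2, candidate: "C"})) // false: stale term
}
```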

TiKV uses the Raft protocol to encapsulate a KV engine on top of RocksDB. This engine synchronizes writes to multiple replicas, ensuring data consistency and availability even if some nodes fail.

Storage Engine (LSM Tree & RocksDB)

LSM tree stands for Log-Structured Merge tree, a data structure optimized for fast writes. All writes, including inserts, updates, and deletes, are append-only, trading off some read performance. This trade-off is based on the fact that sequential writes on mechanical disks are much faster than random writes. On SSDs, the difference between sequential and random writes is less significant, so the advantage is less pronounced.

An LSM tree typically consists of a write-ahead log (WAL), memtable, and SST files. The WAL ensures data is not lost, with writes first appended to the WAL. After writing to the WAL, the data is written to the memtable, usually implemented as a skip list for cache optimization and lock-free concurrent modifications. Once the memtable reaches a certain size, its contents are flushed to disk as SSTables. SSTables are compacted in the background to remove obsolete data.
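
Here is a toy sketch of the write path just described (WAL append, then memtable insert, then a flush when the memtable grows too large). A real engine uses a skip list, proper SSTable files, and background compaction, so this is only meant to fix the ideas.

```go
package main

import "fmt"

// toyLSM illustrates the LSM write path: append to a WAL, update an in-memory
// table, and flush the memtable to an immutable sorted "SSTable" when it fills up.
type toyLSM struct {
	wal      []string          // stand-in for the write-ahead log file
	memtable map[string]string // a real engine would use a skip list
	sstables []map[string]string
	limit    int
}

func (db *toyLSM) put(key, value string) {
	db.wal = append(db.wal, key+"="+value) // 1. persist intent to the WAL first
	db.memtable[key] = value               // 2. then apply to the memtable
	if len(db.memtable) >= db.limit {      // 3. flush when the memtable is full
		db.sstables = append(db.sstables, db.memtable)
		db.memtable = map[string]string{}
		db.wal = db.wal[:0] // flushed data no longer needs the old WAL entries
	}
}

// get checks the memtable first, then SSTables from newest to oldest.
func (db *toyLSM) get(key string) (string, bool) {
	if v, ok := db.memtable[key]; ok {
		return v, true
	}
	for i := len(db.sstables) - 1; i >= 0; i-- {
		if v, ok := db.sstables[i][key]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	db := &toyLSM{memtable: map[string]string{}, limit: 2}
	db.put("a", "1")
	db.put("b", "2") // triggers a flush
	db.put("a", "3") // newer version shadows the flushed one
	v, _ := db.get("a")
	fmt.Println(v) // 3
}
```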

LevelDB is a typical LSM tree implementation open-sourced by Google. RocksDB is a high-performance single-node KV store developed by Facebook based on LevelDB, adding advanced features like column families, delete range, multi-threaded compaction, prefix seek, and user-defined properties, significantly improving functionality and performance.

In the RocksDB (LevelDB) architecture, when the memtable is full it is converted into an immutable memtable that waits to be dumped to disk.

(Figure: RocksDB (LevelDB) Architecture)

TiKV uses RocksDB for single-node persistent storage. rust-rocksdb is a Rust wrapper for RocksDB, allowing TiKV to operate RocksDB.

Since RocksDB is an optimized version of LevelDB, most LevelDB features are retained. Reading LevelDB-related materials helps understand RocksDB.

Reference:

LevelDB is a KV storage engine open-sourced by Google’s legendary engineers Jeff Dean and Sanjay Ghemawat. Its design and code are exquisite and elegant, worth savoring. The following blogs introduce LevelDB’s design and code details, covering design ideas, overall structure, read/write processes, and compaction processes for a comprehensive understanding of LevelDB.

https://draveness.me/bigtable-leveldb

LSM Tree has write amplification and read amplification issues. On SSDs, reducing write amplification improves write efficiency. Several papers optimize this:

Network Communication (gRPC)

gRPC

| username: TiDBer_jYQINSnf | Original post link

Why can I mark the best answer if it’s not my question? Is it a bug? But this answer from my cousin is indeed very comprehensive. :grin:

| username: Billmay表妹 | Original post link

Congratulations on reaching the level to unlock the “Best Answer” marking!
This is a special privilege for long-term community helpers~
You can also participate in community management now~

| username: TiDBer_jYQINSnf | Original post link

I just genuinely wanted to give a thumbs up, and it actually worked :grin:. What other hidden skills do I have :smile:

| username: Billmay表妹 | Original post link

Waiting for you to explore~

| username: 程序员小王 | Original post link

In 2020, I took part in writing a book on TiDB distributed systems and in a performance competition. Then I was busy looking for a job and got sidetracked by other things for two years. I need to catch up; I’ve fallen behind, and so much new material has been produced since then.