Newcomer reporting, newbie seeking help

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 新人报道啦,菜鸟求帮助

| username: msx-yzu

I hope to gain more useful knowledge and skills in the community and learn more good things.

| username: tidb菜鸟一只 | Original post link

Welcome!

| username: msx-yzu | Original post link

Sure, let’s improve together!

| username: Billmay表妹 | Original post link

Welcome, welcome~ If you have any questions, feel free to post them, and everyone will help you take a look~

| username: msx-yzu | Original post link

Okay, thank you, beautiful lady~

| username: Billmay表妹 | Original post link

You can add me on WeChat: billmay. I’ll invite you to join the community group~

| username: msx-yzu | Original post link

Alright, added.

| username: 数据小黑 | Original post link

Welcome! Check out the video courses.

| username: waeng | Original post link

Welcome :clap:t2:

| username: TiDBer_FLjUDGex | Original post link

I hope to gain more useful knowledge and skills in the community.

| username: msx-yzu | Original post link

Sure, I will study diligently.

| username: msx-yzu | Original post link

Thank you~

| username: msx-yzu | Original post link

Let’s improve together!

| username: Billmay表妹 | Original post link

Here is some study material you can check out:

1. TiDB Basic Architecture

1.1 Articles on TiDB Architecture and Basic Principles:

2. Advanced Learning Materials

2.1 Kernel

The TiDB kernel covers all components of TiDB and TiKV. Contributors in this area can work on improving TiDB’s performance, stability, and usability.

2.1.1 TiDB

Module Analysis

TiDB is the SQL layer of the cluster, responsible for communication with clients (protocol layer), syntax parsing (SQL Parser), query optimization (Optimizer), and executing query plans.

Protocol Layer

TiDB’s protocol is compatible with MySQL. For details on the MySQL protocol, refer to MySQL Protocol.
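
Because the protocol is MySQL-compatible, any MySQL client or driver can connect to TiDB directly. A minimal Go sketch using the standard go-sql-driver/mysql driver, assuming a local TiDB instance on the default port 4000 with the default root user and an empty password:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // ordinary MySQL driver; TiDB speaks the MySQL wire protocol
)

func main() {
	// Default TiDB port is 4000; credentials here are the out-of-the-box defaults and may differ.
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var version string
	if err := db.QueryRow("SELECT version()").Scan(&version); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected, server version:", version)
}
```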

SQL Parser

TiDB’s Parser is divided into two parts: the Lexer, written manually in Golang, and the Parser, implemented using goyacc. For more details, read TiDB Source Code Reading Series (5): Implementation of TiDB SQL Parser. For a deeper understanding of yacc syntax, read “flex & bison”.
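
To get a feel for what the Parser produces, the sketch below parses one statement into an AST. It assumes the standalone parser module github.com/pingcap/tidb/parser and its test_driver; the module path has moved between TiDB versions, so treat the imports as an assumption:

```go
package main

import (
	"fmt"
	"log"

	"github.com/pingcap/tidb/parser"
	_ "github.com/pingcap/tidb/parser/test_driver" // value/expression driver for standalone parser use
)

func main() {
	p := parser.New()
	// ParseOneStmt returns the AST of a single statement; empty charset/collation fall back to defaults.
	stmt, err := p.ParseOneStmt("SELECT id, name FROM t WHERE id > 1", "", "")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("parsed statement node: %T\n", stmt)
}
```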

Schema Management & DDL

TiDB implements Google F1’s online schema change algorithm. For more details, refer to TiDB’s asynchronous schema change implementation. For source code analysis, refer to TiDB Source Code Reading Series (17): DDL Source Code Analysis.
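
The key idea of the F1 algorithm is that a schema object moves through a sequence of intermediate states, any two adjacent states are mutually compatible, and different TiDB servers are never more than one state apart. A conceptual Go sketch of that progression (state names follow the paper’s terminology, not TiDB’s source):

```go
package main

import "fmt"

// SchemaState models the intermediate states used in Google F1's online schema change.
// Adjacent states are compatible, so servers holding either of two neighbouring
// states can coexist while a DDL job advances one step at a time.
type SchemaState int

const (
	StateNone       SchemaState = iota // object does not exist yet
	StateDeleteOnly                    // deletes are applied, reads and inserts are not
	StateWriteOnly                     // writes are applied, but the object is invisible to reads
	StateWriteReorg                    // backfill/reorganization of existing data
	StatePublic                        // fully visible and usable
)

var stateNames = [...]string{"none", "delete only", "write only", "write reorg", "public"}

func (s SchemaState) String() string { return stateNames[s] }

func main() {
	// A DDL job (e.g. ADD INDEX) advances through the states one at a time.
	for s := StateNone; s <= StatePublic; s++ {
		fmt.Println("schema state:", s)
	}
}
```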

Expressions

This article Refactoring Built-in Functions for TiDB describes how to rewrite or add built-in functions for TiDB using the new computing framework.

SQL Optimizer

Most of the SQL optimizer logic is in the plan package. Related modules include statistics and range calculation. Statistics are described in the next section. Range calculation is in the ranger package. Refer to the following articles:

Execution Engine

The executor consists of two parts: one part on the TiDB side and the other part in TiKV (or Mock-TiKV). Mock-TiKV is mainly used for unit testing and partially implements TiKV logic.

This part processes data strictly according to the requirements of physical operators and produces results. The article MPP and SMP in TiDB introduces some of the executor’s architecture. For source code analysis, refer to:
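
To make the operator model concrete, here is a much-simplified sketch of the pull-based (“Volcano”-style) iterator pattern that an executor tree follows; the interfaces and types here are invented for illustration and do not mirror TiDB’s actual code:

```go
package main

import "fmt"

// Executor is a simplified Volcano-style operator: Open prepares resources,
// Next returns the next batch of rows (nil when exhausted), Close releases resources.
type Executor interface {
	Open() error
	Next() ([][]any, error)
	Close() error
}

// mockExec stands in for a child operator (e.g. a table scan): it returns one fixed batch.
type mockExec struct{ done bool }

func (m *mockExec) Open() error  { return nil }
func (m *mockExec) Close() error { return nil }
func (m *mockExec) Next() ([][]any, error) {
	if m.done {
		return nil, nil
	}
	m.done = true
	return [][]any{{1}, {2}, {3}}, nil
}

// SelectionExec filters rows pulled from its child, mimicking a Selection operator.
type SelectionExec struct {
	child  Executor
	filter func(row []any) bool
}

func (e *SelectionExec) Open() error  { return e.child.Open() }
func (e *SelectionExec) Close() error { return e.child.Close() }
func (e *SelectionExec) Next() ([][]any, error) {
	batch, err := e.child.Next()
	if err != nil || batch == nil {
		return nil, err
	}
	out := make([][]any, 0, len(batch))
	for _, row := range batch {
		if e.filter(row) {
			out = append(out, row)
		}
	}
	return out, nil
}

func main() {
	sel := &SelectionExec{
		child:  &mockExec{},
		filter: func(row []any) bool { return row[0].(int) > 1 },
	}
	_ = sel.Open()
	batch, _ := sel.Next()
	fmt.Println(batch) // [[2] [3]]
	_ = sel.Close()
}
```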

TiKV Client

The TiKV Client is the module in TiDB responsible for interacting with TiKV, including two-phase commit and Coprocessor interactions. Currently, there are two language versions of the TiKV Client: the Golang version, located in store/tikv, and the Java version, located in tikv-client-lib-java, mainly used for the TiSpark project. The Rust-client is in client-rust, which is not very feature-rich yet but worth trying.
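
As a taste of the Golang client, here is a sketch of a simple transactional write and read, assuming the github.com/tikv/client-go/v2/txnkv package and a PD endpoint at 127.0.0.1:2379; exact APIs differ between client versions, so treat the calls as illustrative:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/tikv/client-go/v2/txnkv"
)

func main() {
	// Connect through PD, which routes requests to the right TiKV regions.
	client, err := txnkv.NewClient([]string{"127.0.0.1:2379"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Begin a transaction, write a key, and commit (two-phase commit under the hood).
	txn, err := client.Begin()
	if err != nil {
		log.Fatal(err)
	}
	if err := txn.Set([]byte("hello"), []byte("tikv")); err != nil {
		log.Fatal(err)
	}
	if err := txn.Commit(context.Background()); err != nil {
		log.Fatal(err)
	}

	// Read it back in a new transaction.
	txn, _ = client.Begin()
	val, err := txn.Get(context.Background(), []byte("hello"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(val))
}
```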

Source Code Analysis:

Distributed Transactions

TiDB supports distributed transactions, based on the Percolator model, an optimized 2PC algorithm. For implementation details, refer to Transaction in TiDB. The original Percolator is an optimistic transaction algorithm. In version 3.0, TiDB introduced the experimental feature of pessimistic transactions, detailed in New Features in TiDB: Pessimistic Transactions.
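
The Percolator flow boils down to two phases: prewrite every key (locking it against the transaction’s start timestamp, with one key designated as the primary), then commit, where committing the primary key is the atomic commit point. A purely conceptual Go sketch of that ordering (none of these functions correspond to TiDB’s real APIs):

```go
package main

import (
	"errors"
	"fmt"
)

// mutation is one key/value to be written by the transaction.
type mutation struct{ key, value string }

// prewrite locks a key at startTS and buffers its value; write conflicts would fail here.
func prewrite(m mutation, primary string, startTS uint64) error {
	fmt.Printf("prewrite %q (primary=%q, start_ts=%d)\n", m.key, primary, startTS)
	return nil
}

// commit makes a prewritten key visible at commitTS and releases its lock.
func commit(key string, startTS, commitTS uint64) error {
	fmt.Printf("commit %q (commit_ts=%d)\n", key, commitTS)
	return nil
}

// twoPhaseCommit shows the ordering Percolator relies on: prewrite everything,
// then commit the primary key first -- that single write is the atomic commit point.
// Secondaries can be committed lazily afterwards (or rolled forward by readers).
func twoPhaseCommit(muts []mutation, startTS, commitTS uint64) error {
	if len(muts) == 0 {
		return errors.New("empty transaction")
	}
	primary := muts[0].key

	// Phase 1: prewrite primary and secondaries.
	for _, m := range muts {
		if err := prewrite(m, primary, startTS); err != nil {
			return err // abort: roll back any prewritten keys
		}
	}
	// Phase 2: committing the primary decides the transaction's fate.
	if err := commit(primary, startTS, commitTS); err != nil {
		return err
	}
	for _, m := range muts[1:] {
		_ = commit(m.key, startTS, commitTS) // best-effort; recoverable from the primary
	}
	return nil
}

func main() {
	_ = twoPhaseCommit([]mutation{{"a", "1"}, {"b", "2"}}, 100, 101)
}
```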

For TiDB-side transaction logic, refer to TiDB Source Code Reading Series (19): tikv-client (Part 2).

TiDB Read/Write Code Main Process

https://asktug.com/t/topic/752793

Common Issues with TiDB (Local Deployment, Performance Tuning, etc.)

Testing Cluster Performance

How to Test TiDB with Sysbench

How to Perform TPC-C Testing on TiDB

TiDB/TiKV Performance Tuning

Understanding TiDB Execution Plans explains how to use the EXPLAIN statement to understand how TiDB executes a query.

Overview of SQL Optimization Process introduces several optimizations that TiDB can use to improve query performance.

Controlling Execution Plans explains how to control the generation of execution plans. If TiDB’s execution plan is not optimal, it is recommended to control the execution plan.
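
A quick way to inspect and nudge a plan from application code is to run EXPLAIN and, if necessary, attach an optimizer hint. A small sketch using Go’s database/sql; the table t and index idx_name are hypothetical:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// EXPLAIN shows the operator tree TiDB would use; the USE_INDEX hint nudges
	// the optimizer toward a specific index (table and index names are hypothetical).
	rows, err := db.Query("EXPLAIN SELECT /*+ USE_INDEX(t, idx_name) */ * FROM t WHERE name = 'foo'")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, _ := rows.Columns()
	fmt.Println(cols) // e.g. id, estRows, task, access object, operator info
	for rows.Next() {
		vals := make([]sql.RawBytes, len(cols))
		ptrs := make([]any, len(cols))
		for i := range vals {
			ptrs[i] = &vals[i]
		}
		if err := rows.Scan(ptrs...); err != nil {
			log.Fatal(err)
		}
		for _, v := range vals {
			fmt.Printf("%s\t", v)
		}
		fmt.Println()
	}
}
```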

TiDB Memory Tuning

TiDB Performance Tuning (Video)

TiKV Thread Pool Performance Tuning

Tuning Tools

To identify TiKV performance bottlenecks, we first need to use profiling tools. Here are two commonly used profiling tools for reference:

2.1.2 TiKV

TiKV is the underlying storage for TiDB. Here is a series of articles that provide an in-depth introduction to TiKV principles: Deep Dive Into TiKV and TiKV Source Code Analysis Series.

TiKV can be divided into multiple layers, each with its own functions, arranged from bottom to top:

  • RocksDB
  • Raft
  • Raft KV
  • MVCC
  • TXN KV
  • Coprocessor

RocksDB is a single-node storage engine that stores all data in TiKV and provides a single-node storage KV API. For more details.

Raft is a consensus algorithm that represents the consistency layer in TiKV, ensuring that the states of different TiKV nodes are consistent. It is responsible for copying data between TiKV replicas and is the cornerstone of TiDB’s high availability. For more details.

RaftKV combines RocksDB and Raft to provide a distributed strongly consistent basic KV API.

MVCC, as the name suggests, provides a multi-version concurrency control API and a transaction API. The MVCC layer achieves multi-versioning and transactions by encoding timestamps into keys.
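
Conceptually, each version of a key is stored under the user key plus an encoded timestamp, ordered so that a forward scan finds the newest version first. A simplified Go sketch of that idea (the real TiKV encoding also makes keys memcomparable, so treat this as illustrative):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeKey appends the bitwise complement of the version timestamp (big-endian)
// to the user key, so that for the same user key a newer version sorts earlier
// and a forward scan hits the latest visible version first.
func encodeKey(userKey []byte, ts uint64) []byte {
	buf := make([]byte, 0, len(userKey)+8)
	buf = append(buf, userKey...)
	var tsBytes [8]byte
	binary.BigEndian.PutUint64(tsBytes[:], ^ts) // complement reverses the timestamp order
	return append(buf, tsBytes[:]...)
}

func main() {
	older := encodeKey([]byte("user1"), 100)
	newer := encodeKey([]byte("user1"), 200)
	// The newer version sorts before the older one for the same user key.
	fmt.Println(bytes.Compare(newer, older) < 0) // true
}
```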

TXN KV combines RaftKV and MVCC to provide distributed transactions and multi-version concurrency control. TiDB calls its API.

The Coprocessor is responsible for processing some operators pushed down by TiDB, handling part of the computing logic closer to the data. The Coprocessor is above the RaftKV and MVCC layers. TiDB converts queries into a DAG, which contains pushed-down expressions. The Coprocessor calculates the data in TiKV based on the expressions and returns the results to TiDB.

Placement Driver (PD)

PD is a logical single point in the cluster, similar to the master server or meta server in many systems. PD’s internal structure is a composite of multiple functions. For a brief introduction, refer to “Best Practices for PD Scheduling Strategies”. For a source code-based introduction, refer to TiKV Source Code Analysis Series - Placement Driver and TiKV Source Code Analysis - PD Scheduler.

embed etcd

etcd is a distributed KV database based on Raft, which can be considered a simplified version of TiKV, with only one Raft group and limited data storage.

PD uses etcd’s embed feature to embed etcd as a library into the same process and bind the HTTP and gRPC services required by PD to the ports listened to by etcd.

Multiple PDs use etcd’s interface to elect a leader to provide services. When the leader fails, other PDs will elect a new leader to provide services.
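
PD’s leader campaign is built on etcd primitives (leases plus compare-and-swap on a key). The idea can be sketched with etcd’s off-the-shelf concurrency package, assuming an etcd endpoint at 127.0.0.1:2379; PD’s actual election code differs in detail:

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session is backed by a lease; if this process dies, the lease expires
	// and the other candidates can win a new election.
	sess, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// All candidates campaign on the same key prefix; exactly one Campaign call
	// succeeds at a time, and that caller serves as leader until it resigns or its lease expires.
	election := concurrency.NewElection(sess, "/my-service/leader")
	if err := election.Campaign(context.Background(), "candidate-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("I am the leader now")
}
```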

Meta Management and Routing

PD manages metadata, including global configurations (clusterID, replica count, etc.), TiKV node registration and deregistration, and region creation.

When TiDB needs to access a region’s data, it first queries the region’s status and location from PD. Each TiKV instance synchronizes metadata to PD through heartbeats. PD caches the metadata locally and organizes it into a structure that is convenient for querying to provide routing services.

Timestamp Allocation

The distributed transaction model requires globally ordered timestamps. The PD leader is responsible for allocating TS, ensuring that TS remains monotonically increasing even in the event of leader switches.
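
Concretely, a TSO is a 64-bit integer that packs a physical wall-clock part (milliseconds) with a logical counter in the low bits, so many timestamps can be issued within one millisecond while the whole value stays strictly increasing. A small sketch of that composition; the 18-bit logical field matches common descriptions of TiDB’s TSO, but treat the exact layout as an assumption:

```go
package main

import (
	"fmt"
	"time"
)

const logicalBits = 18 // low bits reserved for the logical counter (assumed layout)

// composeTS packs a physical millisecond timestamp and a logical counter into one 64-bit TSO.
func composeTS(physicalMs int64, logical int64) uint64 {
	return uint64(physicalMs)<<logicalBits | uint64(logical)
}

func main() {
	now := time.Now().UnixMilli()
	ts1 := composeTS(now, 1)
	ts2 := composeTS(now, 2) // same millisecond, still strictly larger
	fmt.Println(ts1 < ts2)   // true: TSOs are totally ordered
}
```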

Scheduling

Scheduling is divided into two aspects: replica management (replica placement) and load balancing (load rebalance). Replica management maintains the desired number of replicas (e.g., adding new replicas when a host fails) and meets certain constraints, such as ensuring multiple replicas are isolated or distributed to specific namespaces.

Load balancing adjusts the positions of region leaders or peers to balance the load. Various strategies are implemented to adapt to different business scenarios.

Consistency Protocol (Raft)

“Consistency” is a core issue in distributed systems. Since the proposal of the Paxos algorithm in 1990, message-passing-based consistency protocols have become mainstream, with Raft being one of them. Each node in a Raft cluster can be in one of three states: Leader, Follower, or Candidate. When a node starts, it begins in the Follower state, and the cluster chooses a Leader through Leader Election. Raft uses the concept of terms to avoid split-brain scenarios, ensuring that there is at most one Leader in any term.

After a Leader is elected, all read and write requests to the Raft cluster go through the Leader. Handling a write request is called Log Replication: the Leader persists the client’s write to its own log and replicates the log to the other replicas (Followers). Only after receiving acknowledgments from a majority of replicas does the Leader confirm the write to the client. The log entry is then applied to the state machine, a step called Apply. The same mechanism is used to modify the Raft cluster configuration, such as adding or removing nodes.

For read requests, the classic Raft implementation handles them the same way as write requests, but optimizations such as read index and read lease can be applied.
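
A minimal sketch of the role and term bookkeeping described above, showing why at most one Leader can be elected per term (purely conceptual, and it omits the log-recency check a real Raft vote also requires):

```go
package main

import "fmt"

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Node keeps the per-node Raft state relevant to elections.
type Node struct {
	id       int
	role     Role
	term     uint64
	votedFor int // -1 means no vote granted in the current term
}

// handleVoteRequest grants at most one vote per term, and only to candidates
// whose term is at least as new as ours -- this is what prevents two Leaders in one term.
func (n *Node) handleVoteRequest(candidateID int, candidateTerm uint64) bool {
	if candidateTerm < n.term {
		return false // stale candidate
	}
	if candidateTerm > n.term {
		// Newer term observed: step down and reset our vote.
		n.term = candidateTerm
		n.role = Follower
		n.votedFor = -1
	}
	if n.votedFor == -1 || n.votedFor == candidateID {
		n.votedFor = candidateID
		return true
	}
	return false
}

func main() {
	n := &Node{id: 1, role: Follower, term: 3, votedFor: -1}
	fmt.Println(n.handleVoteRequest(2, 4)) // true: first vote granted in term 4
	fmt.Println(n.handleVoteRequest(3, 4)) // false: already voted for node 2 in term 4
}
```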

TiKV uses the Raft protocol to encapsulate a KV engine on top of RocksDB. This engine synchronizes writes to multiple replicas, ensuring data consistency and availability even if some nodes fail.

Storage Engine (LSM Tree & RocksDB)

LSM tree stands for Log-Structured Merge tree, a data structure optimized for fast writes. All writes, including inserts, updates, and deletes, are append-only, sacrificing some read performance. This trade-off is based on the fact that sequential writes on mechanical hard drives are much faster than random writes. On SSDs, the difference between sequential and random writes is less significant, so the advantage is not as pronounced.

An LSM tree typically consists of a write-ahead log (WAL), memtable, and SST files. The WAL ensures that written data is not lost. Write operations first append the data to the WAL. After writing to the WAL, the data is written to the memtable, usually implemented as a skip list for cache optimization and lock-free concurrent modifications. When the memtable reaches a certain size, its contents are flushed to disk as SSTable files. SSTables are compacted in the background to remove obsolete data.
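
A toy sketch of that write path: append to the WAL, insert into the memtable, and flush to an immutable “SSTable” once the memtable is full; real engines add compaction, bloom filters, and much more:

```go
package main

import (
	"fmt"
	"sort"
)

// lsm is a toy LSM tree: a write-ahead log, an in-memory memtable,
// and a list of immutable "SSTables" flushed from the memtable.
type lsm struct {
	wal       []string          // append-only log of operations (stands in for an on-disk WAL)
	memtable  map[string]string // a sorted structure (e.g. a skip list) in real engines
	sstables  []map[string]string
	threshold int
}

func (l *lsm) put(key, value string) {
	// 1. Append to the WAL first so the write survives a crash.
	l.wal = append(l.wal, "PUT "+key+"="+value)
	// 2. Apply to the memtable.
	l.memtable[key] = value
	// 3. Flush when the memtable is full.
	if len(l.memtable) >= l.threshold {
		l.flush()
	}
}

// flush turns the memtable into an immutable SSTable (written sorted to disk in real engines).
func (l *lsm) flush() {
	keys := make([]string, 0, len(l.memtable))
	for k := range l.memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys) // SSTables are sorted by key
	l.sstables = append(l.sstables, l.memtable)
	l.memtable = map[string]string{}
	fmt.Println("flushed SSTable with keys:", keys)
}

func main() {
	db := &lsm{memtable: map[string]string{}, threshold: 2}
	db.put("a", "1")
	db.put("b", "2") // triggers a flush
}
```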

LevelDB is a typical LSM tree implementation open-sourced by Google. RocksDB is a high-performance single-node KV store developed by Facebook based on LevelDB. RocksDB adds advanced features like column families, delete range, multi-threaded compaction, prefix seek, and user-defined properties, significantly improving functionality and performance.

In the RocksDB (LevelDB) architecture, a memtable that fills up is converted into an immutable memtable and dumped to disk.

[Figure: RocksDB (LevelDB) Architecture]

TiKV uses RocksDB for single-node persistent storage. rust-rocksdb is a Rust wrapper for RocksDB, allowing TiKV to operate RocksDB.

Since RocksDB is an optimized version of LevelDB, most LevelDB features are retained. Reading LevelDB-related materials is helpful for understanding RocksDB.

Reference:

LevelDB is a KV storage engine open-sourced by Google’s legendary engineers Jeff Dean and Sanjay Ghemawat. Its design and code are exquisite and elegant, worth savoring. The following blogs will introduce LevelDB’s design and code details from various aspects, including design ideas, overall structure, read/write processes, and compaction processes, providing a comprehensive understanding of LevelDB.

https://draveness.me/bigtable-leveldb

LSM Tree has write amplification and read amplification issues. On SSDs, reducing write amplification can improve write efficiency. Several papers optimize this:

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.