TiDB SQL Write Process

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB SQL 写流程

| username: Billmay表妹

By Jinlong Liu
ps: The content is slightly outdated; refer to it for the general process and note the corrections.

Overview

  1. The client initiates a transaction begin to TiDB.

  2. TiDB requests TSO from PD as the start_ts of the current transaction.

This part of the process can refer to the read process.

  1. TiDB receives the client's request and performs MySQL protocol parsing and conversion;

  2. The text is parsed into structured data, i.e., an Abstract Syntax Tree (AST). Once the AST is obtained, various validations (preprocessing, legality validation, permission checks), transformations, and optimizations can be performed. These two steps are the parse and compile processes. Asynchronously, it also interacts with PD to get the start_ts information;

  3. TiDB retrieves information_schema from its cache (loaded at startup); if it is not cached, TiDB retrieves information_schema from TiKV.

  4. TiDB gets the metadata of the table being operated on by the current user from information_schema.

  5. After the execution plan is ready and the TSO has been obtained, TiDB organizes the key range into TiKV's KeyRange based on the table's metadata.

  6. TiDB retrieves the regions information for each KeyRange from the Region Cache or PD.

  7. TiDB groups the KeyRange based on the regions information.

[TiKV]

  1. gRPC request (server.grpc-concurrency).

  2. TiDB concurrently distributes select requests to all TiKV nodes corresponding to the regions.

  3. TiKV filters the data based on TiDB's request and returns it to TiDB (readpool.storage, readpool.coprocessor).

[TiDB]

  1. TiDB organizes the data after receiving all results.

  3. The client sends the following requests to TiDB:

Read operation: read the specific data corresponding to version start_ts from TiKV.

Write operation: write into memory (before the transaction commits, all update/delete operations are buffered on the client side).

  4. The client initiates a commit transaction request to TiDB.

  5. TiDB starts two-phase commit.

  6. TiDB groups the data to be written according to the region.

  7. TiDB starts the prewrite operation: it concurrently sends prewrite requests to all regions involved in the changes. If any prewrite fails, the handling depends on the error type (a Go sketch of this commit flow follows the overview):

KeyIsLocked: attempt to resolve the lock; if successful, retry prewrite for the current region (step 7). Otherwise, re-acquire TSO as start_ts and restart the 2PC commit (step 5).

WriteConflict: another transaction is writing the current key; re-acquire TSO as start_ts and restart the 2PC commit (step 5).

Other errors: return failure to the client.

  8. TiDB requests TSO from PD as the commit_ts of the current transaction.

  9. TiDB starts the commit: it initiates the commit to the region where the primary key is located. If committing the primary fails, TiDB first executes rollback keys and then decides whether to retry based on the error:

LockNotExist: re-acquire TSO as start_ts and restart the 2PC commit (step 5).

Other errors: return failure to the client.

  10. TiDB asynchronously and concurrently initiates commit requests to TiKV for the remaining regions.

  11. TiDB returns transaction commit success information to the client.
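The commit half of this flow (steps 5-11) can be pictured with a short Go sketch of the retry logic. Every function and error name in it (prewriteRegion, commitPrimary, errKeyIsLocked, and so on) is an illustrative placeholder rather than TiDB's actual internal API.

```go
// Sketch of the 2PC retry logic in steps 5-11 above. All names
// (prewriteRegion, commitPrimary, errKeyIsLocked, ...) are placeholders.
package main

import "errors"

var (
	errKeyIsLocked   = errors.New("KeyIsLocked")
	errWriteConflict = errors.New("WriteConflict")
	errLockNotExist  = errors.New("LockNotExist")
)

type region struct{ id uint64 }

func twoPhaseCommit(regions []region, startTS uint64) error {
	// Step 7: prewrite all involved regions (really done concurrently;
	// shown sequentially here for brevity).
	for _, r := range regions {
		for {
			err := prewriteRegion(r, startTS)
			if err == nil {
				break
			}
			switch {
			case errors.Is(err, errKeyIsLocked):
				if resolveLock(r) {
					continue // retry prewrite of this region (step 7)
				}
				return restartWithNewStartTS() // back to step 5 with a new start_ts
			case errors.Is(err, errWriteConflict):
				return restartWithNewStartTS() // another txn is writing this key
			default:
				return err // other errors: report failure to the client
			}
		}
	}

	// Step 8: fetch commit_ts from PD; step 9: commit the primary key's region.
	commitTS := getTSOFromPD()
	if err := commitPrimary(regions[0], commitTS); err != nil {
		rollbackKeys(regions, startTS)
		if errors.Is(err, errLockNotExist) {
			return restartWithNewStartTS()
		}
		return err
	}

	// Step 10: commit the remaining regions asynchronously; step 11: report success.
	go commitSecondaries(regions[1:], commitTS)
	return nil
}

// Stubs so the sketch is self-contained.
func prewriteRegion(r region, startTS uint64) error { return nil }
func resolveLock(r region) bool                     { return true }
func restartWithNewStartTS() error                  { return nil } // re-acquire start_ts, rerun 2PC
func getTSOFromPD() uint64                          { return 1 }
func commitPrimary(r region, ts uint64) error       { return nil }
func rollbackKeys(rs []region, ts uint64)           {}
func commitSecondaries(rs []region, ts uint64)      {}

func main() {
	_ = twoPhaseCommit([]region{{id: 1}, {id: 2}}, 42)
}
```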

1. TiDB Layer

1. Protocol Layer Entry

After establishing a connection with the client, TiDB runs a goroutine that listens on the connection, waits for packets from the client, and processes them. clientConn.Run() continuously reads network packets in a loop and then calls the dispatch() method to process the received request, invoking the corresponding handler function based on the Command type. The most commonly used Command is COM_QUERY; most SQL statements are sent as COM_QUERY as long as they do not use the Prepared protocol. For Command Query, the payload sent by the client is mainly SQL text, and the handler function is handleQuery().
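A minimal sketch of that read-and-dispatch loop, keeping the function names mentioned above (Run, dispatch, handleQuery) but with simplified placeholder types and packet handling:

```go
// Simplified sketch of the protocol-layer loop described above. The packet
// format and surrounding types are placeholders, not TiDB's real ones.
package main

const comQuery = 0x03 // MySQL COM_QUERY command byte

type clientConn struct{}

func (cc *clientConn) Run() {
	for {
		data, err := cc.readPacket() // block until the client sends a packet
		if err != nil {
			return
		}
		if err := cc.dispatch(data); err != nil {
			cc.writeError(err)
		}
	}
}

func (cc *clientConn) dispatch(data []byte) error {
	cmd, payload := data[0], data[1:]
	switch cmd {
	case comQuery: // plain SQL text (non-prepared statements)
		return cc.handleQuery(string(payload))
	default:
		return nil // other commands (COM_STMT_PREPARE, COM_PING, ...) omitted
	}
}

// Stubs so the sketch is self-contained.
func (cc *clientConn) readPacket() ([]byte, error)  { return []byte{comQuery, 'S'}, nil }
func (cc *clientConn) handleQuery(sql string) error { return nil }
func (cc *clientConn) writeError(err error)         {}

func main() {
	cc := &clientConn{}
	_ = cc.dispatch([]byte{comQuery, 'S', 'E', 'L', 'E', 'C', 'T', ' ', '1'})
}
```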

By default, the connection limit is 1000, controlled by the token-limit parameter. If this parameter is set too low, creating new connections will take longer. Normally, “Get Token Duration” should be less than 2us.

Note: In OLTP scenarios, it is recommended that the number of connections should be less than 500, and the even distribution of connections among different TiDB servers should be closely monitored.

Note: In OLTP scenarios, it is recommended that 99.9% of query latency be less than 500 ms. For heap memory: when the local latch is enabled, the memory usage of each TiDB server should be less than 3 GB; if disabled, it should be less than 1 GB.

2. Parser Module (Lexer & Yacc)

TiDB uses a lexical analyzer (lex) together with a syntax analyzer (yacc) to generate a tree structure. These two components together form the Parser module. By calling the Parser, the SQL text is parsed into structured data, i.e., an Abstract Syntax Tree (AST).
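As an illustration, the standalone TiDB parser can be called directly to turn SQL text into an AST. The sketch below assumes the older github.com/pingcap/parser module and its test_driver; newer TiDB versions ship the parser under a different import path, so treat the imports as version-dependent.

```go
// Minimal sketch: parse SQL text into an AST with the TiDB parser.
package main

import (
	"fmt"

	"github.com/pingcap/parser"
	_ "github.com/pingcap/parser/test_driver" // registers the value-expression driver the parser needs
)

func main() {
	p := parser.New()
	stmts, _, err := p.Parse("SELECT c1 FROM t WHERE c2 > 1", "", "")
	if err != nil {
		panic(err)
	}
	for _, stmt := range stmts {
		fmt.Printf("%T\n", stmt) // e.g. *ast.SelectStmt
	}
}
```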

Monitoring Explanation:

Grafana Metrics provides the time consumed for SQL parsing into AST, which should normally be less than 10ms.

The average parsing time is: tidb_session_parse_duration_seconds_sum{sql_type="general"} / tidb_session_parse_duration_seconds_count{sql_type="general"}

3. Compile Module

After obtaining the AST, TiDB processes it through the buildSelect() function, and then the AST becomes a tree structure of a Plan.

Once the AST is obtained, various validations (preprocessing, legality validation, permission checks), transformations, and optimizations can be performed. The entry point for this series of actions is the Compile function, and its three most important steps are (see the sketch after the list below):

  • plan.Preprocess: Perform some legality checks and name binding;
  • plan.Optimize: Formulate and optimize the query plan, which is one of the core steps. There is a special function, TryFastPlan; if its rules determine that the query is a PointGet, it skips the subsequent optimization and goes directly to point query execution.
  • Construct the executor.ExecStmt structure: This ExecStmt structure holds the query plan, which is the basis for subsequent execution, especially the Exec method (to be introduced later).
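The sketch referenced above compresses these three steps into one function; the names echo the ones listed, but the types and signatures are simplified placeholders, not TiDB's actual interfaces.

```go
// Sketch of the Compile phase described above: preprocess, optimize, wrap the
// plan in an executable statement. Types and signatures are placeholders.
package main

type (
	StmtNode     struct{} // parsed AST node
	PhysicalPlan struct{} // output of the optimizer
	ExecStmt     struct{ Plan *PhysicalPlan }
)

func compile(stmt *StmtNode) (*ExecStmt, error) {
	if err := preprocess(stmt); err != nil { // name binding, legality checks
		return nil, err
	}
	plan, err := optimize(stmt) // TryFastPlan may short-circuit to a point get
	if err != nil {
		return nil, err
	}
	return &ExecStmt{Plan: plan}, nil // held for the later Exec phase
}

// Stubs so the sketch is self-contained.
func preprocess(stmt *StmtNode) error                { return nil }
func optimize(stmt *StmtNode) (*PhysicalPlan, error) { return &PhysicalPlan{}, nil }

func main() {
	if _, err := compile(&StmtNode{}); err != nil {
		panic(err)
	}
}
```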

Logical optimization mainly involves logical transformations based on the equivalence exchange rules of relational algebra. It consists of a series of optimization rules, which are applied sequentially to the incoming LogicalPlan Tree. For details, see the logicalOptimize() function. TiDB’s logical optimization rules include column pruning, max-min elimination, projection elimination, predicate pushdown, subquery de-correlation, outer join elimination, aggregation elimination, etc.

Physical optimization (dagPhysicalOptimize function), based on the results of logical optimization, considers the distribution of data and decides how to choose physical operators. The optimal execution path is selected, and this selection process depends on statistical information. TiDB’s physical optimization rules mainly optimize data reading, table join methods, table join order, sorting, etc.

Note: For insert statements, refer to planBuilder.buildInsert() (completes object information and processes the data in Lists). For an introduction to the insert logic, see TiDB 源码阅读系列文章(四)Insert 语句概览 | PingCAP.

Monitoring Explanation:

Grafana Metrics provides the time consumed from AST parsing to query plan generation, which should normally be less than 30ms. The average execution plan generation time is: tidb_session_compile_duration_seconds_sum/tidb_session_compile_duration_seconds_count

4. Generate Executor

In this process, TiDB converts the plan into an executor so that the execution engine can execute the query plan set by the optimizer through the executor. For specific code, see ExecStmt.buildExecutor().

5. Run Executor

TiDB constructs all physical Executors into a tree structure, and each layer obtains results by calling the Next/NextChunk() method of the next layer.

For example: SELECT c1 FROM t WHERE c2 > 1;
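For this query the executor tree is roughly Projection(c1) → Selection(c2 > 1) → TableReader(t). Below is a toy, row-at-a-time volcano-model sketch of that pull model; TiDB's real Executor interface is chunk-based (NextChunk), so treat the types here as placeholders.

```go
// Toy volcano-model sketch for SELECT c1 FROM t WHERE c2 > 1:
// each operator pulls rows from its child via Next().
package main

import "fmt"

type row struct{ c1, c2 int }

type executor interface {
	Next() (*row, bool) // returns the next row and whether one was produced
}

// tableReader scans the rows of t.
type tableReader struct {
	rows []row
	pos  int
}

func (t *tableReader) Next() (*row, bool) {
	if t.pos >= len(t.rows) {
		return nil, false
	}
	r := &t.rows[t.pos]
	t.pos++
	return r, true
}

// selection keeps rows satisfying c2 > 1.
type selection struct{ child executor }

func (s *selection) Next() (*row, bool) {
	for {
		r, ok := s.child.Next()
		if !ok {
			return nil, false
		}
		if r.c2 > 1 {
			return r, true
		}
	}
}

// projection keeps only column c1.
type projection struct{ child executor }

func (p *projection) Next() (*row, bool) {
	r, ok := p.child.Next()
	if !ok {
		return nil, false
	}
	return &row{c1: r.c1}, true
}

func main() {
	root := &projection{child: &selection{child: &tableReader{
		rows: []row{{c1: 10, c2: 0}, {c1: 20, c2: 5}},
	}}}
	for r, ok := root.Next(); ok; r, ok = root.Next() {
		fmt.Println(r.c1) // prints 20
	}
}
```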

Monitoring Explanation:

The chunk size can be controlled with the parameter tidb_max_chunk_size (session/global), which determines how many rows are fetched at once. The executor allocates memory in one shot based on the specified number of rows. Recommended values:

OLTP APP = 32

OLAP APP = 1024+

Version 3.0 adjusted this into two parameters, init_chunk_size and max_chunk_size, which means that in version 3.0 no manual adjustment is generally needed.
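A rough pure-Go illustration of the idea: memory for a batch is allocated once with capacity for max_chunk_size rows, and each fetch refills that batch. This is not TiDB's actual chunk implementation.

```go
// Sketch of chunked row fetching: allocate capacity for maxChunkSize rows once,
// then fill the batch until it is full.
package main

import "fmt"

type chunk struct {
	rows [][]interface{}
}

func newChunk(maxChunkSize int) *chunk {
	// One-time allocation sized by the configured chunk size.
	return &chunk{rows: make([][]interface{}, 0, maxChunkSize)}
}

func (c *chunk) reset()                 { c.rows = c.rows[:0] }
func (c *chunk) append(r []interface{}) { c.rows = append(c.rows, r) }
func (c *chunk) full() bool             { return len(c.rows) == cap(c.rows) }

func main() {
	c := newChunk(32) // e.g. 32 rows per batch for an OLTP workload
	for i := 0; i < 100 && !c.full(); i++ {
		c.append([]interface{}{i})
	}
	fmt.Println(len(c.rows), cap(c.rows)) // 32 32
}
```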

6. Get Key’s Region and TSO

TiDB locates the region and obtains TSO by sending requests to PD. The PD client implements the GetRegion interface. By calling this interface, we can locate the Region where the key is located.

If GetRegion were called in a loop for every Region in a range, each read would need to access PD first, which would put a huge load on PD and hurt request performance. The tikv-client therefore implements a RegionCache component to cache Region information. When locating the Region containing a key, if the RegionCache hits, there is no need to access PD.

Inside the RegionCache there are two data structures that store Region information: a map and a b-tree. The map can quickly find a Region by region ID, and the b-tree can find the Region containing a given key. Through the LocateKey method provided by the RegionCache, we can find which Regions contain the data in a key range. If Region information changes because of a Region split or migration, the cached Region information expires, and tikv-server returns a Region error. On a Region error, TiDB cleans the RegionCache (only the changed entries), fetches the latest Region information, and resends the request.
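A stripped-down sketch of those two structures, using github.com/google/btree (an assumed dependency) for the by-key index; the real RegionCache also tracks peers, leaders, and TTLs, which are omitted here.

```go
// Stripped-down sketch of RegionCache's two indexes: a map keyed by region ID
// and a B-tree ordered by StartKey.
package main

import (
	"bytes"
	"fmt"

	"github.com/google/btree"
)

type Region struct {
	ID       uint64
	StartKey []byte
	EndKey   []byte // empty means "to +infinity"
}

type regionItem struct{ r *Region }

func (a regionItem) Less(b btree.Item) bool {
	return bytes.Compare(a.r.StartKey, b.(regionItem).r.StartKey) < 0
}

type RegionCache struct {
	byID  map[uint64]*Region // fast lookup by region ID
	byKey *btree.BTree       // ordered by StartKey, for key -> region lookup
}

func NewRegionCache() *RegionCache {
	return &RegionCache{byID: map[uint64]*Region{}, byKey: btree.New(32)}
}

func (c *RegionCache) Insert(r *Region) {
	c.byID[r.ID] = r
	c.byKey.ReplaceOrInsert(regionItem{r})
}

// LocateKey returns the cached region containing key, or nil on a cache miss
// (in which case the caller falls back to PD's GetRegion).
func (c *RegionCache) LocateKey(key []byte) *Region {
	var found *Region
	c.byKey.DescendLessOrEqual(regionItem{&Region{StartKey: key}}, func(i btree.Item) bool {
		found = i.(regionItem).r
		return false // stop at the first region whose StartKey <= key
	})
	if found == nil {
		return nil
	}
	if len(found.EndKey) == 0 || bytes.Compare(key, found.EndKey) < 0 {
		return found
	}
	return nil // key falls in a gap; must ask PD
}

func main() {
	c := NewRegionCache()
	c.Insert(&Region{ID: 1, StartKey: []byte("a"), EndKey: []byte("m")})
	c.Insert(&Region{ID: 2, StartKey: []byte("m")})
	fmt.Println(c.LocateKey([]byte("k")).ID) // 1
}
```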

Region Cache Update Mechanism:

TiDB first accesses PD to get the TSO, then looks up the tidb-server's local region cache, and then sends the request to TiKV according to the routing information obtained. If TiKV returns an error indicating that the routing information is outdated, the tidb-server re-fetches the latest region routing information from PD and updates the region cache. If the request is sent to a follower, TiKV returns a not leader error along with the leader's information, and the tidb-server updates the region cache accordingly.
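That update mechanism boils down to a retry loop: look up the cache, send, and on a stale-region or not-leader error refresh or patch the cache and resend. The sketch below uses placeholder error values and cache methods, not the tikv-client's real API.

```go
// Sketch of the routing-retry loop described above. Error values and cache
// methods are placeholders.
package main

import "errors"

var (
	errStaleRegion = errors.New("region epoch not match") // routing info outdated
	errNotLeader   = errors.New("not leader")             // request hit a follower
)

func sendWithRegionRetry(cache *regionCache, key []byte, req []byte) ([]byte, error) {
	for {
		loc := cache.locate(key) // local region cache lookup
		if loc == nil {
			cache.reloadFromPD(key) // cache miss: ask PD for the route
			continue
		}
		resp, err := sendToTiKV(loc, req)
		switch {
		case err == nil:
			return resp, nil
		case errors.Is(err, errStaleRegion):
			cache.invalidate(loc)   // drop only the changed region
			cache.reloadFromPD(key) // fetch fresh routing info from PD
		case errors.Is(err, errNotLeader):
			cache.switchLeader(loc) // TiKV returned the new leader; update the cache
		default:
			return nil, err
		}
	}
}

// Placeholder types and stubs so the sketch is self-contained.
type regionCache struct{}
type location struct{}

func (c *regionCache) locate(key []byte) *location         { return &location{} }
func (c *regionCache) reloadFromPD(key []byte)              {}
func (c *regionCache) invalidate(loc *location)             {}
func (c *regionCache) switchLeader(loc *location)           {}
func sendToTiKV(loc *location, req []byte) ([]byte, error) { return nil, nil }

func main() {
	_, _ = sendWithRegionRetry(&regionCache{}, []byte("k"), []byte("req"))
}
```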

TSO Acquisition Mechanism:

Point Get also fetches TSO by sending a TSO request to PD during the PrepareTxnCtx phase (an optimization is planned). PrepareTxnCtx runs very early, before the SQL is even parsed. However, although Point Get requests a TSO, it does not call txnFuture.wait to read it; when the conditions below are met, a Txn object is constructed directly using maxTs, so Point Get does not wait for PD's reply but still puts some pressure on PD (although requests are batched). In other cases, at least one TSO (start_ts) is fetched, and note that a start_ts of 0 does not mean that no TSO request was sent to PD. The conditions are:

  1. Single statement autoCommit
  2. Read statement
  3. Queries involving only one key, i.e., point queries.

All TiDB interactions with PD are done through a PD Client object, which is created when the server starts and creates the Store. After creation, a new thread is started to batch fetch TSO from PD. The workflow of this thread is roughly as follows:

  1. The TSO thread listens to a channel. If there is a TSO request in this channel, it starts requesting TSO from PD (if there are multiple requests in this channel, this request will be batched).
  2. The batch-requested TSO is allocated to these requests.

These TSO requests are divided into three stages (a sketch follows the list below):

  1. Put a TSO request into this channel, corresponding to the GetTSAsync function. Calling this function will get a tsFuture object.
  2. The TSO thread takes the request from the channel and sends an RPC request to PD to get TSO. After getting TSO, it allocates it to the corresponding request.
  3. After TSO is allocated to the corresponding request, TSO can be obtained through the tsFuture held (calling tsFuture.Wait()).
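A toy sketch of how the three stages fit together: a request channel, a batching worker, and a future to wait on. The names are placeholders, not the real pd-client API.

```go
// Toy sketch of the three TSO stages above: enqueue a request (GetTSAsync),
// a background goroutine batches pending requests into one PD RPC, and the
// caller blocks in Wait().
package main

type tsFuture struct{ ch chan uint64 }

func (f *tsFuture) Wait() uint64 { return <-f.ch } // stage 3: block until TSO arrives

type tsoClient struct{ reqCh chan *tsFuture }

// GetTSAsync puts a request into the channel and returns immediately (stage 1).
func (c *tsoClient) GetTSAsync() *tsFuture {
	f := &tsFuture{ch: make(chan uint64, 1)}
	c.reqCh <- f
	return f
}

// tsoWorker drains the channel, batches pending requests into a single PD RPC,
// and hands the returned timestamps back to each future (stage 2).
func (c *tsoClient) tsoWorker() {
	for first := range c.reqCh {
		batch := []*tsFuture{first}
		for {
			select {
			case f := <-c.reqCh:
				batch = append(batch, f)
				continue
			default:
			}
			break
		}
		base := requestTSOFromPD(len(batch)) // one RPC covers the whole batch
		for i, f := range batch {
			f.ch <- base + uint64(i)
		}
	}
}

func requestTSOFromPD(n int) uint64 { return 100 } // stub for the PD RPC

func main() {
	c := &tsoClient{reqCh: make(chan *tsFuture, 128)}
	go c.tsoWorker()
	f := c.GetTSAsync() // stage 1
	_ = f.Wait()        // stage 3
}
```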

Monitoring Explanation:

Currently, there is no monitoring for the first stage, but this process is usually very fast unless the channel is full; as long as the channel is not full, GetTSAsync returns quickly. A full channel indicates that the RPC request latency may be too high, so the RPC request duration can be analyzed further.

  • PD TSO RPC Duration: Reflects the time taken to send an RPC request to PD. If this process is slow, there are two possibilities:
    • High network latency between TiDB and PD
    • PD load is too high to handle TSO RPC requests in a timely manner
  • TSO Async Duration: The time consumed from obtaining a tsFuture to calling tsFuture.Wait(). After obtaining tsFuture, SQL parsing and compiling into an execution plan are still needed. The actual execution will call tsFuture.Wait(). Therefore, if this part of the delay is too long, it may be because:
    • This SQL is very complex, and parsing takes a long time
    • Compiling takes a long time
  • PD TSO Wait Duration: After obtaining a tsFuture, if the parsing and compiling process is completed quickly, calling tsFuture.Wait() may still need to wait if the PD’s TSO RPC has not returned. This time reflects the waiting time for this segment.
  • PD TSO Duration: The time for the entire process above

Therefore, the correct way to analyze this part is to first check whether the overall TSO delay is too large, and then check which specific stage has excessive delay.

Note:

Reading only requires fetching TSO from PD once. Writing requires two TSO fetches because the write process uses two-phase commit and needs both start_ts and commit_ts. For now, deriving commit_ts from start_ts to reduce the number of fetches is not supported. Therefore, in stress-testing scenarios, the PD TSO Wait Duration metric for write workloads is generally much higher than for read workloads.

tikv-client configuration parameters:

max-batch-size = 15: the maximum number of RPC packets sent in a batch. If not 0, the BatchCommands API is used to send requests to TiKV, which can reduce RPC latency in high-concurrency situations. It is recommended not to modify this value.

grpc-connection-count = 16: the maximum number of connections established with each TiKV.

7. Construct twoPhaseCommitter

When a transaction is ready to commit, a twoPhaseCommitter is created to execute the distributed transaction. During construction, the following tasks need to be done:

  • Collect all keys
| username: CuteRay | Original post link

Good article, bookmarked (whispering: the directory structure is incorrect)

| username: Billmay表妹 | Original post link

I am still adjusting.

| username: Meditator | Original post link

Decisively bookmarked again~!

| username: cs58_dba | Original post link

Learn and gradually absorb through practice.

| username: lh412 | Original post link

Awesome, thumbs up, it solved a lot of my questions.