Meaning of Exec_retry_count and Txn_retry Fields

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Exec_retry_count 和 Txn_retry 字段的含义

| username: TiDBer_KkruFifg

[TiDB Usage Environment] Production Environment / Test / PoC
Production Environment

[TiDB Version]
v5.4.1

[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
Why are the values of Exec_retry_count and Txn_retry fields in the slow log inconsistent, as shown below?

  1. What do these two fields respectively represent?
  2. What is the minimum value of max-retry-count? The official documentation does not specify.

Exec_retry_time: 48.638258389 Exec_retry_count: 121

Query_time: 48.652885978

Parse_time: 0

Compile_time: 0.002173023

Rewrite_time: 0.001878843

Optimize_time: 0

Wait_TS: 0.000009758

Cop_time: 0.001384656 Process_time: 0.002 LockKeys_time: 36.472153953 Request_count: 11 Prewrite_time: 12.163131895 Wait_prewrite_binlog_time: 0.000000364 Commit_backoff_time: 12.151 Backoff_types: [txnLock txnLock txnLock txnLock txnLock txnLock txnLock txnLock txnLock txnLock] Resolve_lock_time: 0.18589224 Write_keys: 60 Write_size: 5640 Prewrite_region: 30 Txn_retry: 1 Process_keys: 20 Total_keys: 2085 Rocksdb_delete_skipped_count: 1753 Rocksdb_key_skipped_count: 2287 Rocksdb_block_cache_hit_count: 281

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachment: Screenshot/Log/Monitoring]

| username: hey-hoho | Original post link

max-retry-count

  • The maximum number of retries for a single statement in a pessimistic transaction. If the number of retries exceeds this limit, the statement execution will report an error.
  • Default value: 256

Source: TiDB 配置文件描述 | PingCAP 文档中心
Various retries for lock conflicts reference: TiDB 锁冲突问题处理 | PingCAP 文档中心

  • Exec_retry_count: Indicates the number of retries for executing this statement. It generally appears in pessimistic transactions, retrying the execution of the statement when locking fails.
  • Txn_retry: There is no official explanation found, but from the literal meaning, it can be guessed as the number of retries for the entire transaction.
| username: 昵称想不起来了 | Original post link

If you can’t find it in the official documentation, you can try searching for keywords on GitHub. Variable definitions generally have comments that briefly explain their meanings.

	// TxnRetryStr means the count of transaction retry.
	TxnRetryStr = "Txn_retry"
| username: 昵称想不起来了 | Original post link

It seems that max-retry-count does not have any maximum or minimum value requirements.

| username: TiDBer_KkruFifg | Original post link

May I ask why TiDB’s pessimistic lock also has the concept of retry?

| username: TiDBer_KkruFifg | Original post link

The TiDB cluster uses pessimistic transactions, i.e., SET GLOBAL tidb_txn_mode = 'pessimistic'; However, the business DML is autocommit and does not start explicit transactions. Some key parameter settings are as follows:
tidb_disable_txn_auto_retry = on # This parameter controls whether to automatically retry

So, what is the difference between setting tidb_retry_limit to 0 and 10? Below is my personal understanding, is it correct?

Because tidb_retry_limit = 0 is set to 0, transactions will not be retried and will immediately return a failure. Below is the log, you can see the keyword IsPessimistic=false can not retry txn

[2023/10/09 23:18:01.126 +08:00] [WARN] [session.go:843] ["can not retry txn"] [conn=8786589] [label=general] [error="[kv:9007]Write conflict, txnStartTS=444822894948384807, conflictStartTS=444822894948384798, conflictCommitTS=444822894948384829, key={tableID=73, indexID=3, indexValues={8831587085283933373, 312893834, user, aa, 2, 1692028799, 1687751522, 1696344993, 312893834035306, }} primary={tableID=73, indexID=3, indexValues={8831587085283933373, 312893834, user, aa, 2, 1692028799, 1687751522, 1696344993, 312893834035306, }} [try again later]"] [IsBatchInsert=false] [IsPessimistic=false] [InRestrictedSQL=false] [tidb_retry_limit=0] [tidb_disable_txn_auto_retry=true]

Below is the log when tidb_retry_limit = 10 is set to 10, the keyword is try again later, you can see the retry

[2023/10/09 16:49:30.805 +08:00] [WARN] [session.go:833] [sql] [conn=8764471] [label=general] [error="[kv:9007]Write conflict, txnStartTS=444816784279995595, conflictStartTS=444816784279995574, conflictCommitTS=444816784279995683, key={tableID=73, indexID=3, indexValues={8831587085283933373, 36329996, user, aa, 2, 1673366399, 1671216578, 1696737088, 36329996035306, }} primary={tableID=73, indexID=3, indexValues={8831587085283933373, 36329996, user, aa, 2, 1673366399, 1671216578, 1696737088, 36329996035306, }} [try again later]"] [txn="Txn{state=invalid}"]

Why does TiDB retry in pessimistic transactions?
Because the value of tidb_retry_limit is not 0, TiDB first adopts an optimistic lock, and in case of conflict, it retries using a pessimistic transaction. This is how TiDB is implemented.

So, if tidb_retry_limit=0 is set to 0, can it achieve the effect of reporting a conflict immediately without retrying?
That is, optimistic transactions will not retry, nor will they enter the scenario of retrying using pessimistic transactions, right? The link is as follows:

| username: TiDBer_KkruFifg | Original post link

Txn_retry: A field in the slow log indicating the number of optimistic transaction retries.
Exec_retry_count: A field in the slow log indicating the number of pessimistic transaction retries.
max-retry-count: A configuration parameter that sets the maximum number of retries for pessimistic transactions.
tidb_retry_limit: A configuration parameter that sets the number of retries for optimistic transactions.

Summary:
Even if the TiDB cluster uses the pessimistic transaction model, if the transaction is auto-committed (i.e., autocommit):

  • tidb_retry_limit = n
    The first attempt uses an optimistic transaction, and conflict retries use a pessimistic transaction. The maximum number of retries for pessimistic transactions is determined by max-retry-count.

  • tidb_retry_limit = 0
    Automatic retries are disabled. In case of a conflict, there are neither optimistic nor pessimistic retries.

Why does TiDB have the concept of pessimistic retries while MySQL does not? Here is the understanding:

  • To maintain behavior consistent with MySQL, configure explicit transactions.
  • Otherwise, TiDB will execute its own implementation of pessimistic transactions, which includes the concept of retries. It is still unclear why TiDB implements it this way and why there is a concept of pessimistic retries.

Below are TiDB logs under different configurations:

  • tidb_retry_limit = 3 and autocommit and tidb_disable_txn_auto_retry = on
    [2023/09/09 16:49:30.805 +08:00] [WARN] [session.go:833] [sql] [conn=8764471] [label=general] [error=“[kv:9007]Write conflict, txnStartTS=444816784279995595, conflictStartTS=444816784279995574, conflictCommitTS=444816784279995683, key={tableID=73, indexID=3, indexValues={8831587085283933373, 36329996, user, aa, 2, 1673366399, 1671216578, 1696737088, 36329996035306, }} primary={tableID=73, indexID=3, indexValues={8831587085283933373, 36329996, user, aa, 2, 1673366399, 1671216578, 1696737088, 36329996035306, }} [try again later]”] [txn=“Txn{state=invalid}”]

  • tidb_retry_limit = 0 and autocommit and tidb_disable_txn_auto_retry = on
    [2023/09/09 23:18:01.126 +08:00] [WARN] [session.go:843] [“can not retry txn”] [conn=8786589] [label=general] [error=“[kv:9007]Write conflict, txnStartTS=444822894948384807, conflictStartTS=444822894948384798, conflictCommitTS=444822894948384829, key={tableID=73, indexID=3, indexValues={8831587085283933373, 312893834, user, aa, 2, 1692028799, 1687751522, 1696344993, 312893834035306, }} primary={tableID=73, indexID=3, indexValues={8831587085283933373, 312893834, user, aa, 2, 1692028799, 1687751522, 1696344993, 312893834035306, }} [try again later]”] [IsBatchInsert=false] [IsPessimistic=false] [InRestrictedSQL=false] [tidb_retry_limit=0] [tidb_disable_txn_auto_retry=true]

| username: hey-hoho | Original post link

In summary, locking in a distributed architecture is a high-cost operation, and retries are needed to ensure the efficiency of successful commits.