Unable to delete specific table row

Hi,
we’re running a small production database and we stumbled upon a row in one of our tables that just cannot be deleted.

> delete from records where id = 3991874453;

ERROR 1105 (HY000): unexpected resolve err:
commit_ts_expired:<start_ts:455172192182209523 attempted_commit_ts:455172192182210339 
key:"t\200\000\000\000\000\000\000r_i\200\000\000\000\000\000\000\003\000\003\200\000\000\000\355\357+\225" min_commit_ts:455172192785137665 >,
lock: key: 7480000000000000725f698000000000000003000380000000edef2b95,
primary: 7480000000000000725f7280000000edef2b20,
txnStartTS: 455172192182209523,
lockForUpdateTS:455172192182209837,
minCommitTs:455172192785137665,
ttl: 3857,
type: Put,
UseAsyncCommit

According to docs:

This error usually occurs when the version of TiDB does not match with that of TiKV. To avoid version mismatch, upgrade all components when you upgrade the version.

…but our TiDB and TiKV version are in sync:

$ tikv-server -V
TiKV 
Release Version:   8.5.1
Edition:           Community
Git Commit Hash:   25897b73e1f0e93f616c764dbe65ed7db02edd07
Git Commit Branch: HEAD
UTC Build Time:    2025-01-14 05:02:29
Rust Version:      rustc 1.77.0-nightly (89e2160c4 2023-12-27)
Enable Features:   memory-engine pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine trace-async-tasks openssl-vendored
Profile:           dist_release
$ tidb-server -V
Release Version: v8.5.1
Edition: Community
Git Commit Hash: fea86c8e35ad4a86a5e1160701f99493c2ee547c
Git Branch: HEAD
UTC Build Time: 2025-01-16 07:38:34
GoVersion: go1.23.4
Race Enabled: false
Check Table Before Drop: false
Store: unistore

Information I gathered so far:

  • no persistent locks seem to be visible in information_schema.data_lock_waits
  • UPDATE statement can be performed on this row without issues
  • TiKV compaction (default, lock, write) executed successfully but had no effect on the issue
  • there is no noticeable delay between executing the DELETE statement and receiving the error - the response seems immediate

Application environment:

production

TiDB version:

v8.5.1

Reproduction method:

unknown, the table has 35125 row and this is the only one that seems to cause issues

Resource allocation:

> select * from CLUSTER_INFO;
+-------+-------------------+-------------------+---------+------------------------------------------+---------------------+---------------------+-----------+
| TYPE  | INSTANCE          | STATUS_ADDRESS    | VERSION | GIT_HASH                                 | START_TIME          | UPTIME              | SERVER_ID |
+-------+-------------------+-------------------+---------+------------------------------------------+---------------------+---------------------+-----------+
| tidb  | 10.64.2.115:4000  | 10.64.2.115:10080 | 8.5.1   | fea86c8e35ad4a86a5e1160701f99493c2ee547c | 2025-04-14 12:20:23 | 29h10m21.553669229s |      1661 |
| tidb  | 10.64.3.182:4000  | 10.64.3.182:10080 | 8.5.1   | fea86c8e35ad4a86a5e1160701f99493c2ee547c | 2025-04-14 13:37:35 | 27h53m9.553674789s  |      1313 |
| tidb  | 10.64.1.158:4000  | 10.64.1.158:10080 | 8.5.1   | fea86c8e35ad4a86a5e1160701f99493c2ee547c | 2025-04-14 13:06:08 | 28h24m36.553676857s |      1449 |
| pd    | 10.64.3.182:2379  | 10.64.3.182:2379  | 8.5.1   | 5036cc277fc8c0e7fa1219265ae407de80b950b7 | 2025-04-14 13:37:27 | 27h53m17.553678506s |         0 |
| pd    | 10.64.1.158:2379  | 10.64.1.158:2379  | 8.5.1   | 5036cc277fc8c0e7fa1219265ae407de80b950b7 | 2025-04-14 13:05:51 | 28h24m53.553680362s |         0 |
| pd    | 10.64.2.115:2379  | 10.64.2.115:2379  | 8.5.1   | 5036cc277fc8c0e7fa1219265ae407de80b950b7 | 2025-04-14 12:20:07 | 29h10m37.55368251s  |         0 |
| tikv  | 10.64.1.158:20160 | 10.64.1.158:20180 | 8.5.1   | 25897b73e1f0e93f616c764dbe65ed7db02edd07 | 2025-04-14 13:06:05 | 28h24m39.553684725s |         0 |
| tikv  | 10.64.2.115:20160 | 10.64.2.115:20180 | 8.5.1   | 25897b73e1f0e93f616c764dbe65ed7db02edd07 | 2025-04-14 12:20:20 | 29h10m24.553687298s |         0 |
| tikv  | 10.64.3.182:20160 | 10.64.3.182:20180 | 8.5.1   | 25897b73e1f0e93f616c764dbe65ed7db02edd07 | 2025-04-14 13:37:31 | 27h53m13.553690091s |         0 |
| ticdc | 127.0.0.1:8300    | 127.0.0.1:8300    | 8.5.1   | 0a10d1fae9245c480e31df30f8b13ca4810c3b04 | 2025-04-14 13:36:29 | 27h54m15.553692208s |         0 |
| ticdc | 127.0.0.1:8300    | 127.0.0.1:8300    | 8.5.1   | 0a10d1fae9245c480e31df30f8b13ca4810c3b04 | 2025-04-14 13:37:44 | 27h53m0.553696754s  |         0 |
| ticdc | 127.0.0.1:8300    | 127.0.0.1:8300    | 8.5.1   | 0a10d1fae9245c480e31df30f8b13ca4810c3b04 | 2025-04-14 13:38:41 | 27h52m3.553698772s  |         0 |
| ticdc | 127.0.0.1:8300    | 127.0.0.1:8300    | 8.5.1   | 0a10d1fae9245c480e31df30f8b13ca4810c3b04 | 2025-04-14 13:36:30 | 27h54m14.553701027s |         0 |
| ticdc | 127.0.0.1:8300    | 127.0.0.1:8300    | 8.5.1   | 0a10d1fae9245c480e31df30f8b13ca4810c3b04 | 2025-04-14 13:36:32 | 27h54m12.553702969s |         0 |
| ticdc | 127.0.0.1:8300    | 127.0.0.1:8300    | 8.5.1   | 0a10d1fae9245c480e31df30f8b13ca4810c3b04 | 2025-04-14 13:38:43 | 27h52m1.553705041s  |         0 |
+-------+-------------------+-------------------+---------+------------------------------------------+---------------------+---------------------+-----------+
> select * from TIKV_STORE_STATUS;
+----------+-------------------+-------------+------------------+-------+---------+----------+-----------+--------------+---------------+--------------+-------------+--------------+---------------+--------------------+-------------+---------------------+---------------------+---------------------+
| STORE_ID | ADDRESS           | STORE_STATE | STORE_STATE_NAME | LABEL | VERSION | CAPACITY | AVAILABLE | LEADER_COUNT | LEADER_WEIGHT | LEADER_SCORE | LEADER_SIZE | REGION_COUNT | REGION_WEIGHT | REGION_SCORE       | REGION_SIZE | START_TS            | LAST_HEARTBEAT_TS   | UPTIME              |
+----------+-------------------+-------------+------------------+-------+---------+----------+-----------+--------------+---------------+--------------+-------------+--------------+---------------+--------------------+-------------+---------------------+---------------------+---------------------+
|  5843362 | 10.64.1.158:20160 |           0 | Up               | null  | 8.5.1   | 49.19GiB | 35.27GiB  |          140 |             1 |          140 |       20017 |          420 |             1 | 3055057224.7391706 |       63992 | 2025-04-14 13:06:05 | 2025-04-15 17:31:29 | 28h25m24.399719043s |
|  5839223 | 10.64.2.115:20160 |           0 | Up               | null  | 8.5.1   | 49.19GiB | 34.79GiB  |          137 |             1 |          137 |       21574 |          420 |             1 |  2979299079.548395 |       63992 | 2025-04-14 12:20:20 | 2025-04-15 17:31:31 | 29h11m11.513080127s |
|  5846392 | 10.64.3.182:20160 |           0 | Up               | null  | 8.5.1   | 49.19GiB | 35.55GiB  |          143 |             1 |          143 |       22401 |          420 |             1 | 2941414583.7685237 |       63992 | 2025-04-14 13:37:31 | 2025-04-15 17:31:33 | 27h54m2.197166337s  |
+----------+-------------------+-------------+------------------+-------+---------+----------+-----------+--------------+---------------+--------------+-------------+--------------+---------------+--------------------+-------------+---------------------+---------------------+---------------------+

To analyze the commit timestamp expiration issue in TiDB and propose potential solutions, we need to understand the context of the problem and the relevant features of TiDB that might be involved. Based on the provided knowledge graph and vector search results, here are some insights and potential solutions:

Understanding the Issue

  1. Commit Timestamp Expiration: This issue typically arises when a transaction takes too long to commit, causing its timestamp to become outdated or invalid. This can happen due to long-running transactions or delays in the commit phase.

  2. Transaction Management in TiDB: TiDB supports distributed transactions with both optimistic and pessimistic models. Transactions are managed using statements like BEGIN, COMMIT, and ROLLBACK. The default mode is pessimistic, which can help in scenarios where conflicts are frequent.

  3. Garbage Collection (GC) Timeout: TiDB uses MVCC (Multiple Version Concurrency Control) to manage data versions. Old data is periodically cleaned up by the GC process. If a transaction’s duration exceeds the GC life time, it can lead to errors like GC life time is shorter than transaction duration.

  4. Stale Read and safe-ts in TiKV: Stale reads can occur if transactions are not committed in a timely manner, affecting the resolved-ts (timestamp) calculation. Long-lived transactions can block the advance of resolved-ts.

Potential Solutions

  1. Adjust GC Life Time:

    • For TiDB versions earlier than v5.0, adjust tikv_gc_life_time in the mysql.tidb table.
    • For TiDB v5.0 and later, adjust the system variable tidb_gc_life_time. This can help extend the duration for which old data versions are retained, reducing the likelihood of commit timestamp expiration.
  2. Optimize Transaction Duration:

    • Identify and optimize long-running transactions. Use tools like SHOW PROCESSLIST to monitor active transactions and their durations.
    • Consider breaking down large transactions into smaller ones to reduce the time taken for each commit.
  3. Handle Slow Transaction Commits:

    • Use TiKV logs to identify transactions with long prewrite phases. This can help pinpoint the cause of delays.
    • Ensure that the application logic is optimized to avoid unnecessary delays in transaction processing.
  4. Configure TTL (Time to Live) Jobs:

    • Use TTL jobs to periodically clean up expired data, which can help manage data volume and reduce transaction processing time.
    • Customize the execution period of TTL jobs using the TTL_JOB_INTERVAL attribute.
  5. Monitor and Troubleshoot:

    • Use the CLUSTER_TIDB_TRX table to find active transactions and diagnose issues.
    • Monitor the network and system performance to ensure that there are no bottlenecks affecting transaction processing.

By implementing these solutions, you can mitigate the commit timestamp expiration issue in TiDB and improve the overall performance and reliability of your database operations.