Issue of Cluster Response Time Surge After TiKV Node Restart

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 节点重启后,集群响应时间飙升问题

| username: 小老板努力变强

【TiDB Usage Environment】Production environment
【TiDB Version】v6.1.4
【Reproduction Path】After one TiKV node restarts, another node starts outputting prepare_write_err exceptions.
【Encountered Problem: Symptoms and Impact】Overall cluster latency increases
【Resource Configuration】8 TiKV machines, 24 instances
【Attachments: Screenshots/Logs/Monitoring】




TiDB node reports an error:
[2023/11/09 11:15:08.205 +08:00] [WARN] [session.go:1966] ["run statement failed"] [schemaVersion=233964] [error="previous statement: update mysql.table_cache_meta set lock_type = 'READ', lease = 445513650692423680 where tid = 203252: [kv:9007]Write conflict, txnStartTS=445513651491963030, conflictStartTS=445513651491963037, conflictCommitTS=0, key={tableID=57, handle=203252} primary=byte(nil) [try again later]"] [session="{\n "currDBName": "",\n "id": 0,\n "status": 2,\n "strictMode": true,\n "user": null\n}"]
[2023/11/09 11:15:08.205 +08:00] [WARN] [cache.go:205] ["lock cached table for read"] [error="previous statement: update mysql.table_cache_meta set lock_type = 'READ', lease = 445513650692423680 where tid = 203252: [kv:9007]Write conflict, txnStartTS=445513651491963030, conflictStartTS=445513651491963037, conflictCommitTS=0, key={tableID=57, handle=203252} primary=byte(nil) [try again later]"]
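
For context, the conflicting statement in these logs is TiDB's internal read-lease renewal for a cached table, which it records in the mysql.table_cache_meta system table. A rough way to inspect the lease row for the table ID printed in the log (tid=203252 here; assuming the cluster allows reading this internal table and that its columns match what the log shows):

    -- Look at the cached-table lease row that the conflicting internal
    -- UPDATE keeps renewing; the tid value is taken from the log line.
    SELECT tid, lock_type, lease
    FROM mysql.table_cache_meta
    WHERE tid = 203252;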

| username: DBRE | Original post link

Please provide the logs for the restarted TiKV node. If there are issues with the TiKV node, remove it first to prevent further loss.

| username: h5n1 | Original post link

Have any small tables been set as cache tables?
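
One way to check, with `t` standing in for the table you suspect: on versions that support cached tables, the SHOW CREATE TABLE output should carry a CACHED ON annotation if the table has been cached.

    -- If the table has been turned into a cached table, the DDL output
    -- should include a CACHED ON attribute (`t` is a placeholder name).
    SHOW CREATE TABLE t;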

| username: 小老板努力变强 | Original post link

Indeed, after taking the small table out of the table cache, the cluster recovered.
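
For reference, taking a table out of the table cache is done with NOCACHE; a minimal sketch, with `t` standing in for the actual small table:

    -- Stop caching the table so TiDB no longer renews read leases for it
    -- in mysql.table_cache_meta (`t` is a placeholder for the real table).
    ALTER TABLE t NOCACHE;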

| username: Fly-bird | Original post link

Is TiKV performing data synchronization?

| username: Billmay表妹 | Original post link

Collect the logs and let the R&D team take a look.

| username: tiancaiamao | Original post link

Most likely some transactions were aborted abnormally by the restart, leaving a large number of transaction locks behind in the data. Resolving those leftover locks after the restart is what drove up transaction latency.

| username: h5n1 | Original post link

It doesn't hold up to blame the restart itself for the transaction failures; after all, things got better once he disabled the table cache.

| username: tiancaiamao | Original post link

Then we need to check whether all of the conflicts originate from the cached table.

| username: h5n1 | Original post link

The TiDB logs he provided show that the conflict is on this table-cache metadata table.