TiKV Write Blockage Causes Query Timeout

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv写入阻塞导致查询超时

| username: TiDBer_C0UeEuOq

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.7
[Reproduction Path] Occasional, cannot be reproduced in a fixed way
[Encountered Problem: Problem Phenomenon and Impact]
The business side reports query timeouts (the application timeout is configured at 5s) every day, at no fixed time.
The statement that times out is always the same: a primary-key lookup that returns at most one row each time.

SELECT * FROM `table1` WHERE `uid` = ********* LIMIT 1

Correlating with the reported timeout windows, a matching WARN log can be found on the same TiKV node each time. The logs on the other TiKV nodes have been checked and show no errors.

[2024/01/07 16:39:44.045 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.822788231s, send: 0.000000817s, callback: 0s thread: sync-writer"] [takes=1822]

The cluster's write load is not high (the peak shown in the figure is about 700k). The only hotspot region is on the primary key of another table (a NONCLUSTERED table, not table1).
That primary key is written sequentially, so the writes cannot be scattered and always land on one TiKV node.
The leader of this hotspot region was confirmed to be on the abnormal TiKV, which may be related.
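
A rough way to cross-check this from SQL (just a sketch, assuming the built-in information_schema tables in this version) is to join the current write-hot regions with their peers and see whether the hot leader sits on the store that logs the slow-write warnings:

-- Sketch: current write-hot regions and the stores holding their leaders
SELECT h.db_name, h.table_name, h.index_name, h.region_id,
       p.store_id, p.is_leader, h.flow_bytes
FROM information_schema.tidb_hot_regions h
JOIN information_schema.tikv_region_peers p ON p.region_id = h.region_id
WHERE h.type = 'write'
ORDER BY h.flow_bytes DESC
LIMIT 20;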

[Resource Configuration]

[Attachments: Screenshots/Logs/Monitoring]
TiKV logs during the abnormal period

[2024/01/07 16:39:40.862 +08:00] [INFO] [apply.rs:1770] ["exec ConfChange"] [epoch="conf_ver: 24666 version: 12599"] [type=RemoveNode] [peer_id=171471415] [region_id=480019]
[2024/01/07 16:39:40.862 +08:00] [INFO] [apply.rs:1878] ["remove peer successfully"] [region="id: 480019 start_key: 7480000000000004FF175F72E000000309FFB32AD70000000000FA end_key: 7480000000000004FF175F72E8000002FCFFA3F83C0000000000FA region_epoch { conf_ver: 24666 version: 12599 } peers { id: 171471152 store_id: 137 } peers { id: 171471173 store_id: 136 } peers { id: 171471362 store_id: 627568 role: Learner } peers { id: 171471415 store_id: 1 }"] [peer="id: 171471362 store_id: 627568 role: Learner"] [peer_id=171471415] [region_id=480019]
[2024/01/07 16:39:40.862 +08:00] [INFO] [raft.rs:2646] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {171471152, 171471173, 171471415} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=171471415] [region_id=480019]
[2024/01/07 16:39:40.899 +08:00] [INFO] [peer.rs:1172] ["deleting applied snap file"] [snap_file=480019_2231_45044232] [peer_id=171471415] [region_id=480019]
[2024/01/07 16:39:40.899 +08:00] [INFO] [snap.rs:674] ["set_snapshot_meta total cf files count: 4"]
[2024/01/07 16:39:42.220 +08:00] [INFO] [kv.rs:1117] ["call CheckLeader failed"] [address=ipv4:*.*.*.*:50794] [err=Grpc(RemoteStopped)]
[2024/01/07 16:39:44.045 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.822788231s, send: 0.000000817s, callback: 0s thread: sync-writer"] [takes=1822]
[2024/01/07 16:39:44.045 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.822514695s, send: 0.000019696s, callback: 0s thread: sync-writer"] [takes=1822]
[2024/01/07 16:39:44.045 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.8169129910000001s, send: 0.000025976s, callback: 0s thread: sync-writer"] [takes=1816]
[2024/01/07 16:39:44.045 +08:00] [WARN] [store.rs:999] ["[store 1] handle 1 pending peers include 1 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1822]
[2024/01/07 16:39:44.045 +08:00] [WARN] [store.rs:999] ["[store 1] handle 1 pending peers include 1 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1822]
[2024/01/07 16:39:44.045 +08:00] [WARN] [store.rs:999] ["[store 1] handle 1 pending peers include 1 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1816]
[2024/01/07 16:39:44.572 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.195803191s, send: 0.000029982s, callback: 0s thread: sync-writer"] [takes=2195]
[2024/01/07 16:39:44.572 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.194970723s, send: 0.000026482s, callback: 0s thread: sync-writer"] [takes=2194]
[2024/01/07 16:39:44.572 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.195950804s, send: 0.000026311s, callback: 0s thread: sync-writer"] [takes=2195]
[2024/01/07 16:39:44.572 +08:00] [WARN] [store.rs:999] ["[store 1] handle 110 pending peers include 110 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2197]
[2024/01/07 16:39:44.572 +08:00] [WARN] [store.rs:999] ["[store 1] handle 119 pending peers include 119 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2197]
[2024/01/07 16:39:44.572 +08:00] [WARN] [store.rs:999] ["[store 1] handle 108 pending peers include 108 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2197]
[2024/01/07 16:39:45.356 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.307429111s, send: 0.000030356s, callback: 0s thread: sync-writer"] [takes=1307]
[2024/01/07 16:39:45.356 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.307435468s, send: 0.000028111s, callback: 0s thread: sync-writer"] [takes=1307]
[2024/01/07 16:39:45.356 +08:00] [WARN] [store.rs:999] ["[store 1] handle 175 pending peers include 152 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1311]
[2024/01/07 16:39:45.356 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 1.3087153790000001s, send: 0.000053301s, callback: 0s thread: sync-writer"] [takes=1308]
[2024/01/07 16:39:45.356 +08:00] [WARN] [store.rs:999] ["[store 1] handle 176 pending peers include 155 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1311]
[2024/01/07 16:39:45.356 +08:00] [WARN] [store.rs:999] ["[store 1] handle 155 pending peers include 155 ready, 0 entries, 0 messages and 0 snapshots"] [takes=1311]
[2024/01/07 16:39:47.387 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.812094433s, send: 0.000039904s, callback: 0s thread: sync-writer"] [takes=2812]
[2024/01/07 16:39:47.387 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.810821532s, send: 0.000075818s, callback: 0s thread: sync-writer"] [takes=2810]
[2024/01/07 16:39:47.387 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.812195442s, send: 0.000088624s, callback: 0s thread: sync-writer"] [takes=2812]
[2024/01/07 16:39:47.387 +08:00] [WARN] [store.rs:999] ["[store 1] handle 184 pending peers include 173 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2814]
[2024/01/07 16:39:47.387 +08:00] [WARN] [store.rs:999] ["[store 1] handle 191 pending peers include 191 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2815]
[2024/01/07 16:39:47.387 +08:00] [WARN] [store.rs:999] ["[store 1] handle 188 pending peers include 188 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2815]
[2024/01/07 16:39:48.038 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.679254154s, send: 0.000042336s, callback: 0s thread: sync-writer"] [takes=2679]
[2024/01/07 16:39:48.038 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.676675738s, send: 0.000043835s, callback: 0s thread: sync-writer"] [takes=2676]
[2024/01/07 16:39:48.038 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.678514685s, send: 0.000043762s, callback: 0s thread: sync-writer"] [takes=2678]
[2024/01/07 16:39:48.038 +08:00] [WARN] [store.rs:999] ["[store 1] handle 118 pending peers include 111 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2681]
[2024/01/07 16:39:48.038 +08:00] [WARN] [store.rs:999] ["[store 1] handle 203 pending peers include 200 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2681]
[2024/01/07 16:39:48.038 +08:00] [WARN] [store.rs:999] ["[store 1] handle 214 pending peers include 214 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2681]
[2024/01/07 16:39:50.065 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.67392705s, send: 0.000094879s, callback: 0s thread: sync-writer"] [takes=2673]
[2024/01/07 16:39:50.065 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.672336576s, send: 0.000149155s, callback: 0s thread: sync-writer"] [takes=2672]
[2024/01/07 16:39:50.065 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.671493778s, send: 0.000174822s, callback: 0s thread: sync-writer"] [takes=2671]
[2024/01/07 16:39:50.065 +08:00] [WARN] [store.rs:999] ["[store 1] handle 219 pending peers include 219 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2678]
[2024/01/07 16:39:50.065 +08:00] [WARN] [store.rs:999] ["[store 1] handle 228 pending peers include 228 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2678]
[2024/01/07 16:39:50.065 +08:00] [WARN] [store.rs:999] ["[store 1] handle 224 pending peers include 223 ready, 0 entries, 0 messages and 0 snapshots"] [takes=2678]
[2024/01/07 16:39:50.105 +08:00] [INFO] [kv.rs:1117] ["call CheckLeader failed"] [address=ipv4:*.*.*.*:60840] [err=Grpc(RemoteStopped)]
[2024/01/07 16:39:50.625 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.58315059s, send: 0.000085109s, callback: 0s thread: sync-writer"] [takes=2583]
[2024/01/07 16:39:50.625 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.581864045s, send: 0.000096699s, callback: 0s thread: sync-writer"] [takes=2581]
[2024/01/07 16:39:50.625 +08:00] [WARN] [write.rs:634] ["[store 1] async write too slow, write_kv: 0s, write_raft: 2.581480991s, send: 0.000088852s, callback: 0s thread: sync-w
| username: 路在何chu | Original post link

Is the hardware of the TiKV node hosting store 1 underpowered?

| username: TiDBer_C0UeEuOq | Original post link

store1 has the exact same configuration as the other TiKV nodes. We even replaced the node, but after the replacement, the errors just moved to another TiKV node. :joy:

| username: 小龙虾爱大龙虾 | Original post link

For business/SQL timeouts, start with the slow query analysis.
The slow writes on store 1 might indeed be an issue: check the TiKV => Storage and TiKV => Raft IO panels, and then look at the disk monitoring.
The bright line on the heatmap is not a hotspot; at only 565 KB/min it can hardly be considered one.
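
For instance, a rough sketch of the slow-query check (the time window is taken from the logs in this post; adjust as needed):

-- Sketch: slow statements across the cluster around the abnormal window
SELECT time, query_time, process_time, wait_time, query
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2024-01-07 16:39:00' AND '2024-01-07 16:41:00'
  AND query_time > 1
ORDER BY query_time DESC
LIMIT 20;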

| username: dba远航 | Original post link

The problem is here: write_raft: 1.822788231s.

| username: TiDBer_C0UeEuOq | Original post link

Disk IO is not very high, and the heatmap I captured shows only that one bright line. :joy:
It does look like an issue in the raft propose stage; the timings match the log warnings exactly.

| username: 路在何chu | Original post link

It was probably caused by some slow SQL.

| username: tug_twf | Original post link

In fact, from the machine monitoring perspective the only bottleneck is write_raft, and it almost always appears on one specific node. After that node was replaced, the issue shifted to another node, so we suspect frequent updates to a particular region are causing it. Neither the hotspot nor the machine analysis shows anything obvious, and no suspicious slow SQL has been found from the slow query angle either.
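
One way to probe the "frequent updates to a certain region" suspicion is PD's hot-region history (a sketch; it assumes the history recording that is enabled by default in this version, and store 1 plus the time window are taken from the logs above):

-- Sketch: regions that were write-hot on store 1 around the abnormal window
SELECT update_time, db_name, table_name, index_name,
       region_id, store_id, is_leader, flow_bytes, key_rate
FROM information_schema.tidb_hot_regions_history
WHERE update_time BETWEEN '2024-01-07 16:30:00' AND '2024-01-07 16:45:00'
  AND store_id = 1
  AND type = 'write'
ORDER BY flow_bytes DESC
LIMIT 20;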

| username: WalterWj | Original post link

Please post the Thread CPU and gRPC monitoring panels from the TiKV-Details dashboard.

| username: terry0219 | Original post link

Tables with primary keys can also be scattered. Use ALTER TABLE t MODIFY COLUMN id BIGINT AUTO_RANDOM(5); to switch from AUTO_INCREMENT to AUTO_RANDOM (see the sketch below).
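
A minimal sketch of that change (table and column names are placeholders; as far as I know, dropping the AUTO_INCREMENT attribute requires the session variable below, and AUTO_RANDOM only applies to a clustered integer primary key):

SET @@tidb_allow_remove_auto_inc = 1;
ALTER TABLE t MODIFY COLUMN id BIGINT AUTO_RANDOM(5);

-- For a NONCLUSTERED table (rows keyed by the hidden _tidb_rowid), the row
-- writes can instead be scattered with SHARD_ROW_ID_BITS:
ALTER TABLE t SHARD_ROW_ID_BITS = 4;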

| username: TiDBer_C0UeEuOq | Original post link

The abnormal node is tikv02

For reference, here is the tikv02 latency monitoring during the abnormal period: