TiKV CPU Anomaly, Raft Election Failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv cpu异常,raft选举失败

| username: Inkjade

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4.0
[Encountered Problem]
High CPU usage, unable to recover automatically, log errors
[peer.rs:553] [“handle raft message err”] [err_code=KV:Raft:StepLocalMsg] [err=“Raft raft: cannot step raft local message”] [peer_id=1456675] [region_id=1456674]

[Problem Phenomenon and Impact]


[Attachment]

| username: xfworld | Original post link

The logs in the screenshot seem unrelated.

  1. All are lock conflicts.
  2. raftStore is abnormal.

Question 1: Which nodes are abnormal?
Question 2: What operations were performed before these issues occurred?
Question 3: Are the TiDB nodes and PD nodes still functioning normally?
Question 4: Are there any key logs that can be referenced before and after these issues occurred?

| username: Inkjade | Original post link

Question 1: Which nodes are abnormal?
All TiKV nodes.

Question 2: What operations were performed before these issues occurred?
Only special abnormal operations.

Question 3: Are the TiDB nodes and PD nodes still normal?
PD nodes are normal. Since the issue couldn’t be resolved, all TiKV nodes were restarted, and the fault disappeared.

Question 4: Are there any key logs before and after these issues that can be referenced?
Analyzed all TiKV node logs and found the issue occurred on the 19th. The logs are as follows:

[2022/09/19 21:29:00.746 +08:00] [INFO] [apply.rs:1895] [“add learner successfully”] [region=“id: 1386748 start_key: 7480000000000E94FF085F72038ACE4E5DFF380000010150524FFF4649545F48FF414EFF444645455F59FF48FF4B000000000000F9FF038000000001348BFFF700000000000000F8 region_epoch { conf_ver: 17207 version: 3917 } peers { id: 1404274 store_id: 5 } peers { id: 1412642 store_id: 1 } peers { id: 1417136 store_id: 6 }”] [peer=“id: 1418341 store_id: 4 role: Learner”] [peer_id=1404274] [region_id=1386748]
[2022/09/19 21:29:00.748 +08:00] [INFO] [raft.rs:2609] [“switched to configuration”] [config=“Configuration { voters: Configuration { incoming: Configuration { voters: {1417136, 1412642, 1404274} }, outgoing: Configuration { voters: {} } }, learners: {1418341}, learners_next: {}, auto_leave: false }”] [raft_id=1404274] [region_id=1386748]
[2022/09/19 21:29:00.844 +08:00] [INFO] [size.rs:168] [“approximate size over threshold, need to do split check”] [threshold=150994944] [size=155228686] [region_id=1412818]
[2022/09/19 21:29:01.026 +08:00] [INFO] [split_check.rs:289] [“update approximate size and keys with accurate value”] [keys=583983] [size=125929726] [region_id=1412818]
[2022/09/19 21:29:01.101 +08:00] [WARN] [endpoint.rs:606] [error-response] [err=“Key is locked (will clean up) primary_lock: 7480000000000011155F698000000000000001013230323230393132FF323132345F593246FF7A61474A68593273FF7459574E3061585AFF6C4C584E6C636E5AFF70593255744D5441FF352E315F4E574D30FF5A6A6B354F574A68FF4E5441794E445135FF59324A6859546B31FF4E7A6C6C4D574579FF4D44493059324A41FF4D5463794C6A4977FF4C6A45774F433478FF4D446B3D00000000FB lock_version: 436101222375882805 key: 7480000000000011155F698000000000000003038000002F149A574C038000000000209EF2 lock_ttl: 3337 txn_size: 348 lock_type: Del min_commit_ts: 436101222375882806”]
[2022/09/19 21:29:01.435 +08:00] [INFO] [size.rs:168] [“approximate size over threshold, need to do split check”] [threshold=150994944] [size=186687892] [region_id=1375671]
[2022/09/19 21:29:01.437 +08:00] [WARN] [endpoint.rs:606] [error-response] [err=“Key is locked (will clean up) primary_lock: 7480000000000011135F698000000000000001013230323230393132FF323132345F593246FF7A61474A68593273FF7459574E3061585AFF6C4C584E6C636E5AFF70593255744D5441FF352E315F4E574D30FF5A6A6B354F574A68FF4E5441794E445135FF59324A6859546B31FF4E7A6C6C4D574579FF4D44493059324A41FF4D5463794C6A4977FF4C6A45774F433478FF4D446B3D00000000FB lock_version: 436101222559383593 key: 7480000000000011135F698000000000000003038000002F149A574C038000000000209EF2 lock_ttl: 3093 txn_size: 348 lock_type: Del min_commit_ts: 436101222559383594”]
[2022/09/19 21:29:01.481 +08:00] [INFO] [raft.rs:1336] [“received a message with higher term from 1418340”] [“msg type”=MsgRequestVote] [message_term=1074] [term=1073] [from=1418340] [raft_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.481 +08:00] [INFO] [raft.rs:1092] [“became follower at term 1074”] [term=1074] [raft_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.482 +08:00] [INFO] [raft.rs:1532] [“[logterm: 1073, index: 990874, vote: 0] cast vote for 1418340 [logterm: 1073, index: 990874] at term 1074”] [“msg type”=MsgRequestVote] [term=1074] [msg_index=990874] [msg_term=1073] [from=1418340] [vote=0] [log_index=990874] [log_term=1073] [raft_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.482 +08:00] [INFO] [apply.rs:1383] [“execute admin command”] [command=“cmd_type: ChangePeerV2 change_peer_v2 { changes { peer { id: 1418340 store_id: 4 } } changes { change_type: AddLearnerNode peer { id: 1418271 store_id: 1 role: Learner } } }”] [index=990874] [term=1073] [peer_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.482 +08:00] [INFO] [apply.rs:1936] [“exec ConfChangeV2”] [epoch=“conf_ver: 17526 version: 3466”] [kind=EnterJoint] [peer_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.482 +08:00] [INFO] [apply.rs:2116] [“conf change successfully”] [“current region”=“id: 822789 start_key: 748000000000020DFF0F5F728AC93DDCA1FF4000010000000000FA end_key: 74800000000005A0FFE75F720132303232FF30383331FF313536FF3439303237FF3635FF303939343137FF36FF30310000000000FAFF038000000001348BFF9F00000000000000F8 region_epoch { conf_ver: 17528 version: 3466 } peers { id: 1411028 store_id: 7 } peers { id: 1414389 store_id: 5 } peers { id: 1418271 store_id: 1 role: DemotingVoter } peers { id: 1418340 store_id: 4 role: IncomingVoter }”] [“original region”=“id: 822789 start_key: 748000000000020DFF0F5F728AC93DDCA1FF4000010000000000FA end_key: 74800000000005A0FFE75F720132303232FF30383331FF313536FF3439303237FF3635FF303939343137FF36FF30310000000000FAFF038000000001348BFF9F00000000000000F8 region_epoch { conf_ver: 17526 version: 3466 } peers { id: 1411028 store_id: 7 } peers { id: 1414389 store_id: 5 } peers { id: 1418271 store_id: 1 } peers { id: 1418340 store_id: 4 role: Learner }”] [changes=“[peer { id: 1418340 store_id: 4 }, change_type: AddLearnerNode peer { id: 1418271 store_id: 1 role: Learner }]”] [peer_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.488 +08:00] [INFO] [raft.rs:2609] [“switched to configuration”] [config=“Configuration { voters: Configuration { incoming: Configuration { voters: {1414389, 1411028, 1418340} }, outgoing: Configuration { voters: {1411028, 1414389, 1418271} } }, learners: {}, learners_next: {1418271}, auto_leave: false }”] [raft_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.507 +08:00] [INFO] [apply.rs:1383] [“execute admin command”] [command=“cmd_type: ChangePeerV2 change_peer_v2 {}”] [index=990876] [term=1074] [peer_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.507 +08:00] [INFO] [apply.rs:1936] [“exec ConfChangeV2”] [epoch=“conf_ver: 17528 version: 3466”] [kind=LeaveJoint] [peer_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.507 +08:00] [INFO] [apply.rs:2146] [“leave joint state successfully”] [region=“id: 822789 start_key: 748000000000020DFF0F5F728AC93DDCA1FF4000010000000000FA end_key: 74800000000005A0FFE75F720132303232FF30383331FF313536FF3439303237FF3635FF303939343137FF36FF30310000000000FAFF038000000001348BFF9F00000000000000F8 region_epoch { conf_ver: 17530 version: 3466 } peers { id: 1411028 store_id: 7 } peers { id: 1414389 store_id: 5 } peers { id: 1418271 store_id: 1 role: Learner } peers { id: 1418340 store_id: 4 }”] [peer_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.507 +08:00] [INFO] [apply.rs:1383] [“execute admin command”] [command=“cmd_type: ChangePeer change_peer { change_type: RemoveNode peer { id: 1418271 store_id: 1 role: Learner } }”] [index=990877] [term=1074] [peer_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.507 +08:00] [INFO] [apply.rs:1755] [“exec ConfChange”] [epoch=“conf_ver: 17530 version: 3466”] [type=RemoveNode] [peer_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.507 +08:00] [INFO] [apply.rs:1863] [“remove peer successfully”] [region=“id: 822789 start_key: 748000000000020DFF0F5F728AC93DDCA1FF4000010000000000FA end_key: 74800000000005A0FFE75F720132303232FF30383331FF313536FF3439303237FF3635FF303939343137FF36FF30310000000000FAFF038000000001348BFF9F00000000000000F8 region_epoch { conf_ver: 17530 version: 3466 } peers { id: 1411028 store_id: 7 } peers { id: 1414389 store_id: 5 } peers { id: 1418271 store_id: 1 role: Learner } peers { id: 1418340 store_id: 4 }”] [peer=“id: 1418271 store_id: 1 role: Learner”] [peer_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.508 +08:00] [INFO] [raft.rs:2609] [“switched to configuration”] [config=“Configuration { voters: Configuration { incoming: Configuration { voters: {1414389, 1411028, 1418340} }, outgoing: Configuration { voters: {} } }, learners: {1418271}, learners_next: {}, auto_leave: false }”] [raft_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.509 +08:00] [INFO] [raft.rs:2609] [“switched to configuration”] [config=“Configuration { voters: Configuration { incoming: Configuration { voters: {1414389, 1411028, 1418340} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }”] [raft_id=1414389] [region_id=822789]
[2022/09/19 21:29:01.629 +08:00] [ERROR] [peer.rs:553] [“handle raft message err”] [err_code=KV:Raft:StepLocalMsg] [err=“Raft raft: cannot step raft local message”] [peer_id=1417858] [region_id=450799]
[2022/09/19 21:29:01.629 +08:00] [ERROR] [peer.rs:553] [“handle raft message err”] [err_code=KV:Raft:StepLocalMsg] [err=“Raft raft: cannot step raft local message”] [peer_id=1378100] [region_id=1378099]
[2022/09/19 21:29:01.629 +08:00] [ERROR] [peer.rs:553] [“handle raft message err”] [err_code=KV:Raft:StepLocalMsg] [err=“Raft raft: cannot step raft local message”] [peer_id=443696] [region_id=3632]
[2022/09/19 21:29:01.629 +08:00] [ERROR] [peer.rs:553] [“handle raft message err”] [err_code=KV:Raft:StepLocalMsg] [err=“Raft raft: cannot step raft local message”] [peer_id=443696] [region_id=3632]
[2022/09/19 21:29:01.630 +08:00] [ERROR] [peer.rs:553] [“handle raft message err”] [err_code=KV:Raft:StepLocalMsg] [err=“Raft raft: cannot step raft local message”] [peer_id=1223182] [region_id=1193821]
[2022/09/19 21:29:01.630 +08:00] [ERROR] [peer.rs:553] [“handle raft message err”] [err_code=KV:Raft:StepLocalMsg] [err=“Raft raft: cannot step raft local message”] [peer_id=1417858] [region_id=450799]
[2022/09/19 21:29:01.632 +08:00] [INFO] [split_check.rs:289] [“update approximate size and keys with accurate value”] [keys=481443] [size=114426511] [region_id=1375671]
[2022/09/19 21:29:01.797 +08:00] [INFO] [apply.rs:1383] [“execute admin command”] [command=“cmd_type: ChangePeerV2 change_peer_v2 { changes { peer { id: 1418341 store_id: 4 } } changes { change_type: AddLearnerNode peer { id: 1412642 store_id: 1 role: Learner } } }”] [index=43347] [term=53] [peer_id=1404274] [region_id=1386748]
[2022/09/19 21:29:01.797 +08:00] [INFO] [apply.rs:1936] [“exec ConfChangeV2”] [epoch=“conf_ver: 17208 version: 3917”] [kind=EnterJoint] [peer_id=1404274] [region_id=1386748]
[2022/09/19 21:29:01.797 +08:00] [INFO] [apply.rs:2116] [“conf change successfully”] [“current region”="id: 1386748 start_key: 7480000000000E94FF085F72038ACE4E5DFF380000010150524FFF4649545F48FF414EFF444645455F59FF48FF4B000000000000F9FF038000000001348BFFF700000000000000F8 region_epoch { conf_ver: 17210 version: 3917 } peers { id: 140427

| username: xfworld | Original post link

  1. What is the overall topology of the cluster?

  2. Which table does region 822789 belong to? The old voting is also not quite normal.

  3. According to the description in the image, does your TiKV cluster have 4 nodes? Even number? (That’s quite something)

| username: Inkjade | Original post link

Overall topology:

3 PD, 3 TiDB servers, 5 TiKV

| username: yilong | Original post link

  1. During the issue occurrence, was the CPU of the raft store in the tikv-detail monitoring also fully utilized?
  2. Did you check from the dashboard if there are any SQL queries consuming a lot of resources? For example, coprocessor.