TiKV service is running but reports "failed to send extra message"; unable to start TiKV

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv service 在运行,failed to send extra message 无法启动 tikv

| username: Steve阿辉

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.2
[Reproduction Path] The cluster was running normally after ucan last started it. Checking the logs, I found that this error was reported from the morning of the 3rd to the 8th, and the resulting error logs took up a large amount of disk space.
[Encountered Problem: Problem Phenomenon and Impact] Data appeared to be unbalanced: one node held tens of GB more than another, even though monitoring showed that PD was scheduling. I restarted the cluster, but one of the nodes failed to start. After troubleshooting, I found that the disk imbalance was caused by log files, and that the log volume came from the “failed to send extra message” errors. How can I fix TiKV? On the affected node, the TiKV service is still running, with CPU usage at around 30%.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

total 26G
17301515 4.0K . 17301555 301M tikv-2023-03-09T08-54-41.841.log 17301578 301M tikv-2023-03-10T01-23-05.728.log 17301601 301M tikv-2023-03-10T18-06-36.398.log
17301506 4.0K … 17301556 301M tikv-2023-03-09T09-35-18.365.log 17301579 301M tikv-2023-03-10T02-08-40.432.log 17301602 301M tikv-2023-03-10T18-50-29.603.log
17301532 301M tikv-2023-03-06T03-39-51.400.log 17301557 301M tikv-2023-03-09T10-16-04.712.log 17301580 301M tikv-2023-03-10T02-54-24.028.log 17301603 301M tikv-2023-03-10T19-35-29.952.log
17301535 301M tikv-2023-03-06T18-57-08.744.log 17301558 301M tikv-2023-03-09T10-57-03.234.log 17301581 301M tikv-2023-03-10T03-40-24.456.log 17301604 301M tikv-2023-03-10T20-21-33.700.log
17301536 301M tikv-2023-03-06T19-56-55.422.log 17301559 301M tikv-2023-03-09T11-38-26.964.log 17301582 301M tikv-2023-03-10T04-26-32.260.log 17301605 301M tikv-2023-03-10T21-08-31.723.log
17301537 301M tikv-2023-03-07T10-30-28.870.log 17301560 301M tikv-2023-03-09T12-20-00.914.log 17301583 301M tikv-2023-03-10T05-12-54.841.log 17301606 301M tikv-2023-03-10T21-56-31.343.log
17301538 301M tikv-2023-03-07T12-02-05.787.log 17301561 301M tikv-2023-03-09T13-01-23.824.log 17301584 301M tikv-2023-03-10T05-59-38.642.log 17301607 301M tikv-2023-03-10T22-45-30.679.log
[2023/03/11 10:33:57.460 +08:00] [INFO] [pd.rs:1393] [“try to change peer”] [peer=“id: 749208695 store_id: 1 role: Learner”] [change_type=RemoveNode] [region_id=749208694]
[2023/03/11 10:33:57.460 +08:00] [INFO] [peer.rs:4467] [“propose conf change peer”] [kind=Simple] [changes=“[change_type: RemoveNode peer { id: 749208695 store_id: 1 role: Learner }]”] [peer_id=749208697] [region_id=749208694]
[2023/03/11 10:33:57.460 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 39198 store_id: 1”] [peer_id=749191561] [region_id=39197] [type=MsgHibernateRequest]
[2023/03/11 10:33:57.460 +08:00] [INFO] [apply.rs:1396] [“execute admin command”] [command=“cmd_type: ChangePeer change_peer { change_type: RemoveNode peer { id: 749208695 store_id: 1 role: Learner } }”] [index=40] [term=27] [peer_id=749208697] [region_id=749208694]
[2023/03/11 10:33:57.460 +08:00] [INFO] [apply.rs:1770] [“exec ConfChange”] [epoch=“conf_ver: 22 version: 15823”] [type=RemoveNode] [peer_id=749208697] [region_id=749208694]
[2023/03/11 10:33:57.460 +08:00] [INFO] [apply.rs:1878] [“remove peer successfully”] [region=“id: 749208694 start_key: 7480000000000001FF0F5F698000000000FF00000E01616D617AFF6F6E2E69FF740000FF0000000000F80419FFAAB2000000000003FF8000000001137393FF0000000000000000F7 end_key: 7480000000000001FF0F5F698000000000FF00000E01616D617AFF6F6E2E69FF740000FF0000000000F80419FFAABC000000000003FF8000000016161A34FF0000000000000000F7 region_epoch { conf_ver: 22 version: 15823 } peers { id: 749208695 store_id: 1 role: Learner } peers { id: 749208696 store_id: 4 } peers { id: 749208697 store_id: 744798344 } peers { id: 753460446 store_id: 749350518 }”] [peer=“id: 749208695 store_id: 1 role: Learner”] [peer_id=749208697] [region_id=749208694]
[2023/03/11 10:33:57.460 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 162626 store_id: 1”] [peer_id=749186107] [region_id=162625] [type=MsgHibernateRequest]
[2023/03/11 10:33:57.460 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 749201158 store_id: 1”] [peer_id=749201160] [region_id=749201157] [type=MsgHibernateRequest]
[2023/03/11 10:33:57.460 +08:00] [INFO] [raft.rs:2646] [“switched to configuration”] [config=“Configuration { voters: Configuration { incoming: Configuration { voters: {749208696, 749208697, 753460446} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }”] [raft_id=749208697] [region_id=749208694]
[2023/03/11 10:33:57.460 +08:00] [INFO] [peer.rs:3476] [“notify pd with change peer region”] [region=“id: 749208694 start_key: 7480000000000001FF0F5F698000000000FF00000E01616D617AFF6F6E2E69FF740000FF0000000000F80419FFAAB2000000000003FF8000000001137393FF0000000000000000F7 end_key: 7480000000000001FF0F5F698000000000FF00000E01616D617AFF6F6E2E69FF740000FF0000000000F80419FFAABC000000000003FF8000000016161A34FF0000000000000000F7 region_epoch { conf_ver: 23 version: 15823 } peers { id: 749208696 store_id: 4 } peers { id: 749208697 store_id: 744798344 } peers { id: 753460446 store_id: 749350518 }”] [peer_id=749208697] [region_id=749208694]
[2023/03/11 10:33:57.460 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 184906 store_id: 1”] [peer_id=749156135] [region_id=184905] [type=MsgHibernateRequest]
[2023/03/11 10:33:57.461 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 228622 store_id: 1”] [peer_id=749200956] [region_id=228621] [type=MsgHibernateRequest]
[2023/03/11 10:33:57.462 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 168766 store_id: 1”] [peer_id=749196917] [region_id=168765] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.218 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 133926 store_id: 1”] [peer_id=749174133] [region_id=133925] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.218 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 64394 store_id: 1”] [peer_id=749193035] [region_id=64393] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.219 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 181338 store_id: 1”] [peer_id=747705560] [region_id=181337] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.221 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 19070 store_id: 1”] [peer_id=749201314] [region_id=19069] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.221 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 47754 store_id: 1”] [peer_id=749195078] [region_id=47753] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.222 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 49626 store_id: 1”] [peer_id=749182753] [region_id=49625] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.222 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 749252355 store_id: 1”] [peer_id=749252357] [region_id=749252354] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.222 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 123202 store_id: 1”] [peer_id=749202146] [region_id=123201] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.223 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 9894 store_id: 1”] [peer_id=749219146] [region_id=9893] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.223 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 135630 store_id: 1”] [peer_id=749164422] [region_id=135629] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.224 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 749252180 store_id: 1”] [peer_id=749252182] [region_id=749252179] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.224 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 749270009 store_id: 1”] [peer_id=749270501] [region_id=749270007] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.224 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 55930 store_id: 1”] [peer_id=749236927] [region_id=55929] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.224 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 160150 store_id: 1”] [peer_id=749185509] [region_id=160149] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.225 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 224947 store_id: 1”] [peer_id=749195915] [region_id=224946] [type=MsgHibernateRequest]
[2023/03/11 10:33:58.225 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Disconnected)] [target=“id: 34194 store_id: 1”] [peer_id=749193811] [region_id=34193] [type=MsgHibernateRequest]
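
A quick tally of the log entries can show whether a single error dominates the log volume and which store it targets. The following is a minimal sketch in Python, assuming the bracketed field layout seen in the excerpts above; the `summarize` function and its regular expressions are illustrative and not part of any TiKV tooling.

```python
import re
from collections import Counter

# Assumed layout, based on the excerpts above:
# [timestamp] [LEVEL] [file:line] ["message"] [key=value] ...
# Both straight and curly quotes are accepted, since the pasted logs use curly ones.
HEADER_RE = re.compile(r'^\[([^\]]+)\] \[(\w+)\] \[[^\]]+\] \[["“]?([^"”\]]+)["”]?\]')
ERR_RE = re.compile(r'\[err=([^\]]+)\]')
STORE_RE = re.compile(r'\[target=["“][^"”]*store_id: (\d+)')


def summarize(lines):
    """Count log entries by (level, message, err, target store)."""
    counts = Counter()
    for line in lines:
        header = HEADER_RE.match(line)
        if not header:
            continue  # skip truncated or non-log lines
        _, level, message = header.groups()
        err = ERR_RE.search(line)
        store = STORE_RE.search(line)
        key = (
            level,
            message,
            err.group(1) if err else None,
            store.group(1) if store else None,
        )
        counts[key] += 1
    return counts
```

Passing the lines of one of the rotated log files to `summarize` and printing `counts.most_common(5)` lists the most frequent entries. In the excerpt above, every “failed to send extra message” error targets store 1.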

| username: Steve阿辉 | Original post link

The earliest log entries:

[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726168]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726169]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726170]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726195]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726198]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726196]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726197]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726199]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726200]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726194]
[2023/03/07 12:02:06.157 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749257801, leader may None" not_leader { region_id: 749257801 }))”] [cid=1726224]

[2023/03/07 12:02:06.159 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 155417, leader may Some(id: 749223088 store_id: 744798049)" not_leader { region_id: 155417 leader { id: 749223088 store_id: 744798049 } }”]
[2023/03/07 12:02:06.159 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 153173, leader may None" not_leader { region_id: 153173 }”]
[2023/03/07 12:02:06.159 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 153173, leader may None" not_leader { region_id: 153173 }”]

[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749244677, leader may None" not_leader { region_id: 749244677 }”]
[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749244677, leader may None" not_leader { region_id: 749244677 }”]
[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749244677, leader may None" not_leader { region_id: 749244677 }”]
[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749244677, leader may None" not_leader { region_id: 749244677 }”]
[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749244677, leader may None" not_leader { region_id: 749244677 }”]
[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749244677, leader may None" not_leader { region_id: 749244677 }”]
[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749244677, leader may None" not_leader { region_id: 749244677 }”]
[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749244677, leader may None" not_leader { region_id: 749244677 }”]
[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749265529, leader may Some(id: 749265530 store_id: 744798049)" not_leader { region_id: 749265529 leader { id: 749265530 store_id: 744798049 } }”]
[2023/03/07 12:02:06.465 +08:00] [WARN] [endpoint.rs:621] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 749272595, leader may None" not_leader { region_id: 749272595 }”]

leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))"] [cid=1726568]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726569]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726570]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726571]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726572]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726573]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726574]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726575]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726576]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726577]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726578]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726579]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726580]
[2023/03/07 12:02:07.132 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726581]
[2023/03/07 12:02:07.133 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726582]
[2023/03/07 12:02:07.133 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726583]
[2023/03/07 12:02:07.133 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726584]
[2023/03/07 12:02:07.133 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726604]
[2023/03/07 12:02:07.133 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726585]
[2023/03/07 12:02:07.133 +08:00] [INFO] [scheduler.rs:596] [“get snapshot failed”] [err=“Error(Request(message: "peer is not leader for region 749274257, leader may Some(id: 749274260 store_id: 744798049)" not_leader { region_id: 749274257 leader { id: 749274260 store_id: 744798049 } }))”] [cid=1726605]

Starting from the 8th, the logs changed to:

[2023/03/08 22:33:51.619 +08:00] [ERROR] [peer.rs:4976] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 38574 store_id: 1”] [peer_id=749180321] [region_id=38573] [type=MsgHibernateRequest]
[2023/03/08 22:33:51.620 +08:00] [ERROR] [peer.rs:4976] ["failed

| username: TiDBer_jYQINSnf | Original post link

Is the network of this node functioning properly?

| username: buddyyuan | Original post link

Check the network monitoring with node_exporter to see the latency.
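
When monitoring data is not available, a direct connectivity probe from another node can serve as a rough substitute. The sketch below measures only the TCP handshake time to a given host and port; it is not a replacement for node_exporter metrics, and the function name `tcp_connect_latency` is illustrative.

```python
import socket
import time


def tcp_connect_latency(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or an unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

For example, calling `tcp_connect_latency(host, 20160)` from each of the other TiKV nodes, where `host` is the address of the node that hosts store 1 and 20160 is TiKV's default service port, shows whether that node is reachable at all. Adjust the port if the deployment uses a different one.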

| username: Steve阿辉 | Original post link

The node_exporter instances are all down.

| username: ljluestc | Original post link

Here is an outline of steps for handling these errors:

  1. Check the cluster status to ensure all nodes are running and accessible.
  2. Check the logs to identify the errors.
  3. For “get snapshot failed” errors with the message “peer is not leader for region”:
    a. Identify the region without a leader.
    b. Wait for the Raft group to elect a new leader for that region; election is automatic once a majority of peers can communicate.
    c. Update the peer configuration to ensure each peer knows the new leader.
  4. For “Region error (will back off and retry)” errors with the message “peer is not leader for region”:
    a. Set a timer to wait for a period before retrying the operation.
    b. If the operation fails again, exponentially increase the backoff time.
  5. Monitor the logs to ensure there are no more errors.

Here is a code skeleton for the above steps (the helper functions are placeholders):

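import re
import time

# Matches the region ID in messages such as
# "peer is not leader for region 749257801, leader may None".
NOT_LEADER_RE = re.compile(r"peer is not leader for region (\d+)")


def check_cluster_status():
    # Placeholder: check that all nodes are running and accessible.
    pass


def identify_region_without_leader(line):
    # Extract the region ID from a "peer is not leader" log line.
    match = NOT_LEADER_RE.search(line)
    return int(match.group(1)) if match else None


def elect_new_leader(region_id):
    # Placeholder: return the new leader once the region has one.
    return None


def update_peer_configuration(region_id, new_leader):
    # Placeholder: make sure each peer knows the new leader.
    pass


def perform_operation():
    # Placeholder: retry the failed request and return True on success.
    return True


def wait_and_retry(max_attempts=5):
    # Wait before each retry, doubling the backoff after every failed attempt.
    backoff_time = 1
    for _ in range(max_attempts):
        time.sleep(backoff_time)
        if perform_operation():
            return True
        backoff_time *= 2
    return False


def fix_errors(log_lines):
    check_cluster_status()

    for line in log_lines:
        if "peer is not leader for region" not in line:
            continue
        if "get snapshot failed" in line:
            region_id = identify_region_without_leader(line)
            new_leader = elect_new_leader(region_id)
            update_peer_configuration(region_id, new_leader)
        elif "Region error (will back off and retry)" in line:
            wait_and_retry()

| username: h5n1 | Original post link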

Check the status of this region using pd-ctl region *region_id*; it seems that the region has no leader.
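
The leader check can also be scripted against the saved pd-ctl output. This is a minimal sketch, assuming that the region command prints a JSON object with a `leader` field containing `id` and `store_id`; those field names and the `region_leader` helper are assumptions, so verify them against the actual output of your pd-ctl version.

```python
import json


def region_leader(pd_ctl_output):
    """Return (peer_id, store_id) of the region leader, or None if there is none.

    Assumes the JSON layout described above; a missing, empty, or
    zero-ID "leader" field is treated as "no leader".
    """
    region = json.loads(pd_ctl_output)
    leader = region.get("leader") or {}
    if not leader.get("id"):
        return None
    return leader["id"], leader.get("store_id")
```

For instance, given output shaped like `{"id": 155417, "leader": {"id": 749223088, "store_id": 744798049}}` (values taken from the log lines earlier in the thread), the function returns `(749223088, 744798049)`. For a region whose output has no `leader` field it returns `None`.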

What do the logs show when TiKV fails to start?