All TiKV Nodes Report Error [failed to send extra message]

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV各节点都提示错误【failed to send extra message】

| username: Johnnes_Xnn

【TiDB Environment】Production
【TiDB Version】7.1.0
【Reproduction Path】Continuous errors
【Encountered Issue: TiKV nodes report error [failed to send extra message]】
【TiKV Logs】
[2024/01/10 23:12:53.145 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=Grpc(RemoteStopped)] [region_id=23107861] [to_addr=172.25.227.51:20170]
[2024/01/10 23:12:53.148 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: })))”] [region_id=23101477] [to_addr=172.25.227.51:20170]
[2024/01/10 23:12:53.149 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: })))”] [region_id=23109861] [to_addr=172.25.227.51:20170]
[2024/01/10 23:12:53.150 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: })))”] [region_id=23101465] [to_addr=172.25.227.51:20170]
[2024/01/10 23:13:59.211 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: })))”] [region_id=23092977] [to_addr=172.25.227.51:20170]
[2024/01/10 23:13:59.212 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFailure(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: }))”] [region_id=23094445] [to_addr=172.25.227.51:20170]
[2024/01/10 23:13:59.212 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: })))”] [region_id=23092865] [to_addr=172.25.227.51:20170]
[2024/01/10 23:14:07.319 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: })))”] [region_id=23096025] [to_addr=172.25.227.51:20170]
[2024/01/10 23:14:08.727 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: })))”] [region_id=23109213] [to_addr=172.25.227.51:20170]
[2024/01/10 23:14:42.200 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: })))”] [region_id=23092101] [to_addr=172.25.227.51:20170]
[2024/01/10 23:14:42.390 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 1-CANCELLED, message: "CANCELLED", details: })))”] [region_id=23094665] [to_addr=172.25.227.51:20170]
[2024/01/10 23:17:01.316 +08:00] [ERROR] [transport.rs:99] [“failed to send significant msg”] [msg=“RaftlogFetched(FetchedLogs { context: GetEntriesContext(SendAppend { to: 23091188, term: 7, aggressively: false }), logs: RaftlogFetchResult { ents: Ok(), low: 6, max_size: 1048576, hit_size_limit: true, tried_cnt: 1, term: 7 } })”]
[2024/01/10 23:17:27.779 +08:00] [ERROR] [pd.rs:2393] [“send request failed”] [err=“"Disconnected(…)"”] [cmd_type=PrepareMerge] [region_id=23091037]
[2024/01/10 23:25:08.617 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 19665127 store_id: 1”] [peer_id=19665128] [region_id=19665126] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 231563 store_id: 1”] [peer_id=4427582] [region_id=231561] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 23091072 store_id: 4744287”] [peer_id=23091070] [region_id=23091069] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5034971 store_id: 4744287”] [peer_id=4436417] [region_id=15841] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5026146 store_id: 4744287”] [peer_id=4436414] [region_id=4427171] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5034352 store_id: 4744287”] [peer_id=4431039] [region_id=18865] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5034573 store_id: 4744287”] [peer_id=4432826] [region_id=21889] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.489 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 20084015 store_id: 1”] [peer_id=20084014] [region_id=20084013] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.623 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 44369 store_id: 1”] [peer_id=4434098] [region_id=44367] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 23065141 store_id: 4744287”] [peer_id=23065139] [region_id=23065138] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5035593 store_id: 4744287”] [peer_id=19586103] [region_id=4424257] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 23058743 store_id: 1”] [peer_id=23058742] [region_id=23058741] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 19969866 store_id: 1”] [peer_id=19969865] [region_id=19969864] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 9286837 store_id: 1”] [peer_id=4430861] [region_id=73773] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 19897511 store_id: 4744287”] [peer_id=19897509] [region_id=19897508] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.643 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5026285 store_id: 4744287”] [peer_id=4428307] [region_id=25101] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.643 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 20091724 store_id: 4744287”] [peer_id=20091722] [region_id=20091721] [type=MsgHibernateRequest]
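
To see at a glance which errors dominate a log like the one above, the lines can be tallied by message and error kind. The following is a minimal Python sketch, not an official TiKV tool. The regular expressions and the names summarize and error_kind are assumptions made for illustration, written against the log layout shown in this post (including the curly quotes introduced by the forum).

```python
import re
from collections import Counter

# Log message: the quoted field right after the [file.rs:line] field.
MSG_RE = re.compile(r'\[ERROR\] \[[^\]]+\] \[["“]([^"”]+)["”]\]')
# gRPC status name, e.g. "8-RESOURCE_EXHAUSTED" -> "RESOURCE_EXHAUSTED".
CODE_RE = re.compile(r'code: \d+-([A-Z_]+)')
# Fallback: leading error variant, e.g. "Transport(Full)" or "Grpc(RemoteStopped)".
ERR_RE = re.compile(r'err=["“]*(\w+(?:\(\w+\))?)')


def error_kind(line):
    """Return a short label for the error on this line, or "-" if none is found."""
    m = CODE_RE.search(line) or ERR_RE.search(line)
    return m.group(1) if m else "-"


def summarize(lines):
    """Count (message, error kind) pairs across ERROR log lines."""
    counts = Counter()
    for line in lines:
        msg = MSG_RE.search(line)
        if msg:
            counts[(msg.group(1), error_kind(line))] += 1
    return counts


# Three lines copied from the log excerpt above.
SAMPLE = [
    '[2024/01/10 23:12:53.145 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=Grpc(RemoteStopped)] [region_id=23107861] [to_addr=172.25.227.51:20170]',
    '[2024/01/10 23:14:42.390 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 1-CANCELLED, message: "CANCELLED", details: })))”] [region_id=23094665] [to_addr=172.25.227.51:20170]',
    '[2024/01/10 23:25:08.617 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 19665127 store_id: 1”] [peer_id=19665128] [region_id=19665126] [type=MsgHibernateRequest]',
]

for (message, kind), n in summarize(SAMPLE).most_common():
    print(n, message, kind)
```

For a real log, the list can be replaced by an open file handle, since summarize only iterates over lines.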

| username: tidb狂热爱好者 | Original post link

What did you do?

| username: xfworld | Original post link

It looks like the connection was lost, and the cluster is not functioning properly.

| username: TIDB-Learner | Original post link

Documentation on snapshots: TiKV 源码解析系列文章(十)Snapshot 的发送和接收 (TiKV Source Code Analysis Series, Part 10: Sending and Receiving Snapshots) | PingCAP

| username: Fly-bird | Original post link

Check if there are any network issues at this node.

| username: Johnnes_Xnn | Original post link

Ping and telnet port checks between the cluster nodes are all normal.
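
A port check like this can be scripted so that it is repeatable for every node. Below is a minimal Python sketch, assuming plain TCP reachability is what the telnet test verified. The function name port_open and the timeout value are illustrative. A successful connection only shows that the port accepts TCP connections; it says nothing about whether the store behind it is overloaded.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_open("172.25.227.51", 20170) would test the address that appears as to_addr in the logs. The RESOURCE_EXHAUSTED status in those logs is returned by the receiving side, which already suggests the peer was reachable at that moment.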

| username: Fly-bird | Original post link

Have you tried restarting a single TiKV using tiup?

| username: Johnnes_Xnn | Original post link

Previously, a second TiKV instance was started on the same machine, after which the original instance on port 20160 started normally again. Later, the original instance on 20160 was taken offline, and the one on 20170 has been in use since then. I'm not sure whether this is the cause, but the TiKV node was restarted in the meantime.

| username: Johnnes_Xnn | Original post link

I have restarted it, but it still reports this error.

| username: Johnnes_Xnn | Original post link

One of the TiKV nodes was restarted. However, after coming back up, it still keeps reporting this error.

| username: tidb菜鸟一只 | Original post link

Are 4744287 and 1 the two stores you mentioned?
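
In the log excerpt at the top of the thread, 1 and 4744287 are the only store IDs that appear in the target field of the "failed to send extra message" lines. A minimal Python sketch for tallying this from a larger log follows. The regular expression and the name count_target_stores are assumptions for illustration, matched against the log layout shown in this thread.

```python
import re
from collections import Counter

# Matches e.g. target="id: 19665127 store_id: 1" with straight or curly quotes.
TARGET_RE = re.compile(r'target=["“]id: (\d+) store_id: (\d+)["”]')


def count_target_stores(lines):
    """Count 'failed to send extra message' errors per target store ID."""
    counts = Counter()
    for line in lines:
        if "failed to send extra message" not in line:
            continue
        m = TARGET_RE.search(line)
        if m:
            counts[int(m.group(2))] += 1
    return counts
```

Calling count_target_stores on an open log file returns a Counter keyed by store ID, which can then be compared with the store IDs known to PD.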

| username: Kongdom | Original post link

How is the cluster deployed? Please share the tiup cluster display output.

| username: Johnnes_Xnn | Original post link

Yesterday, I restarted the .51 node (172.25.227.51, the to_addr in the logs), and the nodes that couldn't go offline before have now gone offline as well. I checked the logs this morning, and there are no errors anymore. It was all just fumbling around.

| username: wangccsy | Original post link

Blocked and unable to send messages.

| username: dba远航 | Original post link

You need to describe what operations you performed, so that the problem can be judged more accurately.