Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: 集群的pd和tikv突然全部挂掉 (All PD and TiKV nodes in the cluster suddenly went down)

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] No specific operations were performed; the cluster suddenly went down while running.
[Encountered Problem: Phenomenon and Impact] Checking with tiup shows all pd-servers as Down and all tikv-servers as N/A. However, checking the processes with ps shows that the tikv and pd processes are still running and have not exited.
[Attachments: Screenshots/Logs/Monitoring]
No ERROR entries were found in the PD logs, but PD keeps printing:
[2024/01/03 19:28:09.390 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:09.677 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:09.904 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.107 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.377 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.579 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.780 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.982 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.183 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.479 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.681 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.882 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.083 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.285 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.487 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.688 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.889 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:13.091 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
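For anyone reading along, the repeated PD line can be summarized with a small script instead of scanning it by eye. This is only a rough sketch, not a TiDB or PD tool: `parse_line` and `summarize` are hypothetical helper names, and it assumes the bracketed `[key=value]` layout seen in the lines above (it accepts both straight and curly quotes around the message). It pulls out `etcd-leader-id` and `member-id` so the two can be compared. As far as I understand, PD prints "skip campaigning of pd leader and check later" when this member is not the etcd leader, so differing IDs would fit that reading.

```python
import re
from collections import Counter

# Two lines taken from the PD log in this post (ASCII quotes).
SAMPLE = """\
[2024/01/03 19:28:09.390 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:09.677 +08:00] [INFO] [server.go:1417] ["skip campaigning of pd leader and check later"] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
"""

# [key=value] fields; keys are word characters and hyphens.
FIELD = re.compile(r"\[([\w-]+)=([^\]]*)\]")
# ["message"] section; tolerate curly quotes introduced by forum rendering.
MESSAGE = re.compile(r'\[["“]([^"”]*)["”]\]')


def parse_line(line):
    """Return (message, fields) for one log line, or None if it has no message."""
    match = MESSAGE.search(line)
    if match is None:
        return None
    return match.group(1), dict(FIELD.findall(line))


def summarize(text):
    """Count messages and collect distinct (etcd-leader-id, member-id) pairs."""
    messages = Counter()
    id_pairs = set()
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        message, fields = parsed
        messages[message] += 1
        if "etcd-leader-id" in fields and "member-id" in fields:
            id_pairs.add((fields["etcd-leader-id"], fields["member-id"]))
    return messages, id_pairs


messages, id_pairs = summarize(SAMPLE)
for message, count in messages.items():
    print(f"{count} x {message}")
for etcd_leader, member in sorted(id_pairs):
    print(f"etcd leader {etcd_leader} vs this member {member}: same={etcd_leader == member}")
```

To run it against a real log, replace `SAMPLE` with the contents of the PD log file. In the lines posted here the two IDs differ in every entry, so this node never gets to the point of campaigning for PD leader.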
TiKV error logs:
[2024/01/03 19:34:07.095 +08:00] [ERROR] [tso.rs:612] ["BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up"] [err="BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")"]
[2024/01/03 19:34:07.095 +08:00] [ERROR] [tso.rs:617] ["BatchTsoProvider::get_ts, batch used up"] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.205 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.206 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.206 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.278 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.281 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.282 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.297 +08:00] [ERROR] [tso.rs:612] ["BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up"] [err="BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")"]
[2024/01/03 19:34:07.297 +08:00] [ERROR] [tso.rs:617] ["BatchTsoProvider::get_ts, batch used up"] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.305 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.474 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.482 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.485 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.497 +08:00] [ERROR] [tso.rs:612] ["BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up"] [err="BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")"]
[2024/01/03 19:34:07.497 +08:00] [ERROR] [tso.rs:617] ["BatchTsoProvider::get_ts, batch used up"] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.583 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.590 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.590 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.591 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.699 +08:00] [ERROR] [tso.rs:612] ["BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up"] [err="BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")"]
[2024/01/03 19:34:07.699 +08:00] [ERROR] [tso.rs:617] ["BatchTsoProvider::get_ts, batch used up"] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.786 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:07.900 +08:00] [ERROR] [tso.rs:612] ["BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up"] [err="BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")"]
[2024/01/03 19:34:07.900 +08:00] [ERROR] [tso.rs:617] ["BatchTsoProvider::get_ts, batch used up"] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.974 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:08.076 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:08.100 +08:00] [ERROR] [tso.rs:612] ["BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up"] [err="BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")"]
[2024/01/03 19:34:08.101 +08:00] [ERROR] [tso.rs:617] ["BatchTsoProvider::get_ts, batch used up"] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:08.273 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:08.273 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:08.274 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:08.279 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:08.282 +08:00] [ERROR] [util.rs:456] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))"]
[2024/01/03 19:34:08.282 +08:00] [ERROR] [util.rs: