The entire cluster's PD and TiKV suddenly went down

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群的pd和tikv突然全部挂掉

| username: 月明星稀

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] No specific operations were performed; the cluster went down suddenly while running.
[Encountered Problem: Phenomenon and Impact] Checking with tiup shows that all pd-servers are Down and all tikv-servers are N/A. However, checking the processes with ps shows that the tikv and pd processes are still running and have not exited.
[Attachments: Screenshots/Logs/Monitoring]
No ERROR entries were found in the PD logs, but PD keeps printing:
[2024/01/03 19:28:09.390 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:09.677 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:09.904 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.107 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.377 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.579 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.780 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.982 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.183 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.479 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.681 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.882 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.083 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.285 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.487 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.688 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.889 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:13.091 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
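
The repeated PD line carries two IDs worth comparing. The following is a minimal Python sketch (the helper names `parse_fields` and `sees_itself_as_etcd_leader` are made up for illustration) that pulls the `key=value` fields out of one such line and checks whether `etcd-leader-id` equals `member-id`. In the pasted line they differ, which suggests this member does not consider itself the etcd leader; that is a reading of the log, not a confirmed diagnosis.

```python
import re

# Matches bracketed key=value fields such as [member-id=123].
FIELD_RE = re.compile(r'\[([\w-]+)=([^\]]*)\]')


def parse_fields(line):
    """Return the key=value fields of one log line as a dict of strings."""
    return dict(FIELD_RE.findall(line))


def sees_itself_as_etcd_leader(line):
    """True if the line reports the same ID for etcd leader and this member."""
    fields = parse_fields(line)
    return fields["etcd-leader-id"] == fields["member-id"]


# One line copied from the PD log in this post.
sample = (
    '[2024/01/03 19:28:09.390 +08:00] [INFO] [server.go:1417] '
    '[“skip campaigning of pd leader and check later”] '
    '[server-name=pd-1.1.1.111-297] '
    '[etcd-leader-id=12167810431719769917] '
    '[member-id=13646096349724766299]'
)

print(parse_fields(sample))
print(sees_itself_as_etcd_leader(sample))  # False: the two IDs differ
```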

TiKV error logs:
[2024/01/03 19:34:07.095 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.095 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.205 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.206 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.206 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.278 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.281 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.282 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.297 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.297 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.305 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.474 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.482 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.485 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.497 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.497 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.583 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.590 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.590 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.591 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.699 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.699 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.786 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.900 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.900 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.974 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.076 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.100 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:08.101 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:08.273 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.273 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.274 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.279 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.282 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.282 +08:00] [ERROR] [util.rs:

| username: FutureDB

Are the components deployed in a mixed manner or independently?

| username: 江湖故人

You can check if NTP is functioning properly and whether the system-level network and load are stable.

| username: 烂番薯0

It looks like it’s a mixed deployment, right?

| username: dba远航

TiKV hit an error when it contacted PD to obtain a TSO. Check the cause. Could it be a time synchronization issue?
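
To see which errors dominate the pasted TiKV log, the lines can be tallied by message. This is a minimal Python sketch, assuming the `[time] [LEVEL] [file:line] [message] ...` layout seen in the post; `tally_messages` is a hypothetical helper name, and the pattern accepts both straight and curly quotes because the pasted lines contain curly ones. Truncated lines are skipped rather than guessed at.

```python
import re
from collections import Counter

# [timestamp] [LEVEL] [file:line] ["message"] ... with straight or curly quotes.
LINE_RE = re.compile(r'^\[[^\]]+\] \[(\w+)\] \[[^\]]+\] \[["“](.*?)["”]\]')


def tally_messages(lines):
    """Count log lines per (level, message); unparseable lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Lines copied from the TiKV log in this post (the last one is truncated there).
sample = [
    '[2024/01/03 19:34:07.095 +08:00] [ERROR] [tso.rs:617] '
    '[“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]',
    '[2024/01/03 19:34:07.205 +08:00] [ERROR] [util.rs:456] '
    '[“request failed, retry”] [err_code=KV:Pd:Grpc] '
    '[err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]',
    '[2024/01/03 19:34:07.206 +08:00] [ERROR] [util.rs:456] '
    '[“request failed, retry”] [err_code=KV:Pd:Grpc] '
    '[err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]',
    '[2024/01/03 19:34:08.282 +08:00] [ERROR] [util.rs:',
]

for (level, message), count in tally_messages(sample).most_common():
    print(count, level, message)
```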

| username: xfworld

It looks like a network issue.

| username: 像风一样的男子

Is it possible that a firewall or something similar is blocking the communication between KV and PD?
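
One way to test this from each node is a plain TCP connection attempt. The sketch below is illustrative only: `can_connect` and `report` are hypothetical helper names, and the hosts passed to them are yours to fill in. Ports 2379 (PD client) and 20160 (TiKV) are the defaults; a cluster with customized ports needs different values. A successful connection only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (dropped packets) and DNS errors.
        return False


def report(endpoints):
    """Map 'host:port' to reachability for each (host, port) pair."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in endpoints}
```

For example, running `report([("<pd-host>", 2379), ("<tikv-host>", 20160)])` on a TiKV node and again on a PD node shows whether either direction is blocked.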

| username: zhanggame1

Check the monitoring. Is memory insufficient?

| username: xingzhenxiang

Try reloading the PD node.

| username: 路在何chu

It looks like a communication failure.

| username: 路在何chu

What does the TiKV log print?

| username: 月明星稀

It is a mixed deployment.

| username: 月明星稀

Everything is in the same data center, so in theory there shouldn’t be any network issues.

| username: 月明星稀

How large a clock difference would cause this issue?

| username: 路在何chu

My guess is that resources were insufficient, which caused the crash.

| username: 月明星稀

If resources are insufficient, would there be error logs?

| username: 连连看db

There may be some. For example, if memory is insufficient, TiDB might print an error such as “fatal error: runtime: out of memory” in its logs; if disk space is insufficient, you might see “no space left on device”. However, real cases are usually more complex, and most of the time the errors are printed when the affected code paths execute.
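
A quick way to look for the two messages quoted above is to scan the component logs for them. This is a minimal Python sketch; `find_resource_errors` is a hypothetical helper name, and the two search strings are only the ones mentioned in this reply, not a complete list of resource-related errors.

```python
# Substrings quoted in the reply above; extend as needed.
PATTERNS = ("out of memory", "no space left on device")


def find_resource_errors(lines):
    """Return (line_number, pattern, line) for every line containing a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for pattern in PATTERNS:
            if pattern in lowered:
                hits.append((number, pattern, line.rstrip("\n")))
    return hits


sample = [
    "fatal error: runtime: out of memory\n",
    "some unrelated line\n",
    "write failed: No space left on device\n",
]
for number, pattern, line in find_resource_errors(sample):
    print(number, pattern, line)
```

To scan a real file, pass an open file object, for example `find_resource_errors(open(path, encoding="utf-8", errors="replace"))`, where `path` points at a log file on the affected host.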

| username: TIDB-Learner

When an issue suddenly appears in a production environment, check the network (for example, whether someone made manual changes) and the disk capacity (for example, whether the root directory has been filled up by logs).
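
The disk-capacity part of this check can be sketched in Python with the standard library. The helper names and the 90% threshold below are arbitrary choices for illustration, not TiDB recommendations.

```python
import shutil


def usage_percent(path="/"):
    """Return the used share of the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def nearly_full(path="/", threshold=90.0):
    """True if the filesystem containing path is at or above the threshold."""
    return usage_percent(path) >= threshold


print(f"/ is {usage_percent('/'):.1f}% used")
```

Running the same check against each component's data and log directories covers the case where those live on a different filesystem from the root directory.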