The entire cluster's PD and TiKV suddenly went down

translator_bot · June 21, 2024, 7:54am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群的pd和tikv突然全部挂掉

| username: 月明星稀

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] No specific operations were performed; the cluster suddenly went down while running.
[Encountered Problem: Symptoms and Impact] tiup shows all pd-server instances as down and all tikv-server instances as N/A. However, ps shows that the tikv and pd processes are still running and have not exited.
[Attachments: Screenshots/Logs/Monitoring]
No ERROR entries appear in the PD logs, but the following line is printed repeatedly:
[2024/01/03 19:28:09.390 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:09.677 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:09.904 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.107 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.377 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.579 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.780 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:10.982 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.183 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.479 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.681 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:11.882 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.083 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.285 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.487 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.688 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:12.889 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
[2024/01/03 19:28:13.091 +08:00] [INFO] [server.go:1417] [“skip campaigning of pd leader and check later”] [server-name=pd-1.1.1.111-297] [etcd-leader-id=12167810431719769917] [member-id=13646096349724766299]
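One way to read the repeated PD line above is to pull out its [key=value] fields and compare the two IDs. The following is a minimal Python sketch and not part of the original post: the function name parse_fields is hypothetical, and the only assumption is that fields appear in the bracketed key=value form shown in the log lines above.

```python
import re

# Matches bracketed key=value fields such as [member-id=123].
FIELD_RE = re.compile(r"\[([\w-]+)=([^\]]*)\]")


def parse_fields(line: str) -> dict[str, str]:
    """Return the [key=value] fields of one log line as a dict."""
    return dict(FIELD_RE.findall(line))


# One line copied from the PD log above (message quotes normalized).
sample = (
    "[2024/01/03 19:28:09.390 +08:00] [INFO] [server.go:1417] "
    '["skip campaigning of pd leader and check later"] '
    "[server-name=pd-1.1.1.111-297] "
    "[etcd-leader-id=12167810431719769917] "
    "[member-id=13646096349724766299]"
)

fields = parse_fields(sample)
same_member = fields["etcd-leader-id"] == fields["member-id"]
print(fields["server-name"], "is etcd leader:", same_member)
```

In the lines above the two IDs differ, which suggests that this PD member is not the current etcd leader. That is an interpretation of the log fields, not something the thread confirms.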

TiKV error logs:
[2024/01/03 19:34:07.095 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.095 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.205 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.206 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.206 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.278 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.281 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.282 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.297 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.297 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.305 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.474 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.482 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.485 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.497 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.497 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.583 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.590 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.590 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.591 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.699 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.699 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.786 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:07.900 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:07.900 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:07.974 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.076 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.100 +08:00] [ERROR] [tso.rs:612] [“BatchTsoProvider::get_ts, renew_tso_batch fail on batch used-up”] [err=“BatchRenew("Pd unknown error \"[components/pd_client/src/tso.rs:97]: TimestampRequest channel is closed\"")”]
[2024/01/03 19:34:08.101 +08:00] [ERROR] [tso.rs:617] [“BatchTsoProvider::get_ts, batch used up”] [retries=0] [last_batch_size=3000]
[2024/01/03 19:34:08.273 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.273 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.274 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.279 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.282 +08:00] [ERROR] [util.rs:456] [“request failed, retry”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "not leader", details: }))”]
[2024/01/03 19:34:08.282 +08:00] [ERROR] [util.rs:

| username: FutureDB

Are the components co-located on the same hosts (mixed deployment) or deployed on separate machines?

| username: 江湖故人

You can check if NTP is functioning properly and whether the system-level network and load are stable.
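As a rough illustration of what an NTP client measures, the standard clock-offset estimate uses four timestamps from one request/response exchange. The sketch below is only an illustration and does not come from the thread: the function names and the sample timestamps are hypothetical, and it does not query a real NTP server.

```python
def ntp_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """Estimate the server clock's offset from the client clock, in seconds.

    t0: client send time      (client clock)
    t1: server receive time   (server clock)
    t2: server send time      (server clock)
    t3: client receive time   (client clock)
    """
    return ((t1 - t0) + (t2 - t3)) / 2.0


def round_trip_delay(t0: float, t1: float, t2: float, t3: float) -> float:
    """Network round-trip time, excluding the server's processing time."""
    return (t3 - t0) - (t2 - t1)


# Hypothetical exchange: the server clock is 0.5 s ahead of the client,
# each network direction takes 0.25 s, and the server replies immediately.
stamps = (10.0, 10.75, 10.75, 10.5)
offset = ntp_offset(*stamps)
delay = round_trip_delay(*stamps)
print(f"offset={offset:.3f}s delay={delay:.3f}s")  # offset=0.500s delay=0.500s
```

On a real host, the offset reported by the NTP daemon on each PD and TiKV machine is the number to compare.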

| username: 烂番薯0

It looks like it’s a mixed deployment, right?

| username: dba远航

TiKV hit an error while requesting TSO from PD. Check the cause; could it be a time synchronization issue?

| username: xfworld

It looks like a network issue.

| username: 像风一样的男子

Is it possible that a firewall or something similar is blocking communication between TiKV and PD?
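A quick way to test this from a TiKV host is to attempt a plain TCP connection to each PD endpoint. The following is a minimal Python sketch under stated assumptions: the function name can_connect and the endpoint list are hypothetical, and 2379 is PD's default client port, which may differ in this cluster.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Hypothetical endpoints; replace with the cluster's real PD addresses.
PD_ENDPOINTS = [("127.0.0.1", 2379)]

if __name__ == "__main__":
    for host, port in PD_ENDPOINTS:
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state}")
```

A successful connection only shows that the port is open from this host; it does not prove that PD has an elected leader or that gRPC requests succeed.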

| username: zhanggame1

Check the monitoring: is the host running out of memory?

| username: xingzhenxiang

Try reloading the PD node.

| username: 路在何chu

It looks like a communication failure.

| username: 路在何chu

What does the TiKV log print?

| username: 月明星稀

It is a mixed deployment.

| username: 月明星稀

Everything is in the same data center, so in theory there shouldn’t be any network issues.

| username: 月明星稀

How large a clock difference would cause this issue?

| username: 路在何chu

My guess is that the hosts ran out of resources, which caused the crash.

| username: 月明星稀

If resources are insufficient, will there be error logs?

| username: 连连看db

There may be some. For example, when memory is insufficient, TiDB might print an error such as “fatal error: runtime: out of memory” in its logs, and when disk space runs out you might see “no space left on device”. Real incidents are usually more complex, though, and most of the time an error is only printed when the affected code path actually runs.
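The two messages quoted in this reply can be searched for across component logs. The following is a minimal Python sketch, assuming a simple case-insensitive substring match is enough; the function name find_resource_errors and the sample lines are hypothetical.

```python
from typing import Iterable

# Substrings quoted in the reply above; extend as needed.
RESOURCE_PATTERNS = (
    "fatal error: runtime: out of memory",
    "no space left on device",
)


def find_resource_errors(lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line_number, pattern) for every line containing a known pattern."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        lowered = line.lower()  # match regardless of capitalization
        for pattern in RESOURCE_PATTERNS:
            if pattern in lowered:
                hits.append((lineno, pattern))
    return hits


# Hypothetical log excerpt for illustration.
sample_log = [
    "[INFO] starting",
    "fatal error: runtime: out of memory",
    "write failed: No space left on device",
]
print(find_resource_errors(sample_log))
# [(2, 'fatal error: runtime: out of memory'), (3, 'no space left on device')]
```

Because the function accepts any iterable of lines, an open log file can be passed in directly. An empty result does not rule out resource exhaustion: a process killed by the kernel may leave nothing in its own log.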

| username: TIDB-Learner

In a production environment, when an issue appears suddenly, check the network (for example, whether someone changed it manually) and disk capacity (for example, whether logs have filled the root directory).
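The disk-capacity check can be scripted with the Python standard library. The following is a minimal sketch: the function names and the 90% threshold are hypothetical choices, and the path should be changed to whichever mount holds the data and log directories.

```python
import shutil


def used_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str = "/", threshold: float = 90.0) -> bool:
    """Return True when usage is at or above threshold percent (assumed limit)."""
    return used_percent(path) >= threshold


print(f"/ is {used_percent('/'):.1f}% used")
```

Running it on every PD and TiKV host gives a quick view of whether any root or data partition is close to full.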