The entire cluster suddenly went down and cannot be started

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群突然全部down掉,start起不来

| username: 月明星稀

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5
[Reproduction Path] Restarted the cluster; it fails to start
[Encountered Problem: Issue Phenomenon and Impact] All nodes are down and cannot be brought up
Error logs are as follows:
[2024/06/21 13:37:37.232 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]
[2024/06/21 13:37:38.233 +08:00] [INFO] [client.go:168] ["server starts to synchronize with leader"] [server=pd-3.3.3.23-2379] [leader=pd-3.3.3.39-2379] [request-index=0]
[2024/06/21 13:37:38.233 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]
[2024/06/21 13:37:39.234 +08:00] [INFO] [client.go:168] ["server starts to synchronize with leader"] [server=pd-3.3.3.23-2379] [leader=pd-3.3.3.39-2379] [request-index=0]
[2024/06/21 13:37:39.235 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]
[2024/06/21 13:37:39.432 +08:00] [INFO] [trace.go:152] ["trace[2085091103] linearizableReadLoop"] [detail="{readStateIndex:63; appliedIndex:64; }"] [duration=118.150015ms] [start=2024/06/21 13:37:39.314 +08:00] [end=2024/06/21 13:37:39.432 +08:00] [steps="["trace[2085091103] 'read index received' (duration: 118.1459ms)","trace[2085091103] 'applied index is now lower than readState.Index' (duration: 3.371µs)"]"]
[2024/06/21 13:37:39.432 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=118.294517ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/tidb/br-stream/info/" range_end:"/tidb/br-stream/info0" revision:37 "] [response="range_response_count:0 size:4"]
[2024/06/21 13:37:39.432 +08:00] [INFO] [trace.go:152] ["trace[1691299179] range"] [detail="{range_begin:/tidb/br-stream/info/; range_end:/tidb/br-stream/info0; response_count:0; response_revision:37; }"] [duration=118.405893ms] [start=2024/06/21 13:37:39.314 +08:00] [end=2024/06/21 13:37:39.432 +08:00] [steps="["trace[1691299179] 'agreement among raft nodes before linearized reading' (duration: 118.270316ms)"]"]
[2024/06/21 13:37:40.177 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=17315373243108170272] [retry-timeout=500ms]
[2024/06/21 13:37:40.235 +08:00] [INFO] [client.go:168] ["server starts to synchronize with leader"] [server=pd-3.3.3.23-2379] [leader=pd-3.3.3.39-2379] [request-index=0]
[2024/06/21 13:37:40.236 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]
[2024/06/21 13:37:40.508 +08:00] [INFO] [trace.go:152] ["trace[1345035771] linearizableReadLoop"] [detail="{readStateIndex:65; appliedIndex:65; }"] [duration=831.915351ms] [start=2024/06/21 13:37:39.676 +08:00] [end=2024/06/21 13:37:40.508 +08:00] [steps="["trace[1345035771] 'read index received' (duration: 831.910612ms)","trace[1345035771] 'applied index is now lower than readState.Index' (duration: 3.848µs)"]"]
[2024/06/21 13:37:40.508 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=832.109277ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/tidb/br-stream/info/" range_end:"/tidb/br-stream/info0" revision:37 "] [response="range_response_count:0 size:4"]
[2024/06/21 13:37:40.508 +08:00] [INFO] [trace.go:152] ["trace[1562924158] range"] [detail="{range_begin:/tidb/br-stream/info/; range_end:/tidb/br-stream/info0; response_count:0; response_revision:37; }"] [duration=832.249636ms] [start=2024/06/21 13:37:39.676 +08:00] [end=2024/06/21 13:37:40.508 +08:00] [steps="["trace[1562924158] 'agreement among raft nodes before linearized reading' (duration: 832.139673ms)"]"]
[2024/06/21 13:37:40.509 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=500.925554ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/pd/7382826522070064087/config" "] [response="range_response_count:1 size:3670"]
[2024/06/21 13:37:40.509 +08:00] [INFO] [trace.go:152] ["trace[2025518191] range"] [detail="{range_begin:/pd/7382826522070064087/config; range_end:; response_count:1; response_revision:37; }"] [duration=501.007594ms] [start=2024/06/21 13:37:40.008 +08:00] [end=2024/06/21 13:37:40.509 +08:00] [steps="["trace[2025518191] 'agreement among raft nodes before linearized reading' (duration: 500.906239ms)"]"]
[2024/06/21 13:37:41.081 +08:00] [INFO] [trace.go:152] ["trace[494172490] linearizableReadLoop"] [detail="{readStateIndex:68; appliedIndex:68; }"] [duration=169.486939ms] [start=2024/06/21 13:37:40.911 +08:00] [end=2024/06/21 13:37:41.081 +08:00] [steps="["trace[494172490] 'read index received' (duration: 169.484512ms)","trace[494172490] 'applied index is now lower than readState.Index' (duration: 3.868µs)"]"]
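
To see at a glance which messages dominate an excerpt like the one above, the entries can be tallied by level and message. This is only a minimal sketch: it assumes each line follows the `[time] [LEVEL] [file:line] ["message"] ...` layout seen in the excerpt, and the names `LOG_LINE`, `summarize`, and `SAMPLE` are hypothetical helpers, not part of any TiDB tool. The sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the leading "[time] [LEVEL] [file:line] ["message"]" part of a line.
# Both straight and curly double quotes are accepted around the message,
# because forum software often rewrites the quotes in pasted logs.
LOG_LINE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] '
    r'\[["“”](?P<msg>[^"“”]*)["“”]\]'
)


def summarize(lines):
    """Count log entries per (level, message); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        # Collapse per-request trace IDs so identical trace messages group together.
        msg = re.sub(r'trace\[\d+\]', 'trace[N]', match.group('msg'))
        counts[(match.group('level'), msg)] += 1
    return counts


SAMPLE = [
    '[2024/06/21 13:37:37.232 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]',
    '[2024/06/21 13:37:38.233 +08:00] [INFO] [client.go:168] ["server starts to synchronize with leader"] [server=pd-3.3.3.23-2379] [leader=pd-3.3.3.39-2379] [request-index=0]',
    '[2024/06/21 13:37:38.233 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]',
    '[2024/06/21 13:37:40.177 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=17315373243108170272] [retry-timeout=500ms]',
]

if __name__ == "__main__":
    # Print the most frequent (level, message) pairs first.
    for (level, msg), n in summarize(SAMPLE).most_common():
        print(f"{n:3d}  {level:5s}  {msg}")
```

Run against the full excerpt, a tally like this makes it easy to see that the only ERROR entries are the repeated "region sync with leader meet error" lines, alongside slow etcd read warnings.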

| username: 这里介绍不了我

Refer to this TiDB Community column: "TiDB Cluster Database Disaster Recovery Manual".

| username: 我是吉米哥

First check whether the error points to something that can be repaired so that the cluster starts normally. Treat a full recovery as a last resort.

| username: 月明星稀

Do you know why it suddenly crashed?

| username: 这里介绍不了我

Run tiup cluster display xxx (replace xxx with your cluster name) to check the current status of each node in the cluster.

| username: zhanggame1

Check PD first; the errors in the log above all come from PD.
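
One way to check PD is to ask each PD node directly over its HTTP API whether it responds. The following is a rough sketch under several assumptions: the PD client port serves plain HTTP (no TLS), the `/pd/api/v1/health` and `/pd/api/v1/members` paths are available on this version, and the node names in the log (for example `pd-3.3.3.39-2379`) correspond to the endpoint `3.3.3.39:2379`. The helper names `pd_api_url` and `probe` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def pd_api_url(endpoint, path):
    """Build a PD HTTP API URL from a host:port endpoint (plain HTTP assumed)."""
    return f"http://{endpoint}/pd/api/v1/{path.lstrip('/')}"


def probe(endpoint, path="health", timeout=2.0):
    """Query one PD endpoint and return (ok, decoded_json_or_error_text)."""
    url = pd_api_url(endpoint, path)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # Connection refused, timeout, HTTP error, or a non-JSON body:
        # report the failure instead of raising, so every node can be checked.
        return False, str(exc)
```

For example, calling `probe("3.3.3.39:2379")` and `probe("3.3.3.23:2379", "members")` on a machine that can reach the PD nodes would show whether each one answers at all. A node that refuses connections or times out is a reasonable place to start reading its PD log.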