TiDB Restart Failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb重启失败

| username: 最强王者

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] tidb 6.5.2
[Reproduction Path] Operations performed that led to the issue
A TiKV node crashed and failed to restart; the error indicates that it cannot communicate with PD.
[Encountered Issue: Symptoms and Impact]
[Resource Configuration] Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

[2024/04/07 13:11:53.334 +08:00] [ERROR] [pd.rs:663] [“failed to send split infos to pd worker”] [err=“channel has been closed”]
[2024/04/07 13:11:53.359 +08:00] [ERROR] [pd.rs:690] [“failed to send min resolved ts to pd worker”] [err=“channel has been closed”]
[2024/04/07 13:11:54.041 +08:00] [INFO] [util.rs:260] [“update pd client”] [via=] [leader=http://10.10.40.24:2379] [prev_via=] [prev_leader=http://10.10.40.24:2379]
[2024/04/07 13:11:54.041 +08:00] [WARN] [util.rs:268] [“PD client refresh region heartbeat”] [takes=2498]
[2024/04/07 13:11:54.041 +08:00] [INFO] [util.rs:394] [“trying to update PD client done”] [spend=783.247373497s]
[2024/04/07 13:11:54.041 +08:00] [INFO] [util.rs:763] [“connected to PD member”] [endpoints=http://10.10.40.24:2379]
[2024/04/07 13:11:54.041 +08:00] [INFO] [util.rs:220] [“heartbeat sender and receiver are stale, refreshing …”]
[2024/04/07 13:11:54.041 +08:00] [INFO] [util.rs:233] [“buckets sender and receiver are stale, refreshing …”]
[2024/04/07 13:11:54.044 +08:00] [ERROR] [client.rs:652] [“failed to send heartbeat”] [err_code=KV:Pd:Grpc] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 1-CANCELLED, message: "CANCELLED", details: })))”]
[2024/04/07 13:11:54.085 +08:00] [INFO] [tso.rs:162] [“TSO worker terminated”] [receiver_cause=None] [sender_cause=None]
[2024/04/07 13:11:54.356 +08:00] [ERROR] [pd.rs:663] [“failed to send split infos to pd worker”] [err=“channel has been closed”]
[2024/04/07 13:11:54.378 +08:00] [ERROR] [pd.rs:690] [“failed to send min resolved ts to pd worker”] [err=“channel has been closed”]
[2024/04/07 13:11:55.397 +08:00] [ERROR] [pd.rs:663] [“failed to send split infos to pd worker”] [err=“channel has been closed”]
[2024/04/07 13:11:55.422 +08:00] [ERROR] [pd.rs:690] [“failed to send min resolved ts to pd worker”] [err=“channel has been closed”]
[2024/04/07 13:11:55.730 +08:00] [INFO] [] [“subchannel 0x7f035aa4b400 {address=ipv4:10.0.61.198:20161, args=grpc.client_channel_factory=0x7f036469c270, grpc.default_authority=10.0.61.198:20161, grpc.default_compression_algorithm=0, grpc.gprc_min_message_size_to_compress=4096, grpc.gzip_compression_level=2, grpc.http2.lookahead_bytes=2097152, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f0364638950, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_reconnect_backoff_ms=5000, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f03646bf110, grpc.server_uri=dns:///10.0.61.198:20161, random id=595}: failed to connect to channel, retrying”]
[2024/04/07 13:11:56.432 +08:00] [ERROR] [pd.rs:663] [“failed to send split infos to pd worker”] [err=“channel has been closed”]
[2024/04/07 13:11:56.459 +08:00] [ERROR] [pd.rs:690] [“failed to send min resolved ts to pd worker”] [err=“channel has been closed”]
[2024/04/07 13:11:56.489 +08:00] [INFO] [util.rs:260] [“update pd client”] [via=] [leader=http://10.10.40.24:2379] [prev_via=] [prev_leader=http://10.10.40.24:2379]
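
To see which errors dominate an excerpt like the one above, the lines can be tallied by source location and message. This is only a rough sketch: the field layout is inferred from the pasted lines, the names `LINE_RE` and `tally` are made up for illustration, and both straight and typographic quotes are accepted because the forum appears to have converted them.

```python
import re
from collections import Counter

# Leading fields of a log line as pasted above (layout inferred, not official):
# [timestamp] [LEVEL] [source] ["message"] ...
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]*)\] '
    r'\[["“](?P<msg>[^"”]*)["”]\]'
)

def tally(lines, level="ERROR"):
    """Count log messages of the given level, keyed by (source, message)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") == level:
            counts[(m.group("src"), m.group("msg"))] += 1
    return counts
```

Calling `tally(open("tikv.log", encoding="utf-8"))` (the file name is hypothetical) and printing `most_common()` would show, for this excerpt, that the two "channel has been closed" messages from pd.rs repeat about once per second.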

| username: Jolyne | Original post link

Is the network connected? It seems that it can’t reach PD.

| username: zhanggame1 | Original post link

First, use tiup cluster display to check the status of each component in the cluster.

| username: Billmay表妹 | Original post link

How many TiKV nodes are there?

| username: Billmay表妹 | Original post link

[Resource Allocation] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page.
Take a look at this.

| username: 最强王者 | Original post link

The network is connected.

| username: DBAER | Original post link

It’s a bit strange that it can’t access PD. How about manually accessing http://10.10.40.24:2379?
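
One way to do that manual check from the TiKV host is a small script against PD's HTTP API. The sketch below assumes the health endpoint is `/pd/api/v1/health` and that it returns a JSON array of members with `name` and `health` fields; the function names are hypothetical, and the address is the PD leader shown in the logs.

```python
import json
import urllib.request

def pd_health_url(endpoint):
    """Build the health URL for a PD endpoint such as http://10.10.40.24:2379."""
    return endpoint.rstrip("/") + "/pd/api/v1/health"

def unhealthy_members(payload):
    """Return the names of members whose 'health' field is not true.

    Assumes the payload is a JSON array of objects with 'name' and 'health'.
    """
    return [m.get("name", "?") for m in json.loads(payload) if not m.get("health")]

def check_pd(endpoint, timeout=3.0):
    """Fetch the health endpoint and return the unhealthy member names.

    Raises urllib.error.URLError (an OSError) if PD cannot be reached at all.
    """
    with urllib.request.urlopen(pd_health_url(endpoint), timeout=timeout) as resp:
        return unhealthy_members(resp.read().decode("utf-8"))
```

Running `check_pd("http://10.10.40.24:2379")` on the TiKV host would either raise a connection error, which points at the network, or return a list that is empty when every PD member reports healthy.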

| username: 最强王者 | Original post link

[Image in the original post; not available.]

| username: Jolyne | Original post link

Post the cluster topology for us to see.

| username: TiDBer_QYr0vohO | Original post link

It seems that the PD server is not accessible.

| username: tony5413 | Original post link

This can’t be accessed, please check the network.

| username: 友利奈绪 | Original post link

It seems like the PD server is not accessible.

| username: Kamner | Original post link

It looks like the heartbeat is not working.

Check the cluster status: tiup cluster display XXX
On the TiKV server, test the PD port: telnet 10.10.40.24 2379
Also, check the firewall and iptables.

| username: 最强王者 | Original post link

telnet 10.10.40.24 2379 connects successfully.
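
The same port check can be scripted so it is easy to repeat from every node. This is a minimal sketch using only the standard library; `can_connect` is a made-up helper name. A successful TCP connection only shows that the port is reachable, not that gRPC heartbeats succeed, and the retry message in the log targets 10.0.61.198:20161, so that address may be worth testing as well.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("10.10.40.24", 2379)` mirrors the telnet test, and `can_connect("10.0.61.198", 20161)` checks the peer address from the log.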

| username: 最强王者 | Original post link

The problem was resolved after restarting the machine.

| username: GoodTiDBer | Original post link

Scale down, then scale up again.

| username: dba远航 | Original post link

PD is abnormal.

| username: mono | Original post link

Enable the magic mode. Restart the server! :smile:

| username: Swan | Original post link

Has the issue been resolved? What was the cause?

| username: zhaokede | Original post link

Restarting solves many problems.