How to Troubleshoot Service Anomalies Caused by Restarting All TiKV Nodes in the Cluster?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 重启集群所有tikv节点导致服务异常,如何排查原因? (Restarting all TiKV nodes in the cluster caused service anomalies; how do we find the cause?)

| username: DBRE

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.0.1
[Encountered Issue: Phenomenon and Impact]
Because of a bug that causes TiKV to panic after 795 days of uptime, detailed in 一天多的时间里集群中4个tikv有3个重启了,tikv FATAL报错index out of bounds: the len is 6 but the index is 6 - TiDB 的问答社区 ("Within a little over a day, 3 of the 4 TiKV nodes in the cluster restarted; TiKV FATAL error: index out of bounds: the len is 6 but the index is 6"), we proactively restarted the TiKV nodes in the cluster with the command: tiup cluster restart xxxx -R tikv. According to the log output, TiKV restarted successfully, as shown in the image below:


However, after the restart, the business QPS dropped significantly, and there were timeout reports from the business side. Monitoring showed an increase in slow SQL queries and a higher 999th percentile latency.

After the restart, running tiup cluster display xx showed multiple TiKV nodes with a status of Disconnected, and the specific nodes that were Disconnected varied with each display command.

Subsequently, we repeatedly stopped the entire cluster with tiup cluster stop xx and started it again with tiup cluster start xx, which restored the cluster to normal. What could be the reason for the abnormal behavior of TiKV after the restart? How should we troubleshoot this?

Cluster Characteristics: This cluster has a high write load (12k write QPS, 1.5k read QPS), a large data volume (25TB), and 12 TiKV nodes (physical machines).

[Attachments: Screenshots/Logs/Monitoring]

Logs from the TiKV nodes show many errors:
[2023/08/22 18:03:09.380 +08:00] [ERROR] [util.rs:416] ["request failed, retry"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 4-DEADLINE_EXCEEDED, details: Some(\"Deadline Exceeded\") }))"]
[2023/08/22 18:03:09.380 +08:00] [ERROR] [util.rs:416] ["request failed, retry"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 4-DEADLINE_EXCEEDED, details: Some(\"Deadline Exceeded\") }))"]
[2023/08/22 18:03:09.380 +08:00] [ERROR] [util.rs:416] ["request failed, retry"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 4-DEADLINE_EXCEEDED, details: Some(\"Deadline Exceeded\") }))"]
[2023/08/22 18:03:09.380 +08:00] [ERROR] [util.rs:416] ["request failed, retry"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 4-DEADLINE_EXCEEDED, details: Some(\"Deadline Exceeded\") }))"]
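
For reference, a minimal sketch of checks that could help trace these PD gRPC DEADLINE_EXCEEDED errors; the cluster name, PD address, and log path below are placeholders, not values from the original post:

# How PD currently sees each TiKV store (state, leader/region counts)
tiup ctl:v5.0.1 pd -u http://<pd-host>:2379 store

# Run display a few times to see whether the set of Disconnected nodes keeps changing
tiup cluster display <cluster-name>

# Count PD gRPC timeouts around the restart window in one TiKV node's log
grep -c "DEADLINE_EXCEEDED" <tikv-deploy-dir>/log/tikv.log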

| username: tidb菜鸟一只 | Original post link

Restarting TiKV nodes individually is better than restarting them all at once like that… I suggest using -N to restart the TiKV nodes one by one…
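
For example, a node-by-node restart could look roughly like this (the cluster name and host:port values are placeholders; check the store status between steps):

# Restart one TiKV instance
tiup cluster restart <cluster-name> -N <tikv-ip-1>:20160
# Wait until that store shows Up again before moving to the next node
tiup cluster display <cluster-name>
# Then repeat for <tikv-ip-2>:20160, <tikv-ip-3>:20160, ...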

| username: 像风一样的男子 | Original post link

TiKV keeps three replicas of each Region. When two of the three replicas are unavailable at the same time, that Region loses its Raft majority and its data becomes inaccessible.

| username: DBRE | Original post link

Yes, I had heard that restart does a rolling restart node by node, but when I ran it I found that it stops all the nodes first and then starts them all. I was careless; I should have restarted one node at a time. I still need more practice. :joy: :joy: :joy: :joy:

| username: DBRE | Original post link

Well, in that case, what would the TiKV logs show? I want to confirm why the cluster traffic dropped and why some nodes stayed in the Disconnected state after the restart.

| username: tidb菜鸟一只 | Original post link

Reload is rolling… Restart is a full stop-then-start…
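
Roughly, with the cluster name as a placeholder:

# reload: rolling, restarts the TiKV instances one at a time
tiup cluster reload <cluster-name> -R tikv

# restart: stops the targeted TiKV instances and then starts them again (as observed above)
tiup cluster restart <cluster-name> -R tikv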

| username: DBRE | Original post link

Learned :sob: :sob: :sob: