TiKV Node Failure Unable to Synchronize Snapshot

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV节点故障无法同步快照

| username: TiDBer_zarFUlCo

[TiDB Usage Environment] Production Environment
[TiDB Version] 7.1
[Reproduction Path]
[Encountered Issues: Symptoms and Impact]

  1. Since April 1st, the cluster has frequently experienced brief write anomalies, with TiDB reporting various errors:

  2. On April 9th, after restarting the TiDB cluster, one of the five TiKV nodes had a high CPU load and SSH was inaccessible. According to Grafana logs, this node had been disconnected since March 25th. Since SSH was inaccessible, the virtual machine was forcibly shut down and restarted.

  3. When restarting the TiDB cluster, the TiKV servers were slow to start (every node took a long time); the cluster could only be brought up after adding the --wait-timeout 3600 parameter.

  4. After the cluster restart, other TiKV nodes were slow to synchronize data to the original faulty node. In 22 hours, the disk usage of the original faulty node increased by only 200GB.


  5. Checking the tikv.log logs on each TiKV node revealed that the original faulty node frequently reported errors such as “failed to recv snapshot,” while other TiKV nodes reported “failed to send snap.”

How can I troubleshoot and resolve these issues?

Additional Issue: The write speed of the entire cluster has become very slow. From April 1st to April 9th, it took nearly an hour to reinsert 3000 SQL statements that failed to write due to the fault. In a test cluster with 3 TiKV nodes and mechanical hard drives, it only took about 10 minutes.

Update on April 11th:
After carefully reviewing the logs, I found that the healthy nodes kept failing to send snapshots for one or a few particular Regions:
[ERROR] [snap.rs:546] ["failed to send snap"] [err="Grpc(RpcFinished(Some(RpcStatus { code: 1-CANCELLED, message: "CANCELLED", details: })))"] [region_id=1352555781] [to_addr=192.168.10.11:20160]
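
As a first pass, basic reachability and error-frequency checks from a healthy TiKV host toward the node that fails to receive snapshots can help rule out network issues; a minimal sketch, assuming the address from the error above (the log path is a placeholder to adjust):

```
# Run from a healthy TiKV host toward the node that cannot receive snapshots.
ping -c 5 192.168.10.11                              # basic reachability and latency
nc -vz 192.168.10.11 20160                           # TiKV gRPC port (snapshots are streamed over gRPC)
grep -c "failed to send snap" /path/to/tikv.log      # how often snapshot sends fail on this host
```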

| username: tidb菜鸟一只 | Original post link

Is the network between your cluster nodes okay?

| username: TiDBer_jYQINSnf | Original post link

Is it possible that this peer is no longer supposed to be on the target node? Check the Region to see whether its three peers have already been re-created on other machines.

| username: TiDBer_zarFUlCo | Original post link

All nodes are on the intranet, and the ping between the faulty node and other nodes is around 0.1 milliseconds.

| username: TiDBer_jYQINSnf | Original post link

If you want to increase the speed, just increase the store limit. Specifically, increase the add-peer limit for this node and the remove-peer limit for other nodes.
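
For reference, a hedged sketch of what that looks like in pd-ctl (store IDs and rates below are placeholders; list the real IDs with `store` first):

```
# Commands below run inside pd-ctl, e.g. via: tiup ctl:v7.1.0 pd -u http://<pd-ip>:2379
store                              # list store IDs, capacity, and scores

# Raise the add-peer limit on the recovering store (example ID 4):
store limit 4 64 add-peer

# Raise the remove-peer limit on the healthy stores (example IDs 1, 2, 3):
store limit 1 64 remove-peer
store limit 2 64 remove-peer
store limit 3 64 remove-peer
```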

| username: TiDBer_zarFUlCo | Original post link

There is now another problem: the write speed of the entire cluster has become very slow. From April 1st to April 9th, it took nearly an hour to re-write the 3,000 SQL statements that had failed because of the fault, whereas a test cluster with 3 TiKV nodes on mechanical hard drives took only about 10 minutes.

| username: TiDBer_jYQINSnf | Original post link

If writes are slow, you need to dig further, for example whether they are constantly retrying. Does the faulty node still hold leaders? If nothing else works, evict the leaders from that node and see whether the write speed improves.
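
If leader eviction is needed, a sketch with pd-ctl (store ID 4 is a placeholder for the faulty store):

```
# Inside pd-ctl: move all leaders off the suspect store
scheduler add evict-leader-scheduler 4

# Remove the scheduler once write latency recovers
scheduler remove evict-leader-scheduler-4
```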

| username: tidb菜鸟一只 | Original post link

Is your cluster running low on space? Try increasing both of these parameters.

| username: TiDBer_zarFUlCo | Original post link

The leaders are quite evenly distributed. I wonder if it would be better to directly scale down this node and then scale it back up?

| username: TiDBer_zarFUlCo | Original post link

The values of these two are 0.7 and 0.8 respectively. Each normal TiKV node still has about 780G/4.4T of remaining disk space. If needed, more TiKV nodes can be added.
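
The two parameters are not named in the thread (they were shown in an image), but 0.7 and 0.8 match the defaults of PD's high-space-ratio and low-space-ratio; assuming that is what is meant, they can be checked and adjusted like this:

```
# Inside pd-ctl; the values below are examples only, not recommendations
config show                        # look for high-space-ratio / low-space-ratio
config set high-space-ratio 0.75
config set low-space-ratio 0.85
```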

| username: 这里介绍不了我 | Original post link

Check if the score of this TiKV is significantly lower compared to other nodes.
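
A quick way to compare the scores, assuming pd-ctl access:

```
# Inside pd-ctl: compare region_score / leader_score across stores.
# A store with a much lower region_score is where PD keeps scheduling new peers.
store
```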

| username: TiDBer_jYQINSnf | Original post link

Check the TiDB log. If there are always error reports about this faulty node, you can rebuild it. However, before deleting it, check the PD page to see if there are any missing peers.
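
For the missing-peer check, pd-ctl can list the affected Regions directly:

```
# Inside pd-ctl: Regions that currently lack a replica
region check miss-peer

# Related checks worth running before removing the store
region check pending-peer
region check down-peer
```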

| username: TiDBer_zarFUlCo | Original post link

Yes

| username: tidb菜鸟一只 | Original post link

I suggest finding another machine to expand a TiKV node, and then scale down this faulty TiKV node. Do not expand it in place on this machine for now.
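
A rough sketch of that flow with tiup (cluster name, topology file, and node address are placeholders):

```
# 1. Scale out a new TiKV node on a different machine
tiup cluster scale-out <cluster-name> scale-out.yaml

# 2. Once the new node is up and data is balanced, take the faulty node offline
tiup cluster scale-in <cluster-name> --node 192.168.10.11:20160

# 3. Wait for the store to go from Offline to Tombstone before final cleanup
tiup cluster display <cluster-name>
```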

| username: TiDBer_fbU009vH | Original post link

Have you tried restarting the operating system?

| username: 这里介绍不了我 | Original post link

It is recommended to add a node first, and then take the faulty node offline.

| username: h5n1 | Original post link

Are you using a virtual machine? Have you checked if there are any issues with the virtual machine and the host machine, such as disk performance, CPU, etc.?

| username: 像风一样的男子 | Original post link

Is it possible that the firewall is blocking the communication?

| username: TiDBer_zarFUlCo | Original post link

We have considered this, because several TiKV nodes previously logged “no SSD device” at cluster startup even though the cluster uses NVMe SSDs. For now, since the server maintenance staff are unavailable, we have no way to check disk performance.
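
If a window opens to test the disk, a hedged benchmark sketch with fio (the file path is a placeholder on the TiKV data disk; avoid running it while the node serves production traffic):

```
# Random-write benchmark on the TiKV data disk
fio --name=tikv-randwrite --filename=/data/tikv/fio-test.bin --size=10G \
    --rw=randwrite --bs=32k --ioengine=libaio --iodepth=4 --direct=1 \
    --runtime=60 --time_based --group_reporting
rm -f /data/tikv/fio-test.bin              # clean up the test file
```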

| username: TiDBer_zarFUlCo | Original post link

The firewall is not activated.