This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群状态很不稳定 (The cluster status is very unstable)

After installing the cluster, running tiup cluster display tidb-cluster often shows the PD status as Down and the TiKV status as N/A. Has anyone else seen this, and how did you solve it? It is not only a display issue; connections to the cluster are genuinely unstable.

Please provide the corresponding hardware configuration and network configuration information.

:flushed: This is beyond unstable. How is the server resource usage?

Is there a problem with the network?

I think the resource usage is fine because I just finished the installation and haven’t run any tasks yet, so the usage in all aspects is low.

It seems that this error is related to network issues.

A 3-node mixed deployment; each node has 16 cores, 64 GB RAM, and a 500 GB hard drive. Each server runs TiKV, PD, and TiDB. The network is an intranet with 90Mb/s downstream and 75Mb/s upstream.

Heartbeats are failing to send. Is anything, such as a firewall, intercepting traffic between the three servers?

Which deployment method?

Is the bandwidth insufficient? Did the actual instance crash?

No, everything is on the intranet. There are no issues with ping and telnet.

Yes, the instance crashed, but the status returned to normal after a while.

3-node mixed deployment, using tiup for offline deployment.

Your network bandwidth is not sufficient. A 10Gbps network is normally recommended, and a 1Gbps network can be acceptable for testing. A 100Mbps link won't work.

The download speed can reach nearly a hundred, so that should count as a gigabit network, right?
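The disagreement here looks like a units question: whether the quoted figures are megabits per second (Mb/s) or megabytes per second (MB/s). A small sketch of the conversion, reading the 90 figure from the earlier post both ways; which reading is correct is not stated in the thread, so both are shown as assumptions:

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbit_per_s / 8


def mbyte_to_mbit(mbyte_per_s: float) -> float:
    """Convert megabytes per second to megabits per second."""
    return mbyte_per_s * 8


# The downstream figure quoted in the thread, interpreted both ways.
quoted = 90.0
if_megabits = mbit_to_mbyte(quoted)   # 90 Mb/s is 11.25 MB/s (a ~100 Mbps class link)
if_megabytes = mbyte_to_mbit(quoted)  # 90 MB/s is 720 Mb/s (close to a 1 Gbps link)

print(f"If 90 means Mb/s: {if_megabits} MB/s")
print(f"If 90 means MB/s: {if_megabytes} Mb/s")
```

If file transfers between the nodes really reach close to a hundred megabytes per second, the link is in gigabit territory; if the figure is in megabits, it is roughly a 100Mbps link.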

When the bandwidth is saturated, components can easily lose contact with each other. Once heartbeats exceed the maximum timeout and repeated retries fail, you get the status shown in your uploaded image.

Try testing in a different environment; the current resources are insufficient…

Check if the network is stable and if there is any packet loss.
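A single ping or telnet succeeding does not rule out intermittent problems, so it can help to probe repeatedly. Below is a minimal sketch that opens many TCP connections to a peer and reports the failure ratio and connect latency. The function name probe and its parameters are made up for illustration. It measures connection failures rather than packet loss directly, so a tool such as ping or mtr run over a longer period is still the more direct check.

```python
import socket
import statistics
import time


def probe(host: str, port: int, attempts: int = 20, timeout: float = 1.0) -> dict:
    """Open `attempts` TCP connections and summarize failures and connect latency."""
    latencies_ms = []
    failures = 0
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            # The connection is closed again as soon as it is established.
            with socket.create_connection((host, port), timeout=timeout):
                latencies_ms.append((time.perf_counter() - start) * 1000)
        except OSError:
            # Covers refused connections, unreachable hosts, and timeouts.
            failures += 1
    return {
        "attempts": attempts,
        "failures": failures,
        "failure_ratio": failures / attempts if attempts else 0.0,
        "median_ms": statistics.median(latencies_ms) if latencies_ms else None,
        "max_ms": max(latencies_ms) if latencies_ms else None,
    }


# Example call (hypothetical address; 2379 is the default PD client port
# and may differ in your topology):
#   print(probe("10.0.0.2", 2379))
```

Running this from each node against the other two, on the PD, TiKV, and TiDB ports from your topology file, should show whether failures or latency spikes line up with the times the status flips to Down.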

Check the etcd panels in the PD dashboard in Grafana.

The network might be unstable.

Check the network, and especially check if the time is synchronized between all nodes.
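For the time check, one rough approach is to read the clock on every node at about the same moment and compare the readings. The sketch below only does the comparison; how the readings are collected (for example by running date +%s.%N on each node over ssh) and the node names are assumptions, and the result includes whatever delay there was between collecting the readings.

```python
from typing import Dict, Tuple


def max_clock_skew(node_times: Dict[str, float]) -> Tuple[float, str, str]:
    """Return (skew_seconds, earliest_node, latest_node) for the given clock readings."""
    if len(node_times) < 2:
        raise ValueError("need readings from at least two nodes")
    earliest = min(node_times, key=node_times.get)
    latest = max(node_times, key=node_times.get)
    return node_times[latest] - node_times[earliest], earliest, latest


# Hypothetical readings in Unix seconds, one per node.
readings = {"node1": 1700000000.00, "node2": 1700000000.35, "node3": 1699999999.90}
skew, behind, ahead = max_clock_skew(readings)
print(f"largest skew: {skew:.2f}s ({behind} is behind {ahead})")
```

If the skew is more than a fraction of a second, check that NTP or chrony is running and synchronized on all three servers.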