The cluster status is very unstable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群状态很不稳定

| username: TiDBer_uI8QIp1t

After installing the cluster, running tiup cluster display tidb-cluster often shows the PD status as Down and the TiKV status as N/A. Has anyone else seen this, and how did you solve it? It is not just a display problem; connections to the cluster are genuinely unstable.



| username: xfworld | Original post link

Please provide the corresponding hardware configuration and network configuration information.

| username: Kongdom | Original post link

:flushed: This is beyond unstable. How is the server resource usage?

| username: tidb菜鸟一只 | Original post link

Is there a problem with the network?

| username: TiDBer_uI8QIp1t | Original post link

I think the resource usage is fine because I just finished the installation and haven’t run any tasks yet, so the usage in all aspects is low.

| username: WalterWj | Original post link

It seems that this error is related to network issues.

| username: TiDBer_uI8QIp1t | Original post link

3-node mixed deployment; each node has 16 cores, 64GB RAM, and a 500GB hard drive. Each server runs TiKV, PD, and TiDB. The network is an intranet with 90Mb/s downstream and 75Mb/s upstream.

| username: 像风一样的男子 | Original post link

The heartbeat failed to send. Is there any network interception between the three servers?

| username: 这里介绍不了我 | Original post link

Which deployment method?

| username: 路在何chu | Original post link

Is the bandwidth insufficient? Did the actual instance crash?

| username: TiDBer_uI8QIp1t | Original post link

No, everything is on the intranet. There are no issues with ping and telnet.

| username: TiDBer_uI8QIp1t | Original post link

Yes, it crashed, but the status returned to normal after a while.

| username: TiDBer_uI8QIp1t | Original post link

3-node mixed deployment, using tiup for offline deployment.

| username: zhanggame1 | Original post link

Your network bandwidth is not sufficient. Normally a 10Gbps network is required, and 1Gbps can be acceptable for testing, but your 100Mbps network won’t work.

| username: TiDBer_uI8QIp1t | Original post link

The download speed can reach nearly a hundred, so it should count as a gigabit network, right?
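
Whether "nearly a hundred" means a gigabit link depends on the unit. The following is a minimal sketch of the arithmetic, assuming the figure of 90 quoted earlier in the thread is either megabits per second (Mb/s) or megabytes per second (MB/s); the function names `to_mbps` and `link_class` are illustrative only.

```python
# Convert a measured transfer rate to megabits per second so it can be
# compared with common NIC ratings (100 Mbps, 1 Gbps, 10 Gbps).

def to_mbps(value, unit):
    """Return the rate in megabits per second.

    unit is "MB/s" (megabytes per second) or "Mb/s" (megabits per second).
    """
    if unit == "MB/s":
        return value * 8  # 1 byte = 8 bits
    if unit == "Mb/s":
        return value
    raise ValueError(f"unknown unit: {unit}")


def link_class(mbps):
    """Name the smallest common Ethernet tier that could carry this rate."""
    tiers = (("100 Mbps", 100), ("1 Gbps", 1000), ("10 Gbps", 10000))
    for name, capacity in tiers:
        if mbps <= capacity:
            return name
    return "above 10 Gbps"


if __name__ == "__main__":
    for unit in ("Mb/s", "MB/s"):
        rate = to_mbps(90, unit)
        print(f"90 {unit} = {rate:.0f} Mbps, fits a {link_class(rate)} link")
```

If the measurement was 90 MB/s, the link is close to saturating a 1 Gbps NIC; if it was 90 Mb/s, it is a 100 Mbps link, which matches the concern raised above.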

| username: xfworld | Original post link

When the bandwidth is fully utilized, it is easy for components to lose connection with each other. After exceeding the maximum heartbeat time and retrying multiple times without success, the status shown in your uploaded image will appear.

Try testing in a different environment; the current resources are insufficient…

| username: zhaokede | Original post link

Check if the network is stable and if there is any packet loss.
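
A single successful ping or telnet does not rule out intermittent loss, so repeated sampling is more informative. Below is a rough sketch that opens many TCP connections to a component port and counts failures. It measures failed TCP connects rather than ICMP packet loss, and the host addresses in the comment are placeholders; 2379 and 20160 are the default PD client port and TiKV port, which may differ in this deployment.

```python
import socket
import time


def probe(host, port, attempts=20, timeout=1.0, pause=0.0):
    """Open `attempts` TCP connections; return (failures, latencies_ms)."""
    failures = 0
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                latencies.append((time.perf_counter() - start) * 1000)
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            failures += 1
        time.sleep(pause)
    return failures, latencies


def summarize(failures, latencies):
    """Return (failure percentage, worst latency in ms or None)."""
    total = failures + len(latencies)
    loss_pct = 100.0 * failures / total if total else 0.0
    worst = max(latencies) if latencies else None
    return loss_pct, worst


# Example (placeholder addresses; run from each node against the others):
#   for host in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
#       for port in (2379, 20160):
#           print(host, port, summarize(*probe(host, port, pause=0.5)))
```

Any non-zero failure percentage, or a worst-case latency far above the typical value on an idle intranet, would point toward the network rather than the TiDB components.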

| username: yiduoyunQ | Original post link

Check the monitoring metrics in the etcd section of the PD dashboard in Grafana.

| username: 小于同学 | Original post link

The network might be unstable.

| username: dba远航 | Original post link

Check the network, and especially check if the time is synchronized between all nodes.
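
As a rough illustration of the time check, the sketch below compares clock readings collected from each node at roughly the same moment and reports the largest skew. How the readings are gathered (for example, running date +%s on each host over SSH) is left out, the node names and values are made up, and the 0.5 second threshold is an assumption rather than a documented TiDB limit.

```python
def max_clock_skew(samples):
    """Return (skew_seconds, earliest_node, latest_node).

    samples maps a node name to its clock reading in epoch seconds.
    """
    if len(samples) < 2:
        raise ValueError("need readings from at least two nodes")
    earliest = min(samples, key=samples.get)
    latest = max(samples, key=samples.get)
    return samples[latest] - samples[earliest], earliest, latest


def is_synchronized(samples, threshold=0.5):
    """True if the largest skew is within `threshold` seconds (assumed value)."""
    skew, _, _ = max_clock_skew(samples)
    return skew <= threshold


if __name__ == "__main__":
    # Hypothetical readings; replace with values collected from the real nodes.
    readings = {"node1": 1700000000.0, "node2": 1700000000.25, "node3": 1700000003.0}
    skew, slow, fast = max_clock_skew(readings)
    print(f"max skew {skew:.2f}s between {slow} and {fast}")
    print("synchronized:", is_synchronized(readings))
```

Because the readings are not taken at exactly the same instant, small differences are expected; only a skew well above the collection delay suggests an NTP problem.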