The entire cluster becomes unavailable due to PD Server network latency or packet loss

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD Server网络延迟或丢包,整个集群无法可用

| username: Hacker_Yv76YjBL

[TiDB Usage Environment] POC
[TiDB Version] 7.1.5
[Reproduction Path]
date && ./blade create network delay --time 3000 --offset 500 --interface eth0 --local-port 2379
Fri Jun 7 11:35:56 CST 2024
{"code":200,"success":true,"result":"81a33dbacbe665a6"}
[Encountered Problem: Phenomenon and Impact]
When I simulate network delay on the PD leader node, the entire cluster becomes unavailable, and tiup commands can no longer be executed. How should this be handled when the PD leader experiences network delay or packet loss?
[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page


[Attachments: Screenshots/Logs/Monitoring]
PD Server Logs:
[2024/06/07 11:38:07.294 +08:00] [INFO] [operator_controller.go:681] ["send schedule command"] [region-id=1159170] [step="add learner peer 41444669 on store 3941762"] [source="active push"]
[2024/06/07 11:38:07.794 +08:00] [INFO] [operator_controller.go:681] ["send schedule command"] [region-id=2215918] [step="add learner peer 41444617 on store 3941762"] [source="active push"]
[2024/06/07 11:38:07.794 +08:00] [INFO] [operator_controller.go:681] ["send schedule command"] [region-id=2005331] [step="add learner peer 41444642 on store 3941763"] [source="active push"]
[2024/06/07 11:38:07.794 +08:00] [INFO] [operator_controller.go:681] ["send schedule command"] [region-id=889315] [step="add learner peer 41444654 on store 3941763"] [source="active push"]
[2024/06/07 11:38:07.794 +08:00] [INFO] [operator_controller.go:681] ["send schedule command"] [region-id=2202636] [step="add learner peer 41444664 on store 3941763"] [source="active push"]
[2024/06/07 11:38:07.794 +08:00] [INFO] [operator_controller.go:681] ["send schedule command"] [region-id=233868] [step="add learner peer 41444657 on store 3941762"] [source="active push"]
[2024/06/07 11:38:07.794 +08:00] [INFO] [operator_controller.go:681] ["send schedule command"] [region-id=165246] [step="add learner peer 41444652 on store 3941763"] [source="active push"]
[2024/06/07 11:38:08.294 +08:00] [INFO] [operator_controller.go:681] ["send schedule command"] [region-id=2136399] [step="add learner peer 41444667 on store 3941762"] [source="active push"]
[2024/06/07 11:38:08.294 +08:00] [INFO] [operator_controller.go:681] ["send schedule command"] [region-id=898673] [step="use joint consensus, promote learner peer 41444668 on store 3941763 to voter, demote voter peer 898676 on store 2 to learner"] [source="active push"]
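
One way to summarise log lines like these is to tally the scheduling steps per store. The following is a minimal Python sketch; the regular expressions and the `tally_steps` helper are illustrative assumptions and are not part of PD or its tooling. The pattern accepts both ASCII and typographic quotes, since forum rendering can alter them.

```python
import re
from collections import Counter

# Extracts the step field from a PD "send schedule command" log line.
STEP_RE = re.compile(r'\[step=["“](?P<step>[^"”]*)["”]\]')
# Extracts each peer action and its target store from the step text.
ACTION_RE = re.compile(
    r'(?P<action>add learner peer|promote learner peer|demote voter peer)'
    r' \d+ on store (?P<store>\d+)'
)

def tally_steps(lines):
    """Count (action, store) pairs found in the step field of each log line."""
    counts = Counter()
    for line in lines:
        match = STEP_RE.search(line)
        if not match:
            continue  # not a schedule-command line
        for action, store in ACTION_RE.findall(match.group("step")):
            counts[(action, store)] += 1
    return counts

# Two lines copied from the log above.
sample = [
    '[2024/06/07 11:38:07.294 +08:00] [INFO] [operator_controller.go:681] '
    '["send schedule command"] [region-id=1159170] '
    '[step="add learner peer 41444669 on store 3941762"] [source="active push"]',
    '[2024/06/07 11:38:08.294 +08:00] [INFO] [operator_controller.go:681] '
    '["send schedule command"] [region-id=898673] '
    '[step="use joint consensus, promote learner peer 41444668 on store 3941763 '
    'to voter, demote voter peer 898676 on store 2 to learner"] [source="active push"]',
]

for (action, store), n in sorted(tally_steps(sample).items()):
    print(action, "store", store, "x", n)
```

Run over the full PD log, a tally like this shows which stores the operators are targeting while the delay is active.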

| username: tidb菜鸟一只

In a 3-node PD setup, if the leader PD has a network failure and the other nodes cannot reach it for a certain period of time, the remaining PDs will elect a new leader, right?
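
That expectation can be made concrete with a small sketch. The code below is a simplified model and not PD's or etcd's actual implementation: followers start an election only if no leader heartbeat arrives within the election timeout, and a new leader needs a majority of members. The function names and the 3000 ms timeout are assumptions chosen for illustration. Note that the fault in the original post delays only traffic on local port 2379; if peer heartbeats travel over a different port (2380 is PD's default peer port), followers may never observe a missed heartbeat, in which case this model predicts no election.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1

def followers_start_election(heartbeat_gap_ms: int, election_timeout_ms: int) -> bool:
    """Followers campaign only after the leader has been silent for the whole timeout."""
    return heartbeat_gap_ms >= election_timeout_ms

def can_elect_new_leader(members: int, reachable_followers: int,
                         heartbeat_gap_ms: int,
                         election_timeout_ms: int = 3000) -> bool:
    """True if followers notice the silence and can form a majority without the old leader."""
    return (followers_start_election(heartbeat_gap_ms, election_timeout_ms)
            and reachable_followers >= quorum(members))

# Hard failure: the leader is silent and the two followers still see each other.
print(can_elect_new_leader(members=3, reachable_followers=2, heartbeat_gap_ms=5000))

# Client traffic is delayed, but peer heartbeats still arrive on time.
print(can_elect_new_leader(members=3, reachable_followers=2, heartbeat_gap_ms=100))
```

In this model the first case allows a new leader and the second does not, which is one possible reading of why a delay that spares the heartbeat path does not trigger failover.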

| username: vincentLi

All distributed systems have this problem, and there is currently no effective solution. Looked at the other way, backing software high availability with hardware high availability has become a necessary way to avoid such situations. For example, it is best to place the PD nodes in the same availability zone and to check whether that zone can provide network redundancy through dual network cards and dual networks.

| username: TiDBer_H5NdJb5Q

Good question. Can we choose between CP and AP?

| username: xfworld

It is best to simulate an outright PD failure. A half-dead PD is probably the hardest case for node switching and could lead to cluster unavailability.
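
The difference between a dead node and a half-dead one can be illustrated with a simple probe classifier. This is a hypothetical sketch, not how PD, TiDB, or tiup detect failures; `classify_probe`, `naive_is_alive`, and the 1000 ms latency budget are assumptions. It shows why a delayed node is awkward: it still answers, so a check that only asks whether a reply arrived treats it as alive.

```python
from typing import Optional

def classify_probe(latency_ms: Optional[float], budget_ms: float = 1000.0) -> str:
    """Classify one health probe; latency_ms is None when no reply arrived at all."""
    if latency_ms is None:
        return "down"        # hard failure: nothing came back
    if latency_ms > budget_ms:
        return "half-dead"   # a reply arrived, but too slowly to be useful
    return "healthy"

def naive_is_alive(latency_ms: Optional[float]) -> bool:
    """A liveness check that ignores latency cannot tell half-dead from healthy."""
    return latency_ms is not None

# 3000 ms matches the delay injected in the original post.
for latency in (20.0, 3000.0, None):
    print(latency, classify_probe(latency), naive_is_alive(latency))
```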

| username: zhaokede

You still need to fix the network issue.

| username: YuchongXU

Fix the packet loss issue.