PD leader in a virtual machine deployment goes down abnormally due to IO issues

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 虚拟机部署 pd leader 因为io 问题异常down (PD leader in a VM deployment goes down abnormally due to IO issues)

| username: 林夕一指

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] V6.5
[Reproduction Path] The TiDB cluster is deployed on virtual servers created as VMs. A disk IO stress test is run against the physical disk hosting the PD leader node, pushing that server's IO utilization close to 100%. Soon afterward, `tiup cluster display` shows all PD components as Down, and the database service becomes unavailable.
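For reference, IO saturation of the kind described above is commonly produced with `fio`. This is a sketch only; the target path below is a placeholder for a file on the disk backing the PD data directory, and the exact parameters are assumptions, not what the original poster ran:

```shell
# Saturate the disk backing the PD data directory with 4K random writes.
# WARNING: run only against a test cluster -- this will starve PD/etcd of IO.
# /path/on/pd-disk/fio-test is a placeholder; point it at the PD data disk.
fio --name=pd-disk-stress \
    --filename=/path/on/pd-disk/fio-test \
    --rw=randwrite --bs=4k --direct=1 \
    --iodepth=32 --numjobs=4 \
    --size=1G --runtime=300 --time_based
```

With `--direct=1` the writes bypass the page cache, so the device itself is saturated, which is what makes etcd's fsync-heavy writes on the same disk time out.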
[Encountered Problem: Symptoms and Impact] PD components are down, and TiDB is inaccessible.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

[Screenshot: PD Grafana panel, etcd section]

During the early stage of the stress test, pd.log shows a large number of “redirect but server is not leader” errors. Does this error indicate a problem with the PD leader election?

| username: Soysauce520 | Original post link

The leader was probably switched away. If all the PDs are completely down, there is no metadata to serve.

| username: Daniel-W | Original post link

“redirect but server is not leader” usually occurs when a request reaches a PD node that the client believes is the leader, but that node has already stepped down to a follower.

| username: tidb菜鸟一只 | Original post link

That doesn’t sound right. If the IO test only affects the machine hosting the PD leader, PD should switch the leader to another server. Are all the PD nodes on the same physical disk?
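The leader-switch behavior being discussed can be inspected with pd-ctl. This is a sketch; the PD address `127.0.0.1:2379` is a placeholder, and the version tag should match your cluster (V6.5 here):

```shell
# List all PD cluster members (the output marks the current leader).
# The PD endpoint address is a placeholder; substitute your own.
tiup ctl:v6.5.0 pd -u http://127.0.0.1:2379 member

# Show only the current leader.
tiup ctl:v6.5.0 pd -u http://127.0.0.1:2379 member leader show

# Check the health of all PD members.
tiup ctl:v6.5.0 pd -u http://127.0.0.1:2379 health
```

If only the leader's disk were affected, you would expect `member leader show` to report a different node after the election; if all PDs share one underlying disk, no healthy candidate remains and the election cannot complete.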

| username: 像风一样的男子 | Original post link

My guess is that your PD services are all deployed on different virtual machines on the same physical host, and the IO stress test overwhelmed that host. Check with `tiup cluster display` to confirm whether all the PDs are down.

| username: 林夕一指 | Original post link

Yes, `tiup cluster display` shows everything down.

| username: 林夕一指 | Original post link

It’s virtualized with VMs, and there’s just one vSAN underneath. To be honest, I don’t even know which physical disk it’s using.

| username: 像风一样的男子 | Original post link

Try restarting the cluster.

| username: 林夕一指 | Original post link

I recovered the cluster by restarting it. I’m just looking into the specific reason. :upside_down_face:

| username: 像风一样的男子 | Original post link

You are using VMs. If the virtual machines are on the same physical server, they share its disk. When you stress test one virtual server, the entire physical server comes under pressure; if it goes down, everything goes down together.

| username: zhanggame1 | Original post link

It is recommended to put TiDB components on separate disks when deploying; for virtual machines, physical disk passthrough can be used.

| username: tidb菜鸟一只 | Original post link

In this case, your three PDs are effectively useless: if one goes down, the others go down with it, so no leader switch is possible and the cluster certainly can’t serve requests.