How to Replace All PD Nodes in a Cluster

translator_bot · June 23, 2024, 8:21am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何换掉集群中的所有pd节点

| username: xxxxxxxx

Version: TiDB 4.0.13

Requirement: Replace all existing PD nodes in the cluster

For example, the current PD node information of the cluster is as follows:
192.168.1.1 (leader)
192.168.1.2
192.168.1.3

They need to be replaced with:
192.168.1.4
192.168.1.5
192.168.1.6

In practice, I first added two PD nodes, removed one old non-leader node, transferred the PD leader to one of the newly added nodes, removed another old PD node, added one more PD node, and finally removed the last old one.
The specific steps were:

Add two PDs (192.168.1.4, 192.168.1.5)
Remove 192.168.1.2
Switch PD leader from 192.168.1.1 to 192.168.1.4 (member leader transfer)
Remove 192.168.1.3
Add one PD (192.168.1.6)
Remove 192.168.1.1
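
The membership arithmetic behind these steps can be checked with a small simulation. This is only a sketch: it assumes the usual majority-quorum rule of the etcd/Raft layer that PD embeds (quorum = n // 2 + 1), and the `simulate` helper and its refusal to remove the current leader are illustrative choices that mirror the order used above, not PD behavior.

```python
# Sketch: replay the six steps above and track PD member count and quorum.
# Assumption: majority quorum (n // 2 + 1), as in the etcd/Raft layer PD embeds.

def quorum(n: int) -> int:
    """Number of members that must be alive for the group to make progress."""
    return n // 2 + 1


def simulate(initial, leader, steps):
    """Apply (operation, node) steps and record the state after each one."""
    members = list(initial)
    history = []
    for op, node in steps:
        if op == "add":
            members.append(node)
        elif op == "remove":
            # Illustrative rule only: transfer leadership before removing.
            if node == leader:
                raise ValueError(f"transfer the leader away from {node} first")
            members.remove(node)
        elif op == "transfer":
            if node not in members:
                raise ValueError(f"{node} is not a member")
            leader = node
        else:
            raise ValueError(f"unknown operation: {op}")
        size = len(members)
        # tolerated = how many members may fail while quorum still holds
        history.append((op, node, size, size - quorum(size), leader))
    return members, leader, history


steps = [
    ("add", "192.168.1.4"),
    ("add", "192.168.1.5"),
    ("remove", "192.168.1.2"),
    ("transfer", "192.168.1.4"),
    ("remove", "192.168.1.3"),
    ("add", "192.168.1.6"),
    ("remove", "192.168.1.1"),
]
members, leader, history = simulate(
    ["192.168.1.1", "192.168.1.2", "192.168.1.3"], "192.168.1.1", steps
)
for op, node, size, tolerated, current_leader in history:
    print(f"{op:8} {node}  members={size}  tolerated_failures={tolerated}  leader={current_leader}")
```

Under these assumptions the member count never drops below three, so the group can tolerate at least one failed member at every step.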

The procedure completed without errors and business writes were normal, but some nodes briefly showed a Disconnected status, and the following two problems occurred:

The pump component reported Heartbeat-related errors, with the error message as follows:

[Image: screenshot of the pump error message, not preserved]

This error was resolved by restarting the pump.
Running tiup cluster display against the whole cluster was very slow. This was caused by the pump error and was also resolved by restarting the pump.

So I would like to ask what the recommended procedure is for this kind of requirement.

| username: TiDBer_jYQINSnf

I think this procedure is fine. Once a client has established a connection with PD, it keeps using that connection; there is no refresh mechanism. PD followers forward requests to the leader, so the client does not need to resend them to a new address. The client reconnects only when the PD follower it is connected to goes down, which produces a series of error messages.

| username: 长安是只喵

Normally, PD nodes are replaced by scaling out and scaling in. Your procedure doesn't seem to be a problem.

| username: 履霜知冰

Replacing all PD nodes in the cluster can be achieved more reliably by performing multiple scale-out and scale-in operations.
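
One way to picture such a sequence is a helper that prints the commands for a one-in, one-out replacement, with followers replaced first and the leader last. This is a rough sketch, not a verified runbook: the cluster name `my-cluster`, the per-node topology file names, and the `pd-<ip>-<port>` member names are assumptions (check the real names with pd-ctl `member` first), and the exact `tiup ctl` invocation can differ between TiUP versions.

```python
# Sketch: build an ordered list of commands for replacing PD nodes one by one.
# Assumptions (hypothetical, adjust to the real cluster):
#   - cluster name "my-cluster" and topology files "scale-out-<ip>.yaml"
#   - PD member names of the form "pd-<ip>-<port>"
#   - PD client port 2379

def rolling_replace_plan(old_nodes, new_nodes, leader,
                         cluster="my-cluster", version="v4.0.13", port=2379):
    if len(old_nodes) != len(new_nodes):
        raise ValueError("old and new node lists must have the same length")
    if leader not in old_nodes:
        raise ValueError("leader must be one of the old nodes")

    # Replace followers first so the leader only has to move once.
    ordered_old = [n for n in old_nodes if n != leader] + [leader]
    target = new_nodes[0]  # new node that will take over leadership

    plan = []
    for old, new in zip(ordered_old, new_nodes):
        # Add the new node before removing an old one, so the member
        # count never drops below its starting size.
        plan.append(f"tiup cluster scale-out {cluster} scale-out-{new}.yaml")
        if old == leader:
            plan.append(
                f"tiup ctl:{version} pd -u http://{target}:{port} "
                f"member leader transfer pd-{target}-{port}"
            )
        plan.append(f"tiup cluster scale-in {cluster} --node {old}:{port}")
    return plan


plan = rolling_replace_plan(
    ["192.168.1.1", "192.168.1.2", "192.168.1.3"],
    ["192.168.1.4", "192.168.1.5", "192.168.1.6"],
    leader="192.168.1.1",
)
for command in plan:
    print(command)
```

Checking the cluster with tiup cluster display after each command, before moving to the next one, would make it easier to spot a node that has not rejoined cleanly.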
