Scaling Out TiDB PD Nodes: Port Conflict and Overlap with an Existing Cluster

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB扩容PD节点,与已有集群相同端口冲突覆盖

| username: TiDBer_LWHNAerh

[TiDB Usage Environment] Production Environment
[TiDB Version] v4.0.9
[Reproduction Path] Operations performed that led to the issue
Cluster A was scaled out with 3 PD nodes on port 2379, without noticing that another cluster B had already deployed PD nodes on port 2379 on the same 3 machines, causing an overlap. Now both clusters are sharing these 3 PD nodes.
[Encountered Issue: Symptoms and Impact]
Cluster B can no longer add PD nodes. How can this be resolved?
[Resource Configuration] Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: jaybing926 | Original post link

An overlapping deployment shouldn't succeed, should it? Wouldn't it report a port 2379 conflict error?

| username: TiDBer_LWHNAerh | Original post link

No conflict detection was performed; the scale-out went through without reporting the port conflict.

| username: h5n1 | Original post link

What is the current status of the two clusters? The likely final solution is to deploy a new cluster containing only PD, point the PD endpoints in cluster B's TiDB and TiKV run_xx.sh scripts to the new PD cluster, and then recover the metadata with pd-recover.
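A minimal sketch of that flow, assuming a tiup-managed v4.x deployment; the topology file name is hypothetical and the exact flag names in the run scripts should be verified against your own run_xx.sh files:

```shell
# Sketch only; names and paths are illustrative.

# 1. Deploy and start a new cluster that contains only PD nodes.
tiup cluster deploy pd-only-cluster v4.0.9 pd-only.yaml
tiup cluster start pd-only-cluster

# 2. On each of cluster B's hosts, repoint the run scripts at the new PDs.
#    In run_tikv.sh the PD endpoints are typically passed via --pd, e.g.
#      --pd "new-pd1:2379,new-pd2:2379,new-pd3:2379"
#    In run_tidb.sh they are typically passed via --path, e.g.
#      --path "new-pd1:2379,new-pd2:2379,new-pd3:2379"

# 3. Recover the cluster metadata on the new PDs with pd-recover
#    (see the next post for recording cluster-id and alloc-id).
```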

| username: TiDBer_LWHNAerh | Original post link

Do you have detailed steps?

| username: h5n1 | Original post link

Here is a rough procedure from a recovery I did before; it only covers the recovery itself, and I'm not sure about the current status and mutual impact of your two clusters. pd-recover requires you to record the current cluster's alloc-id and cluster-id first.
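For reference, a hedged sketch of how the cluster-id and alloc-id are usually obtained from the PD logs before running pd-recover (log paths and placeholders are illustrative; the log wording can differ by version):

```shell
# Get the cluster ID from an existing PD log.
grep "init cluster id" /path/to/pd/log/pd*.log

# Find the largest allocated ID, then pick an alloc-id safely larger than it.
grep "idAllocator allocates a new id" /path/to/pd/log/pd*.log \
  | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -rn | head -n 1

# Recover the new (empty) PD cluster with the recorded values.
pd-recover -endpoints http://<new-pd-host>:2379 \
  -cluster-id <cluster-id> -alloc-id <value-larger-than-the-max-seen>
```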

Some reference documents:

| username: h5n1 | Original post link

Use tiup cluster display to check the status of the two clusters.
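For example (cluster names are placeholders):

```shell
tiup cluster list
tiup cluster display <cluster-A-name>
tiup cluster display <cluster-B-name>
```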

| username: TiDBer_LWHNAerh | Original post link

The nodes on port 2379 are the PD nodes.

| username: h5n1 | Original post link

Check which PD each of the two clusters’ TiKV and TiDB are connected to using the command ps -ef | grep tidb.
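For example, run this on each host; the flag to look for is --pd in TiKV and --path in TiDB for a typical v4.x deployment, but confirm against your own run scripts:

```shell
# Check which PD endpoints the running processes were started with.
ps -ef | grep tikv-server | grep -v grep    # inspect the --pd argument
ps -ef | grep tidb-server | grep -v grep    # inspect the --path argument
```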

| username: TiDBer_LWHNAerh | Original post link

Currently, each has its own previous PD cluster, no issues here.

| username: h5n1 | Original post link

So none of cluster A's processes point to B's PD? Then edit A's ~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml and delete B's PD entries, so tiup no longer tracks them. If all the TiDB and TiKV processes are indeed still using their original PDs, that should be fine.
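A hedged sketch of that edit on cluster A's control machine (cluster name, host, and field layout are illustrative; back the file up first):

```shell
# Back up A's metadata before editing.
cp ~/.tiup/storage/cluster/clusters/<cluster-A-name>/meta.yaml \
   ~/.tiup/storage/cluster/clusters/<cluster-A-name>/meta.yaml.bak

# Delete the pd_servers entries that actually belong to cluster B,
# e.g. blocks of the form:
#   - host: <shared-host>
#     client_port: 2379
#     peer_port: 2380
#     deploy_dir: ...
#     data_dir: ...
vi ~/.tiup/storage/cluster/clusters/<cluster-A-name>/meta.yaml

# Confirm tiup no longer lists B's PD nodes under cluster A.
tiup cluster display <cluster-A-name>
```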

| username: TiDBer_LWHNAerh | Original post link

Currently both clusters are running normally. However, after cluster A was scaled out onto these 3 nodes, tiup display shows 3 extra PD nodes under A. Now cluster B wants to scale out one PD node, but it cannot join the cluster.

| username: TiDBer_LWHNAerh | Original post link

After discovering the port conflict, I ran scale-in --node --force. After that, the new PD node could not be added back to cluster B, while PD nodes can still be scaled out successfully on cluster A.

| username: h5n1 | Original post link

The node you scaled in from cluster A is actually B's PD, right? When you scale it in from A, tiup really processes cluster B's PD data as well, which is presumably the node that is now down. Scale that node in on B and then scale it out again. Since a scale-in actually touches the PD files on disk, for cluster A just edit meta.yaml so tiup stops tracking it instead.
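A sketch of the scale-in/scale-out on cluster B (cluster name, node address, and topology file are placeholders):

```shell
# Scale the downed PD node in on cluster B (this really removes its files).
tiup cluster scale-in <cluster-B-name> --node <host>:2379

# Then scale it back out with a small topology file, e.g. scale-out-pd.yaml:
#   pd_servers:
#     - host: <host>
#       client_port: 2379
#       peer_port: 2380
tiup cluster scale-out <cluster-B-name> scale-out-pd.yaml
```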

| username: TiDBer_LWHNAerh | Original post link

Yes, but the downed node cannot be scaled back out on B. Cluster B currently has 2 surviving PD nodes, and the downed node cannot be added back.
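Before retrying the scale-out, it may help to check what B's surviving PDs think the membership is; a hedged check via pd-ctl (endpoint is a placeholder):

```shell
# List the PD members as seen by one of B's surviving PD nodes.
tiup ctl:v4.0.9 pd -u http://<surviving-pd-host>:2379 member

# A stale entry for the removed node here could explain why a new PD
# with the same address cannot rejoin.
```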

| username: h5n1 | Original post link

Scale the node in on cluster B first.

| username: TiDBer_LWHNAerh | Original post link

After the scale-in, the directories were all cleaned up, but the subsequent scale-out could not rejoin the node to the cluster.

| username: TiDBer_LWHNAerh | Original post link

A PD node was previously force scaled in, leaving 2 PD nodes. Now, when trying to scale back out to 3 nodes, it can't be done.

| username: h5n1 | Original post link

For A: how many PDs did it have before the scale-out? After scaling out the three conflicting ones, did you scale in only one of them?
For B: although it looks normal, what it displays should actually still include the PD nodes shared with A.