v7.5.0: Incomplete PD Scale-In Causes the Data of Two Clusters to Get Mixed Up

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: v7.5.0 scale-in pd不彻底,导致两个群集数据混乱

| username: ycong

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] After a PD node is scaled in successfully, the cluster still keeps reconnecting to that PD node.
[Encountered Problem: Problem Phenomenon and Impact] If that offlined node's address is later scaled out into another PD cluster, the two clusters merge and their data gets mixed up.

10.25.248.131:2380 (VMS584328) previously belonged to the tikv-oversea cluster. At 2024/04/08 10:19:26, 10.25.248.131:2380 was scaled in, and tiup cluster display tikv-oversea already showed that 10.25.248.131:2380 had been removed. The server VMS584328 was then taken offline. However, pd.log shows that tikv-oversea kept trying to connect to 10.25.248.131 and reporting connection errors, and this continued until 2024/04/10.

On 2024/04/10, a new server, VMS602679, was brought up and happened to reuse the IP 10.25.248.131. At 2024/04/10 13:47, 10.25.248.131:2380 (VMS602679) was scaled out into the tikv-dal-test cluster, turning tikv-dal-test into a 3+1 setup. At the same time, the 6 PD nodes of tikv-oversea also reconnected to 10.25.248.131:2380, forming a 6+1 setup. As a result, all 10 PD nodes (3+1+6) ended up connected to each other as a single 10-node PD cluster, and the data of the two clusters got mixed up.
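For anyone hitting something similar, here is a minimal sanity-check sketch for seeing what the surviving PDs still believe about their etcd membership after a scale-in (pd-ctl is run through tiup; the version tag and the PD address below are taken from this thread and are only illustrative):

```shell
# List the etcd members that the surviving PDs still know about.
# If the scaled-in node (10.25.248.131) still shows up here, the removal was incomplete.
tiup ctl:v7.5.0 pd -u http://10.25.248.246:2379 member

# A stale member can be removed explicitly by name or id, e.g.:
# tiup ctl:v7.5.0 pd -u http://10.25.248.246:2379 member delete name <member-name>
```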

tikv-oversea
10.109.220.10:2379
10.109.220.9:2379
10.25.248.208:2379
10.25.248.246:2379
10.58.228.76:2379
10.58.228.86:2379

tikv-dal-test
10.58.228.37
10.109.216.124
10.25.248.212

tikv-oversea pd log:
[2024/04/07 18:37:25.977 +08:00] [INFO] [etcdutil.go:309] ["update endpoints"] [num-change=7->8] [last-endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.9:2379,http://10.109.220.10:2379,http://10.25.248.246:2379,http://10.25.248.131:2379,http://10.25.249.164:2379]"] [endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.25.248.246:2379,http://10.109.220.9:2379,http://10.25.248.131:2379,http://10.25.249.164:2379,http://10.25.248.208:2379]"]
[2024/04/08 10:19:26.254 +08:00] [INFO] [cluster.go:422] ["removed member"] [cluster-id=468758231b5b0393] [local-member-id=edff54aa33575887] [removed-remote-peer-id=f67c161a4e9b9cb8] [removed-remote-peer-urls="[http://10.25.248.131:2380]"]
[2024/04/08 10:19:27.958 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]
[2024/04/08 10:19:27.958 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]

[2024/04/09 14:46:33.395 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection timed out\". Reconnecting..."]
[2024/04/09 14:49:25.265 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: i/o timeout\". Reconnecting..."]

[2024/04/10 13:44:05.323 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]
[2024/04/10 13:45:57.545 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]
[2024/04/10 13:46:21.890 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]
[2024/04/10 13:47:58.088 +08:00] [INFO] [etcdutil.go:309] ["update endpoints"] [num-change=6->7] [last-endpoints="[http://10.25.248.246:2379,http://10.58.228.76:2379,http://10.25.248.208:2379,http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.109.220.9:2379]"] [endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.109.220.9:2379,http://10.25.248.208:2379,http://10.25.248.246:2379,http://10.25.248.131:2379]"]
[2024/04/10 13:48:08.085 +08:00] [INFO] [etcdutil.go:309] ["update endpoints"] [num-change=6->7] [last-endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.9:2379,http://10.109.220.10:2379,http://10.25.248.246:2379,http://10.25.248.208:2379]"] [endpoints="[http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.25.248.208:2379,http://10.58.228.76:2379,http://10.109.220.9:2379,http://10.25.248.246:2379,http://10.25.248.131:2379]"]
[2024/04/10 13:48:18.090 +08:00] [INFO] [etcdutil.go:309] ["update endpoints"] [num-change=7->10] [last-endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.109.220.9:2379,http://10.25.248.208:2379,http://10.25.248.246:2379,http://10.25.248.131:2379]"] [endpoints="[http://10.109.220.10:2379,http://10.58.228.76:2379,http://10.109.220.9:2379,http://10.58.228.86:2379,http://10.109.216.124:2379,http://10.25.248.212:2379,http://10.25.248.208:2379,http://10.58.228.37:2379,http://10.25.248.246:2379,http://10.25.248.131:2379]"]

| username: ycong | Original post link

Subsequent testing showed that after a PD node is scaled in, the cluster keeps reconnecting to the removed PD node and printing WARN errors.
Transferring the PD leader does not make the errors go away, but reloading does (see the sketch after the list below).
There are two requirements:

  1. tiup cluster display should only report the PD node as completely deleted once it really has been removed, rather than silently dropping it without even a tombstone-like status; the current behavior can mislead operations staff into thinking the PD node has been fully decommissioned.
  2. The cluster should not keep reconnecting to the removed PD node; this should be achievable once the updated membership metadata has been pushed out.
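A minimal sketch of the workaround mentioned above (the cluster name is the one from this incident; pd-ctl is assumed to be run through tiup and the PD address is only illustrative):

```shell
# Transferring the PD leader alone did not stop the WARN messages:
# tiup ctl:v7.5.0 pd -u http://10.25.248.246:2379 member leader transfer <pd-name>

# Reloading only the PD role regenerates the configuration on the surviving PDs
# and, in this testing, stops them from reconnecting to the removed node.
tiup cluster reload tikv-oversea -R pd
```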
| username: TiDBer_jYQINSnf | Original post link

They cannot actually merge, because different clusters have different cluster_ids; at most it will just produce errors.
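If in doubt, the two clusters' IDs can be compared directly. A sketch, with pd-ctl run through tiup and the addresses taken from this thread:

```shell
# Each PD cluster reports its own cluster id; the two values should differ.
tiup ctl:v7.5.0 pd -u http://10.25.248.246:2379 cluster   # tikv-oversea
tiup ctl:v7.5.0 pd -u http://10.25.248.212:2379 cluster   # tikv-dal-test
```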

| username: h5n1 | Original post link

Is there any more information about the scale-in process?

| username: WalterWj | Original post link

The official documentation currently lacks the corresponding instructions and steps: Update scale-tidb-using-tiup.md by easonn7 · Pull Request #16743 · pingcap/docs-cn
Once this PR is merged, you can follow it.

| username: WalterWj | Original post link

This issue was partially fixed in tiup 1.15 by "update prometheus config when scale in by Yujie-Xie · Pull Request #2387 · pingcap/tiup · GitHub", which covers refreshing the Prometheus configuration during scale-in operations.

Today I carefully reviewed the related logic in tiup and found that, while we do refresh the configuration information when scaling in a PD node, we do not refresh the run scripts across the cluster. We will update the official documentation for this part, and in the longer term we will fix this in tiup itself.
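As a rough way to check whether the removed PD's address is still baked into the tiup-generated files on the surviving PD nodes, here is a sketch; the deploy directory below assumes tiup's default layout and must be adjusted to the actual topology:

```shell
# On each surviving PD node, look for the removed address in the generated
# run script and config files (deploy dir is illustrative).
grep -R "10.25.248.131" /tidb-deploy/pd-2379/scripts/ /tidb-deploy/pd-2379/conf/

# tiup cluster reload <cluster-name> -R pd regenerates these files from the
# current topology and restarts the PD instances with refreshed metadata.
```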

Relevant issues and PRs will be updated here when available.

| username: xiaoqiao | Original post link

:100: :100: :100:

| username: wangkk2024 | Original post link

Awesome :+1:t2::+1:t2::+1:t2::+1:t2::+1:t2::+1:t2::+1:t2::+1:t2::+1:t2::+1:t2::+1:t2:

| username: zhang_2023 | Original post link

Learned.

| username: 小于同学 | Original post link

Is there any more information about the scale-in process?

| username: QH琉璃 | Original post link

Learned.

| username: dba远航 | Original post link

I feel this is a BUG.

| username: TiDBer_QYr0vohO | Original post link

Learned.

| username: tidb狂热爱好者 | Original post link

No one should make this mistake if they are on the same intranet.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.