Unable to Start TiDB Cluster After Changing PD IP Address

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB集群修改PD的IP地址启动不了 (TiDB cluster cannot start after changing the PD IP addresses)

| username: TiDBer_kfkPs53L

I changed the IP addresses of the PD nodes in the cluster, but after the modification the cluster cannot be started. I performed the operation exactly according to 【SOP Series 12】TiUP Modify Cluster IP, Port, and Directory, but it still doesn’t work.

Environment: my cluster has 3 PD nodes, 192.168.1.182, 192.168.1.183, and 192.168.1.184, which I changed to 192.168.1.192, 192.168.1.193, and 192.168.1.194.

  1. Stop the cluster: tiup cluster stop tidb-test

  2. Modify meta.yaml under /home/tidb/.tiup/storage/cluster/clusters/tidb-test:

[tidb@tidb180 tidb-test]$ ls -ll meta.yaml 
-rw-r--r-- 1 tidb tidb 3065 Aug 19 21:44 meta.yaml
[tidb@tidb180 tidb-test]$ 

  3. Change the IP addresses on the hosts and restart the network service.

  4. Start the cluster, which reports an error:

[tidb@tidb180 tidb-test]$ tiup cluster:v1.10.3 reload tidb-test -R pd --force
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster reload tidb-test -R pd --force
Will reload the cluster tidb-test with restart policy is true, nodes: , roles: pd.
Do you want to continue? [y/N]:(default=N) Y
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.186
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.187
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.192
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.180
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.181
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.180
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.185
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.194
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.180
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.193
+ [Parallel] - UserSSH: user=tidb, host=192.168.1.180
+ [ Serial ] - UpdateTopology: cluster=tidb-test
{"level":"warn","ts":"2022-08-19T21:53:12.896+0800","logger":"etcd-client","caller":"v3@v3.5.4/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005bce00/192.168.1.192:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.1.194:2379: connect: connection refused\""}

Error: context deadline exceeded

Verbose debug logs have been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2022-08-19-21-53-13.log
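For reference, the meta.yaml edit in step 2 amounts to replacing every old PD address with its new one. A minimal sketch, assuming GNU sed and using a demo file to stand in for the real meta.yaml under /home/tidb/.tiup/storage/cluster/clusters/tidb-test/:

```shell
# Demo stand-in for meta.yaml; on a real control machine you would run the
# sed command against the actual file in the cluster's metadata directory.
printf 'pd_servers:\n- host: 192.168.1.182\n- host: 192.168.1.183\n- host: 192.168.1.184\n' > meta.yaml

cp meta.yaml meta.yaml.bak   # keep a backup before editing

# Replace each old PD IP with its new address
sed -i \
  -e 's/192\.168\.1\.182/192.168.1.192/g' \
  -e 's/192\.168\.1\.183/192.168.1.193/g' \
  -e 's/192\.168\.1\.184/192.168.1.194/g' \
  meta.yaml

cat meta.yaml   # every host: line should now show the new address
```

Note that meta.yaml is only TiUP's view of the topology; the start scripts deployed on each node carry their own copies of the addresses.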
| username: xfworld | Original post link

Check whether the network is reachable, and then whether the PD services on those nodes are still running. If they are not, you can simply restart them:

[tidb@localhost ~]$ tiup cluster reload tidb-test
[tidb@localhost ~]$ tiup cluster restart tidb-test

| username: TiDBer_kfkPs53L | Original post link

Hello!

  1. The network is definitely connected.
  2. The PD service is stopped because I stopped the cluster.
  3. [tidb@tidb180 tidb-test]$ tiup cluster reload tidb-test
    tiup is checking updates for component cluster …
    Starting component cluster: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster reload tidb-test
    Will reload the cluster tidb-test with restart policy is true, nodes: , roles: .
    Do you want to continue? [y/N]:(default=N) Y
  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.186
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.187
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.193
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.180
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.181
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.185
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.180
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.192
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.180
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.180
  • [Parallel] - UserSSH: user=tidb, host=192.168.1.194
  • [ Serial ] - UpdateTopology: cluster=tidb-test
    {"level":"warn","ts":"2022-08-19T22:10:27.732+0800","logger":"etcd-client","caller":"v3@v3.5.4/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003be700/192.168.1.192:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.1.192:2379: connect: connection refused\""}

Error: context deadline exceeded
  4. When executing tiup cluster restart tidb-test, it gets stuck at the following point (everything else succeeds):
Starting component pd
Starting instance 192.168.1.192:2379
Starting instance 192.168.1.193:2379
Starting instance 192.168.1.194:2379

| username: TiDBer_kfkPs53L | Original post link

The network itself is connected. Are there any other settings that need to be modified for PD besides meta.yaml (which has already been changed)?

[tidb@tidb182 ~]$ ping 192.168.1.194
PING 192.168.1.194 (192.168.1.194) 56(84) bytes of data.
64 bytes from 192.168.1.194: icmp_seq=1 ttl=63 time=0.682 ms
64 bytes from 192.168.1.194: icmp_seq=2 ttl=63 time=0.641 ms
^C
--- 192.168.1.194 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.641/0.661/0.682/0.032 ms
[tidb@tidb182 ~]$ ping 192.168.1.193
PING 192.168.1.193 (192.168.1.193) 56(84) bytes of data.
64 bytes from 192.168.1.193: icmp_seq=1 ttl=63 time=0.556 ms
^C
--- 192.168.1.193 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.556/0.556/0.556/0.000 ms
[tidb@tidb182 ~]$ ifconfig
ens192: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.1.192 netmask 255.255.255.255 broadcast 192.168.1.192
inet6 fe80::1cd1:5684:7d8:a460 prefixlen 64 scopeid 0x20
inet6 fe80::df2b:7371:4a97:e1a0 prefixlen 64 scopeid 0x20
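A successful ping only proves layer-3 (ICMP) reachability. The reload error above is "connect: connection refused" on TCP 2379, which means no PD process is listening on that port. A quick hedged check (curl assumed available; /pd/api/v1/members is PD's member-list API endpoint):

```shell
# Probe each new PD endpoint over TCP; "not reachable" here means either the
# port is closed (PD not running) or the host is filtering it.
for host in 192.168.1.192 192.168.1.193 192.168.1.194; do
  curl -s -o /dev/null --connect-timeout 2 "http://${host}:2379/pd/api/v1/members" \
    && echo "${host}:2379 reachable" || echo "${host}:2379 not reachable"
done
```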

| username: xfworld | Original post link

Check the firewall…

If you follow the modification process of tiup, there won’t be any issues.

| username: TiDBer_kfkPs53L | Original post link

The firewall is off. When I changed the IP addresses back to the previous ones, the cluster worked, so the firewall is definitely not the problem (otherwise changing the IPs back would not have helped). Is it possible that 【SOP Series 12】TiUP Modify Cluster IP, Port, and Directory omits some required modifications?

| username: xiaohetao | Original post link

Check whether the SSH mutual trust between the control machine and the new hosts is still in place.
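That SSH check can be sketched as follows, using the cluster's TiUP-managed key (path taken from the SSHKeySet line in the log output earlier in the thread):

```shell
# Verify the control machine can still log in to each new PD host with the
# key TiUP uses for this cluster (key path from the reload log above)
KEY=/home/tidb/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa
for host in 192.168.1.192 192.168.1.193 192.168.1.194; do
  ssh -i "$KEY" -o BatchMode=yes -o ConnectTimeout=3 "tidb@${host}" true \
    && echo "${host}: ssh ok" || echo "${host}: ssh FAILED"
done
```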

| username: wuxiangdong | Original post link

Manually start the PD nodes using their run_pd_xx.sh scripts (also modifying the IPs inside them), then try starting the cluster.
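For context, a TiUP-generated run_pd.sh roughly looks like the sketch below (the flag names are genuine pd-server flags, but the paths and values here are illustrative). Every old address has to be replaced, including the --initial-cluster list that names all three peers. Note also that an existing PD data directory persists the old member peer URLs internally, so editing the scripts alone may not be sufficient.

```shell
# Illustrative sketch of a TiUP-generated PD start script (paths and values
# are assumptions; only the flag names are real pd-server flags).
cat > run_pd_demo.sh <<'EOF'
#!/bin/bash
exec bin/pd-server \
    --name=pd-192.168.1.192-2379 \
    --client-urls=http://192.168.1.192:2379 \
    --advertise-client-urls=http://192.168.1.192:2379 \
    --peer-urls=http://192.168.1.192:2380 \
    --advertise-peer-urls=http://192.168.1.192:2380 \
    --data-dir=/tidb-data/pd-2379 \
    --initial-cluster=pd-192.168.1.192-2379=http://192.168.1.192:2380,pd-192.168.1.193-2379=http://192.168.1.193:2380,pd-192.168.1.194-2379=http://192.168.1.194:2380 \
    --config=conf/pd.toml \
    --log-file=/tidb-deploy/pd-2379/log/pd.log
EOF
# A leftover old address anywhere in the script would break startup;
# this grep should find nothing:
grep -n '192\.168\.1\.18[234]' run_pd_demo.sh || echo "no old IPs left"
```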