Help! The production cluster has crashed, and PD cannot elect a leader

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 救急!!!,生产集群崩了,pd没办法选举 (Urgent!!! The production cluster crashed and PD cannot elect a leader)

| username: TiDBer_vC2lhR9G

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.0

[Encountered Problem: Phenomenon and Impact]

I accidentally scaled out PD using an external IP address. Since all of the original cluster's IPs are internal, this caused PD connectivity issues, and after the scale-out the cluster became inaccessible. I tried to scale in the newly added node and to scale out another internal node, but neither worked.

PD keeps trying to connect to this external-IP PD, but that PD has already been taken down.

Attempting to start TiDB results in an error:

Could you please check if there’s a way to recover, or help with remote repair? Compensation is available.

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: 我是咖啡哥 | Original post link

Did the newly scaled-out node get deleted? I only see 2 nodes. Have you restarted? 2 nodes cannot elect a leader.

| username: TiDBer_vC2lhR9G | Original post link

I tried scaling out, scaling in, and restarting; the configuration may be messed up now.

| username: xfworld | Original post link

  1. First, use tiup to decommission the node instances in the external network environment directly.

  2. Then, use tiup to start the internal cluster and observe any errors.

  3. Based on the error logs, troubleshoot the issues step by step and resolve them one by one.
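The steps above can be sketched as tiup commands. This is a hedged sketch, not a verified runbook: the cluster name `main_tidb` is taken from a later reply in this thread, and the external PD address `203.0.113.10:2379` is a placeholder for the mistakenly added node.

```shell
# 1. Forcibly scale in the external-IP PD node. It is already down,
#    so --force skips the graceful decommission. (Address is a placeholder.)
tiup cluster scale-in main_tidb -N 203.0.113.10:2379 --force

# 2. Start the remaining internal cluster and watch for errors.
tiup cluster start main_tidb

# 3. Check component status; dig into logs for any component not "Up".
tiup cluster display main_tidb
```

`scale-in --force` removes the node from tiup's topology even when it is unreachable, which matches this situation where the external PD has already been taken offline.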

| username: Kongdom | Original post link

You can refer to this to handle it: change the external IP to the internal IP.
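The linked post is not preserved here, so as an assumption: one common way to change the recorded IP is to edit tiup's cluster metadata file, which lives at `~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml`. The sketch below uses a stand-in file and placeholder addresses (`203.0.113.10` external, `10.0.1.10` internal).

```shell
# Create a tiny stand-in for meta.yaml (the real file is at
# ~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml).
cat > meta.yaml <<'EOF'
pd_servers:
- host: 203.0.113.10
  client_port: 2379
EOF

cp meta.yaml meta.yaml.bak                         # keep a backup first
sed -i 's/203\.0\.113\.10/10.0.1.10/g' meta.yaml   # external -> internal IP
grep 'host:' meta.yaml
```

After editing the metadata, a `tiup cluster reload` is needed for the change to take effect.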

| username: TiDBer_vC2lhR9G | Original post link

I changed the original file name, and an error occurred when executing `tiup cluster reload main_tidb -R pd --force`.

| username: Kongdom | Original post link

Is the error due to network connectivity issues?

| username: 考试没答案 | Original post link

The `-R` flag is a role filter: the operation is applied to every node of that role at once, which can bring down the entire cluster. In the future, please use `-N` instead to restart nodes one by one.
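The difference can be illustrated as follows. This is a sketch: the cluster name `main_tidb` comes from earlier in the thread, and the node address is a placeholder.

```shell
# -R applies the operation to EVERY node of the given role at once;
# here all PD nodes reload together, risking loss of quorum:
tiup cluster reload main_tidb -R pd

# -N targets a single node, so the other PDs keep serving while it reloads:
tiup cluster reload main_tidb -N 10.0.1.10:2379
```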

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.