PD fails to start and reports a cluster ID mismatch

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd无法启动,提示集群id不匹配

| username: 月明星稀

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.0
[Reproduction Path] Fails right after creation
[Encountered Problem: Symptoms and Impact] PD fails to start with the error log below. Could someone take a look at what might be causing this? No cluster ID is set anywhere in the configuration, and the cluster was deployed with TiUP.
[FATAL] [main.go:117] ["run server failed"] [error="Etcd cluster ID mismatch, expect 15606846352041391624, got 16540809998682290851"] [stack="main.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:117\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
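
When more than one PD node fails this way, it can help to pull the two IDs out of each node's log and compare them side by side. A minimal sketch, assuming the FATAL line has the format shown in the post; the function name `extract_cluster_ids` and the log path in the usage comment are hypothetical, not part of PD or TiUP:

```shell
# Print "<expected_id> <got_id>" for every cluster ID mismatch line read
# from stdin. Lines that do not contain the error are ignored.
# extract_cluster_ids is a hypothetical helper name.
extract_cluster_ids() {
    sed -n -E 's/.*Etcd cluster ID mismatch, expect ([0-9]+), got ([0-9]+).*/\1 \2/p'
}

# Usage (the path is an assumption; use the log directory of your deployment):
#   extract_cluster_ids < /path/to/pd.log
```

If every node prints the same pair, the nodes at least agree with each other on what they see, which narrows the search to stale data or a foreign etcd member.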

| username: tidb菜鸟一只

Is this not the first deployment on these machines? Maybe the environment isn't clean.

| username: 月明星稀

I’ll try destroying the cluster.

| username: TiDB_C罗

Insufficient resources?

| username: ti-tiger

The issue might have one of the following causes:

  1. The PD nodes disagree on the etcd cluster ID, so they fail to start and cannot communicate with each other. This can happen when the nodes were deployed or configured with different --initial-cluster parameters, or when other etcd instances are running on the PD nodes.
  2. The etcd data directories on the PD nodes are inconsistent, so startup and synchronization fail. This can happen when different --data-dir parameters were used during deployment or configuration, or when a data directory was modified or still holds data from an earlier deployment, which carries a different cluster ID.

Solution:

  1. Check the startup parameters of the PD nodes and make sure every node uses the same --initial-cluster parameter, listing all PD nodes. For example, with three PD nodes named pd1, pd2, and pd3:
     --initial-cluster pd1=http://pd1:2380,pd2=http://pd2:2380,pd3=http://pd3:2380
  2. Make sure every PD node uses the same --data-dir parameter and that it points to an empty or cleared data directory. For example, to use /data/pd as the data directory:
     --data-dir /data/pd
  3. If the issue persists, try deleting the etcd data directory on all PD nodes and restarting them. This destroys all existing data and configuration, so back up first and proceed with caution. For example, with /data/pd as the data directory:
     rm -rf /data/pd
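
A more cautious variant of the deletion step is to move the directory aside instead of removing it, so it can be restored if something else turns out to be wrong. A minimal sketch, assuming PD is already stopped on the node; the function names and the `.bak.<timestamp>` suffix are hypothetical, and /data/pd is only the example path used in this reply:

```shell
# Return 0 if the directory is missing or has no entries, 1 otherwise.
# pd_data_dir_is_clean is a hypothetical helper name.
pd_data_dir_is_clean() {
    dir="${1%/}"
    if [ ! -d "$dir" ]; then
        return 0
    fi
    if [ -z "$(ls -A "$dir")" ]; then
        return 0
    fi
    return 1
}

# Move a data directory aside instead of deleting it, recreate it empty,
# and print the backup path. Assumes PD is stopped. The new directory gets
# the current user's default ownership and permissions; adjust as needed.
move_pd_data_aside() {
    dir="${1%/}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    backup="${dir}.bak.$(date +%Y%m%d%H%M%S)"
    mv "$dir" "$backup" && mkdir -p "$dir" && echo "$backup"
}

# Usage (example path from the reply above):
#   pd_data_dir_is_clean /data/pd || move_pd_data_aside /data/pd
```

For a cluster deployed with TiUP, the actual path is whatever `data_dir` is set to for the PD instances in the topology file, so check that value before running anything against /data/pd.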

| username: Kongdom

Did you rebuild the PD?