PD startup failure

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd启动失败

| username: dxss-lee

When the cluster restarts, a PD node fails to start, and its log reports that the file 00000000000dbba9.snap.db cannot be found. Checking the directory named in the log, I found a file named 0000000000000013-00000000000dbba9.snap instead.

The PD nodes that start normally have the same kind of 00000000000000xx-00000000000xxxxx.snap files in the same directory as the faulty node, and none of them has a file ending in snap.db either.
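
To make the relationship between the two file names explicit, here is a minimal sketch. It assumes the etcd naming convention in which a raft snapshot is named <term>-<index>.snap and the matching backend database snapshot is named <index>.snap.db, each field being 16 hexadecimal digits. The function names are hypothetical and are not part of PD or etcd.

```python
import re

# Assumed etcd naming: "<term>-<index>.snap", each field 16 hex digits.
SNAP_RE = re.compile(r"^([0-9a-f]{16})-([0-9a-f]{16})\.snap$")


def parse_snap_name(name):
    """Return (term, index) for a raft snapshot file name, or None if it does not match."""
    m = SNAP_RE.match(name)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


def expected_db_name(index):
    """Name of the database snapshot file assumed to pair with this raft index."""
    return f"{index:016x}.snap.db"


term, index = parse_snap_name("0000000000000013-00000000000dbba9.snap")
print(term, index, expected_db_name(index))
```

Under that assumption, the index part of the .snap file found in the directory (dbba9) is the same index as in the .snap.db name that the log reports as missing.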

Removing the node and adding it back fixes the problem, but it is very likely to recur the next time the cluster is stopped and restarted.

Could you please explain what causes this issue and if there are any other ways to fix this error besides deleting and adding the node again?

| username: 考试没答案 | Original post link

The cases I found on Baidu point to data file corruption. Understanding it requires knowing more about etcd, because PD embeds etcd. https://zhuanlan.zhihu.com/p/558940973

| username: 考试没答案 | Original post link

Has one node crashed and failed to start, or have all three nodes crashed and failed to start?

| username: dxss-lee | Original post link

One of the three PDs cannot start.

| username: songxuecheng | Original post link

Copy the file 0000000000000013-00000000000dbba9.snap to another location for backup, then delete it and try restarting.
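
A minimal sketch of that backup-then-delete step, assuming the PD process on the faulty node is already stopped. The function name and the backup location are hypothetical, and the snap file path depends on where your PD data directory lives.

```python
import shutil
from pathlib import Path


def backup_and_remove(snap_file, backup_dir):
    """Copy snap_file into backup_dir, then delete the original.

    Returns the path of the backup copy. Run this only while PD is stopped.
    """
    src = Path(snap_file)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # copy contents and timestamps
    if dest.stat().st_size != src.stat().st_size:
        raise IOError(f"backup of {src} looks incomplete")
    src.unlink()  # remove the original only after the copy is checked
    return dest
```

For example, backup_and_remove("<pd-data-dir>/member/snap/0000000000000013-00000000000dbba9.snap", "/tmp/pd-snap-backup"), where <pd-data-dir> stands for the data directory of the faulty PD node and /tmp/pd-snap-backup is an arbitrary backup location. Copying before deleting keeps the file available in case the restart does not go as expected.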

| username: dxss-lee | Original post link

There is an alert indicating “exceeded recommended request limit.” I am not sure if it has any impact.

| username: 考试没答案 | Original post link

Please list the contents of the member/snap directory under the three PDs using the command ls -al. Let me take a look.

| username: dxss-lee | Original post link

After deleting all the snap files under the faulty node, the node started up. Thank you. Could you please explain the principle behind this?

| username: songxuecheng | Original post link

Reading up on etcd's backup and recovery will help you understand it.

| username: 考试没答案 | Original post link

PD has etcd built in, so take a look at etcd's fault recovery. I manually deleted a snap file, and the node recovered after a restart.

| username: dxss-lee | Original post link

There are a total of three nodes, and the second image shows the faulty node (data after recovery).
