PD node fails to start after abnormal crash

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 异常宕机后pd节点起不来

| username: 人如其名

【TiDB Environment】centos7_x86
【TiDB Version】6.3.0
【Encountered Issue】PD node fails to start after a virtual machine crash
【Problem Phenomenon and Impact】

pd_log.txt (8.1 KB)

Error in the PD log: [2022/10/05 18:40:32.300 +08:00] [PANIC] [key_index.go:82] ["'put' with an unexpected smaller revision"] [given-revision-main=209167] [given-revision-sub=0] [modified-revision-main=209167] [modified-revision-sub=0]
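To make the numbers in that PANIC line easier to read, here is a minimal Python sketch that extracts the revision fields and compares them. It assumes, based on the wording of the message, that etcd expects the revision given to `put` to be strictly newer than the key's last modified revision; the helper names `parse_revisions` and `is_strictly_newer` are made up for illustration and are not part of PD or etcd.

```python
import re

# The PANIC line quoted above, with typographic quotes normalized to ASCII.
LOG_LINE = (
    "[2022/10/05 18:40:32.300 +08:00] [PANIC] [key_index.go:82] "
    "[\"'put' with an unexpected smaller revision\"] "
    "[given-revision-main=209167] [given-revision-sub=0] "
    "[modified-revision-main=209167] [modified-revision-sub=0]"
)


def parse_revisions(line):
    """Collect every [name=integer] field of a log line into a dict."""
    return {k: int(v) for k, v in re.findall(r"\[([\w-]+)=(-?\d+)\]", line)}


def is_strictly_newer(fields):
    """Compare (main, sub) pairs; Python tuples compare lexicographically."""
    given = (fields["given-revision-main"], fields["given-revision-sub"])
    modified = (fields["modified-revision-main"], fields["modified-revision-sub"])
    return given > modified


fields = parse_revisions(LOG_LINE)
print(fields)
print("given revision strictly newer:", is_strictly_newer(fields))
```

In this log the given and modified revisions are identical (209167, 0), so the comparison fails. That would be consistent with the same revision being applied twice after the crash, though the log alone does not prove the cause.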

How can I bypass the error and start PD normally without recreating PD?

| username: 人如其名 | Original post link

If PD is recreated (by clearing its data directory), the cluster can be repaired with the official PD-recover tool. The link is as follows:

But isn't the etcd used by PD supposed to write synchronously and persistently? Why would the index be inconsistent after a restart? This PD also failed to start after an earlier abnormal crash, and I rebuilt it using the method above.

| username: forever | Original post link

My virtual machine has hit this problem too. I suspect it is a virtual machine issue: as I understand it, when the virtual machine crashes, some of the data written by PD and etcd is never saved, which leaves some nodes with inconsistent data.

| username: 人如其名 | Original post link

So a virtual machine's disk writes do not actually reach the host's hard drive in real time? Most of my production environment runs on virtual machines, and I'm not sure whether this issue could occur there.

| username: 大鱼海棠 | Original post link

I think you need to check how the virtual machine handles disk writes; etcd writes to disk synchronously by default.
In production, you should still deploy 3 PD nodes to ensure high availability.
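As a rough illustration of why 3 nodes are suggested: etcd uses Raft, which needs a majority of members to be available. The sketch below only does that arithmetic; the function names are invented for this example.

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

With a single PD there is no spare member, so one damaged data directory stops the service. With 3 members, one can fail and the remaining two still form a majority, which should allow the broken member to be replaced rather than rebuilding the whole PD cluster.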

| username: forever | Original post link

The virtual machine software used in production may differ from what I use personally, and I haven't studied the mechanism in detail. To the Windows host, the virtual machine is just a program, and its disk writes appear to be asynchronous. So data may look as if it has been written to disk inside the virtual machine, while on the host (where the virtual disk is stored as a set of files) it may not have been flushed to disk yet. I tried powering off the virtual machine directly without shutting down the guest operating system, which is like a power outage for the virtual machine, and PD had no problems (the host was still running, so the virtual disk files should have been saved automatically). However, when I could not reach my computer remotely and had to force the host itself to shut down, PD could no longer start.
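For reference, this is roughly what a synchronous write looks like from inside the guest. It is a minimal Python sketch, not etcd's actual code; the file name and the `durable_write` helper are hypothetical. Whether the `fsync` call really reaches the physical disk depends on how the hypervisor caches the virtual disk, which is the open question in this thread.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data, then ask the OS to flush it to the storage device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        # Blocks until the guest kernel reports the data as flushed.
        # A hypervisor with write-back caching may still hold it in host memory.
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal-demo.bin")  # hypothetical file name
    durable_write(path, b"revision=209167\n")
    with open(path, "rb") as f:
        content = f.read()

print(content)
```

If the host ignores or delays the flush, the guest can believe a write is durable when it is not, and a forced host shutdown can then lose it. That would match the behavior described above, where killing only the virtual machine was harmless but forcing the host off was not.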
