TiDB service fails to start after power outage

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 断电后tidb服务启动不了

| username: TiDBer_gJ3eDqHX

[TiDB Usage Environment] Production Environment
[TiDB Version] V5.4.0
[Reproduction Path] Restart TiDB service
[Encountered Problem: Phenomenon and Impact]
I saw an error in the PD log: main.go:122] ["run server failed"] [error="[PD:leveldb:ErrLevelDBOpen]leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000030]"] [stack="main.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:122\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: h5n1 | Original post link

Is this a test environment with only 1 PD deployed? If that PD has failed, use pd-recover to restore it. There are many articles about this in the community column.
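
pd-recover needs the cluster ID and an alloc ID larger than any ID that was already allocated. As far as I recall from the PD Recover documentation, both can be found in the old PD logs by searching for "init cluster id" and "idAllocator allocates a new id". The sketch below assumes those log fragments; the function names and the sample log lines are made up for illustration (only the cluster ID is the one from this thread), so check the patterns against your own pd.log first.

```python
import re
from typing import Iterable, Optional

# Assumed log fragments; verify them against the logs of your PD version.
CLUSTER_ID_RE = re.compile(r'\["init cluster id"\] \[cluster-id=(\d+)\]')
ALLOC_ID_RE = re.compile(r'\["idAllocator allocates a new id"\] \[alloc-id=(\d+)\]')


def find_cluster_id(lines: Iterable[str]) -> Optional[int]:
    """Return the cluster ID from the last matching log line, or None."""
    found = None
    for line in lines:
        match = CLUSTER_ID_RE.search(line)
        if match:
            found = int(match.group(1))
    return found


def find_max_alloc_id(lines: Iterable[str]) -> Optional[int]:
    """Return the largest allocated ID seen in the log lines, or None."""
    largest = None
    for line in lines:
        match = ALLOC_ID_RE.search(line)
        if match:
            value = int(match.group(1))
            if largest is None or value > largest:
                largest = value
    return largest


# Illustrative sample lines, not taken from a real cluster.
sample_log = [
    '[2023/11/01 09:00:00.000 +08:00] [INFO] [server.go:212] ["init cluster id"] [cluster-id=7088536805883498676]',
    '[2023/11/01 09:00:05.000 +08:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=33000]',
    '[2023/11/05 14:20:11.000 +08:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=34000]',
]

print("cluster-id:", find_cluster_id(sample_log))
print("largest alloc-id seen:", find_max_alloc_id(sample_log))
```

To use it on a real log, pass an open file, for example `find_max_alloc_id(open("/path/to/pd.log"))` (the path is a placeholder), and give pd-recover an alloc ID larger than the value found.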

| username: ShawnYan | Original post link

Only one PD in a production environment? PD is the brain of the cluster; without HA, it is hard to repair once it breaks down. Use the tool mentioned above, or consider seeking official vendor support.

| username: TiDBer_gJ3eDqHX | Original post link

If I do this in the production environment, will all my data be lost?

| username: h5n1 | Original post link

Data will not be lost. Once it is recovered, configure at least 3 PD nodes.
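
The reason for recommending 3 PD nodes is that PD keeps its metadata in an embedded etcd, which needs a majority of members to be available. A small sketch of that arithmetic (the function names are just for illustration):

```python
def quorum(members: int) -> int:
    """Smallest majority of a Raft/etcd group with this many members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still available."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} PD node(s): quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

With 1 PD, no failure can be tolerated, which is the situation in this thread. With 3 PDs, one node can be lost and the cluster keeps working.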

| username: TiDBer_gJ3eDqHX | Original post link

We only have one right now.

| username: Fly-bird | Original post link

Let’s first see whether PD can be repaired. Experimenting in a production environment is too risky. Going forward, I suggest deploying PD instances on the TiDB servers; three PD instances should be sufficient.

| username: TiDBer_gJ3eDqHX | Original post link

Are there any other solutions if it cannot be fixed?

| username: TiDBer_gJ3eDqHX | Original post link

My PD currently cannot start, and running pd-recover reports the following error:

./pd-recover -endpoints http://10.0.0.30:2379 -cluster-id 7088536805883498676 -alloc-id 35001
{"level":"warn","ts":"2023-11-07T19:34:38.738+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-464995ca-5e66-4985-b5a7-eacc0288f143/10.0.0.30:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.0.30:2379: connect: no route to host\""}
context deadline exceeded

| username: 芮芮是产品 | Original post link

No. You can contact me to help restore PD. I am in the second group chat, under the name Oscar Jiang Ming.

| username: h5n1 | Original post link

Rename the original PD data directory and restart PD, or redeploy a new PD-only cluster, then use pd-recover. You need to have a functioning PD first; pd-recover cannot connect to a PD that is not running.
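
The "connect: no route to host" message in the pd-recover output above means the TCP connection to 10.0.0.30:2379 could not even be opened. A minimal sketch for checking that before running pd-recover is shown here; `endpoint_reachable` is a hypothetical helper name, and a successful connection only shows that something is listening on the port, not that PD is healthy.

```python
import socket
from urllib.parse import urlsplit


def endpoint_reachable(endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint's host:port can be opened."""
    parts = urlsplit(endpoint)
    host, port = parts.hostname, parts.port
    if host is None or port is None:
        raise ValueError(f"expected a URL like http://host:port, got {endpoint!r}")
    try:
        # Connection refused, timeouts and "no route to host" all raise OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call with the endpoint from this thread:
# endpoint_reachable("http://10.0.0.30:2379")
```

If this returns False, fix the network or start the new PD first; pd-recover will keep failing with "context deadline exceeded" until the endpoint accepts connections.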

| username: TiDBer_gJ3eDqHX | Original post link

Which group is it?

| username: zxgaa | Original post link

Usually three PDs are deployed.

| username: ShawnYan | Original post link

Add Billmay表妹 (the "cousin") on WeChat; there is a mutual-assistance group chat.

| username: 像风一样的男子 | Original post link

Master Jiang is awesome. Hurry up and contact him.

| username: Billmay表妹 | Original post link

Follow everyone’s suggestions.
