TiKV cluster fails to scale out PD when PD is down

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 集群在 pd 故障时进行 pd scale out 失败

| username: cat0dog

[TiDB Usage Environment] Production Environment
[TiDB Version]
TiKV v6.5.0
[Reproduction Path] What operations were performed when the issue occurred
A hard disk failure left one PD and one TiKV instance in the Down state.
Attempted to bring up a new PD node via scale-out, then remove the original faulty PD via scale-in.
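For reference, the replacement flow described above is typically two TiUP commands. This is a dry-run sketch only: the cluster name `prod`, the topology file `scale-out.yaml`, and the node address are hypothetical placeholders, not taken from the poster's deployment.

```shell
# Dry-run sketch of the PD replacement flow; all names are placeholders.
CLUSTER=prod
FAULTY_PD=10.0.1.4:2379   # placeholder address of the failed PD

# 1. Scale out a new PD first (scale-out.yaml lists the new pd_servers entry):
SCALE_OUT_CMD="tiup cluster scale-out $CLUSTER scale-out.yaml"
echo "$SCALE_OUT_CMD"

# 2. Once the new PD has joined, scale in the faulty one:
SCALE_IN_CMD="tiup cluster scale-in $CLUSTER --node $FAULTY_PD"
echo "$SCALE_IN_CMD"
```

As the error below shows, step 1 can itself fail when the surviving PDs do not form a healthy quorum, which is what happened here.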

[Encountered Issue: Symptoms and Impact]
Failed to start the new PD node
[2024/01/23 11:27:42.636 +08:00] [FATAL] [main.go:91] ["join meet error"] [error="etcdserver: unhealthy cluster"] [stack="main.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:91\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

[2024/01/23 11:28:58.547 +08:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=endpoint://client-01788ffb-4391-4d54-9a41-e121785ca621/10.81.200.101:3379] [attempt=0] [error="rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"]

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: Screenshots/Logs/Monitoring]

| username: tidb狂热爱好者 | Original post link

You already have two nodes down.

| username: cat0dog | Original post link

The PD with the status "Down|UI" is down due to the hard disk failure; the other PD shown as "Down" is the new one that failed to start during scale-out.

| username: 小龙虾爱大龙虾 | Original post link

Scale in the faulty PD first, then scale out a new one.

| username: cat0dog | Original post link

Due to the disk failure, the PD node cannot be scaled in normally. Do I need to use the --force option to forcibly remove the faulty PD? Are there any precautions for forced removal?

| username: cat0dog | Original post link

After adding --force, the faulty PD was successfully removed. Thank you.
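For anyone hitting the same situation, the forced removal is a single scale-in with `--force`. The cluster name and node address below are placeholders:

```shell
# Force-remove a PD whose host is dead. With --force, TiUP marks the
# node as removed without trying to contact it, so a dead disk does
# not block the scale-in. Names below are placeholders.
CLUSTER=prod
FAULTY_PD=10.0.1.4:2379

FORCE_CMD="tiup cluster scale-in $CLUSTER --node $FAULTY_PD --force"
echo "$FORCE_CMD"
```

Because `--force` skips the normal graceful-removal steps, double-check the `--node` address before running it on a production cluster.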

| username: dba远航 | Original post link

PD can no longer form a majority.

| username: ffeenn | Original post link

Forcibly deleting one of three PDs has basically no impact. If the new PD cannot start during scaling, you can try using pd-recover to restore the PD cluster: PD Recover Documentation | PingCAP Documentation Center.
Follow the normal procedure: restore PD first, then restore TiKV.
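When no PD quorum survives at all, pd-recover rebuilds the PD cluster around one freshly started, empty PD. A rough sketch follows; every value is a placeholder: the cluster ID must be recovered from old PD or TiKV logs, and the alloc ID must exceed the largest ID the old cluster ever allocated.

```shell
# Sketch of a pd-recover invocation. ALL values are placeholders:
# - PD_ENDPOINT: client URL of the new, empty PD
# - CLUSTER_ID:  recovered from old PD/TiKV logs ("init cluster id" lines)
# - ALLOC_ID:    any value safely above the old cluster's max allocated ID
PD_ENDPOINT="http://10.0.1.5:2379"
CLUSTER_ID=1234567890123456789
ALLOC_ID=100000000

RECOVER_CMD="pd-recover -endpoints $PD_ENDPOINT -cluster-id $CLUSTER_ID -alloc-id $ALLOC_ID"
echo "$RECOVER_CMD"
```

After pd-recover reports success, restart the PD service and then bring the TiKV nodes back, matching the PD-first order above.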