Unexpected Failure After PD Cluster Restart: Members in Disarray

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD集群重启后莫名故障,member错乱

| username: TiDBer_G64jJ9u8

[TiDB Usage Environment] Test/PoC
[TiDB Version] 6.5.0
[Reproduction Path] Fresh installation; restarted the cluster by scaling PD, TiKV, and TiDB down and back up through Kubernetes
[Encountered Issue: Problem Phenomenon and Impact] PD cluster members are in disarray
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]
Three nodes: pd0, pd1, pd2

During the first failure, the pd0 container kept restarting and port 2379 was unreachable. Running pd-ctl member on pd1 and pd2 showed the member list in the following disarray:
/ # ./pd-ctl member
{
  "header": {
    "cluster_id": 7314281662141429344
  },
  "members": [
    {
      "name": "basic-pd-1",
      "member_id": 40391632394141711,
      "peer_urls": [
        "http://basic-pd-0.basic-pd-peer.my-app.svc:2380"
      ],
      "client_urls": [
        "http://basic-pd-1.basic-pd-peer.my-app.svc:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v6.5.0",
      "git_hash": "d1a4433c3126c77fb2d5bb5720eefa0f2e05c166"
    },
    {
      "name": "basic-pd-1",
      "member_id": 3188211451946452644,
      "peer_urls": [
        "http://basic-pd-1.basic-pd-peer.my-app.svc:2380"
      ],
      "client_urls": [
        "http://basic-pd-1.basic-pd-peer.my-app.svc:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v6.5.0",
      "git_hash": "d1a4433c3126c77fb2d5bb5720eefa0f2e05c166"
    },
    {
      "name": "basic-pd-2",
      "member_id": 16081184113326999218,
      "peer_urls": [
        "http://basic-pd-2.basic-pd-peer.my-app.svc:2380"
      ],
      "client_urls": [
        "http://basic-pd-2.basic-pd-peer.my-app.svc:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v6.5.0",
      "git_hash": "d1a4433c3126c77fb2d5bb5720eefa0f2e05c166"
    }
  ],
  "leader": {
    "name": "basic-pd-1",
    "member_id": 40391632394141711,
    "peer_urls": [
      "http://basic-pd-0.basic-pd-peer.my-app.svc:2380"
    ],
    "client_urls": [
      "http://basic-pd-1.basic-pd-peer.my-app.svc:2379"
    ],
    "deploy_path": "/",
    "binary_version": "v6.5.0",
    "git_hash": "d1a4433c3126c77fb2d5bb5720eefa0f2e05c166"
  },
  "etcd_leader": {
    "name": "basic-pd-1",
    "member_id": 40391632394141711,
    "peer_urls": [
      "http://basic-pd-0.basic-pd-peer.my-app.svc:2380"
    ],
    "client_urls": [
      "http://basic-pd-1.basic-pd-peer.my-app.svc:2379"
    ],
    "deploy_path": "/",
    "binary_version": "v6.5.0",
    "git_hash": "d1a4433c3126c77fb2d5bb5720eefa0f2e05c166"
  }
}

Two members named basic-pd-1 appeared, and in the first of them the peer_urls and client_urls pointed to different pods (pd0's peer URL but pd1's client URL). The leader was pd1, and at this point the database was still accessible through TiDB.
To restore the cluster, I then tried to transfer the leader to pd2 and delete member 40391632394141711 (the pd-ctl commands are sketched at the end of this post). Ideally pd1 and pd2 would then form a normal cluster, but unexpectedly:
/ # ./pd-ctl member
{
  "header": {
    "cluster_id": 7314281662141429344
  },
  "members": [
    {
      "name": "basic-pd-0",
      "member_id": 3188211451946452644,
      "peer_urls": [
        "http://basic-pd-1.basic-pd-peer.my-app.svc:2380"
      ],
      "client_urls": [
        "http://basic-pd-0.basic-pd-peer.my-app.svc:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v6.5.0",
      "git_hash": "d1a4433c3126c77fb2d5bb5720eefa0f2e05c166"
    },
    {
      "name": "basic-pd-2",
      "member_id": 16081184113326999218,
      "peer_urls": [
        "http://basic-pd-2.basic-pd-peer.my-app.svc:2380"
      ],
      "client_urls": [
        "http://basic-pd-2.basic-pd-peer.my-app.svc:2379"
      ],
      "deploy_path": "/",
      "binary_version": "v6.5.0",
      "git_hash": "d1a4433c3126c77fb2d5bb5720eefa0f2e05c166"
    }
  ],
  "leader": {
    "name": "basic-pd-2",
    "member_id": 16081184113326999218,
    "peer_urls": [
      "http://basic-pd-2.basic-pd-peer.my-app.svc:2380"
    ],
    "client_urls": [
      "http://basic-pd-2.basic-pd-peer.my-app.svc:2379"
    ],
    "deploy_path": "/",
    "binary_version": "v6.5.0",
    "git_hash": "d1a4433c3126c77fb2d5bb5720eefa0f2e05c166"
  },
  "etcd_leader": {
    "name": "basic-pd-2",
    "member_id": 16081184113326999218,
    "peer_urls": [
      "http://basic-pd-2.basic-pd-peer.my-app.svc:2380"
    ],
    "client_urls": [
      "http://basic-pd-2.basic-pd-peer.my-app.svc:2379"
    ],
    "deploy_path": "/",
    "binary_version": "v6.5.0",
    "git_hash": "d1a4433c3126c77fb2d5bb5720eefa0f2e05c166"
  }
}
I intended to delete pd0, but pd1 was deleted instead, and pd0 still was not working properly; its entry still looked like this:
"peer_urls": [
  "http://basic-pd-1.basic-pd-peer.my-app.svc:2380"
],
"client_urls": [
  "http://basic-pd-0.basic-pd-peer.my-app.svc:2379"
],

I then tried deleting pd1's data and restarting it, but it could not rejoin the cluster. Honestly, PD feels really fragile and keeps running into problems.
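For reference, the leader transfer and member deletion described above map to pd-ctl commands roughly like the following (a sketch built from the member names and ID shown in the output above, not a verbatim transcript; the final call just re-checks the member list):

/ # ./pd-ctl member leader transfer basic-pd-2
/ # ./pd-ctl member delete id 40391632394141711
/ # ./pd-ctl member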

First failure pd0 log
pd0 (801.2 KB)

| username: Billmay表妹 | Original post link

Take a look at your component deployment:

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

| username: Miracle | Original post link

When did the failure occur: after deployment, after scaling down, or after scaling up?

| username: TiDBer_G64jJ9u8 | Original post link

The failure occurred on startup, after scaling back up.

| username: Miracle | Original post link

Did you specify pd-0 for scaling down?

| username: TiDBer_G64jJ9u8 | Original post link

I scaled down the whole cluster, including PD, TiKV, and TiDB. The goal was to restart the entire service: scale everything down to 0, then scale everything back up to 3.
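Concretely, with TiDB Operator that amounts to changing the replica counts in the TidbCluster CR, along these lines (a sketch only; the cluster name basic and namespace my-app are inferred from the pod names above, and whether the Operator actually accepts replicas of 0 is discussed further down in this thread):

kubectl -n my-app patch tc basic --type merge -p '{"spec":{"pd":{"replicas":0},"tikv":{"replicas":0},"tidb":{"replicas":0}}}'
# wait for all pods to terminate, then scale back up
kubectl -n my-app patch tc basic --type merge -p '{"spec":{"pd":{"replicas":3},"tikv":{"replicas":3},"tidb":{"replicas":3}}}'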

| username: Miracle | Original post link

Scale down to 2 instances?
If it’s just for restarting, there’s no need to scale down and then scale up again. Can’t you just restart the pod directly?
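For a pod managed by a StatefulSet, restarting it directly usually just means deleting the pod and letting the controller recreate it in place, for example (pod name and namespace assumed from the output above):

kubectl -n my-app delete pod basic-pd-0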

| username: Miracle | Original post link

I thought you were scaling down to 2 instances and then back up to 3; I misunderstood.
Are your pods starting in parallel?

| username: TiDBer_G64jJ9u8 | Original post link

Is parallel startup a problem as well?

| username: 小龙虾爱大龙虾 | Original post link

First of all, I don’t understand k8s very well. I think scaling down to 0 won’t work. PD is stateful, and if you scale it down to 0, all the data will be gone. When you scale it back up to 3, will these new instances use the previous PVCs? Probably not, right?

| username: Miracle | Original post link

Here's my guess: :joy:
Since our StatefulSets start their pods in order, scaling to 0 and then back up to 3 has never caused any issues for us.
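Whether a StatefulSet starts its pods sequentially or all at once is controlled by its podManagementPolicy (OrderedReady is sequential, Parallel is all at once); you can check it with something like this (the StatefulSet name basic-pd is assumed from the pod names):

kubectl -n my-app get sts basic-pd -o jsonpath='{.spec.podManagementPolicy}'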

| username: Miracle | Original post link

What he means by "scaling down to 0" is just shutting down all the PDs. As long as the PV data is not deliberately deleted, the data is still there, and scaling back up to 3 amounts to restarting the PDs.

| username: 小龙虾爱大龙虾 | Original post link

Is that so? Check whether the PVCs bound to the pods are still the original ones.
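A quick way to check is to list the PVCs each pod actually mounts and compare them (names and ages) against the existing claims, for example (pod name and namespace assumed from the output above):

kubectl -n my-app get pod basic-pd-0 -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'
kubectl -n my-app get pvc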

| username: WalterWj | Original post link

This operation is quite impressive. Why not do it this way: Restart a TiDB Cluster on Kubernetes | PingCAP Docs (重启 Kubernetes 上的 TiDB 集群 | PingCAP 文档中心)
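If I remember that document correctly, the rolling-restart approach it describes is to add (or bump) a tidb.pingcap.com/restartedAt annotation under the spec of the component you want to restart in the TidbCluster CR. Treat the exact field and command below as my reading of the docs rather than something verified here (cluster name basic and namespace my-app assumed, and the feature depends on your Operator version):

# setting the annotation to any new value, e.g. the current time, should trigger a rolling restart of PD
kubectl -n my-app patch tc basic --type merge -p '{"spec":{"pd":{"annotations":{"tidb.pingcap.com/restartedAt":"2024-01-01T12:00:00"}}}}'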

| username: TiDBer_G64jJ9u8 | Original post link

Previously, I also suspected that the PVC binding was incorrect. After checking the PVC binding, it turned out to be correct.

| username: TiDBer_G64jJ9u8 | Original post link

The link Restart a TiDB Cluster on Kubernetes | PingCAP Docs is completely unreliable:
I tried the command

kubectl -n ${namespace} annotate pod ${tikv_pod_name} tidb.pingcap.com/evict-leader="delete-pod"

TiKV keeps running as if nothing happened; there is no reaction at all.

| username: yiduoyunQ | Original post link

The Operator does not support "scaling PD down to 0 and then back up to 3," and TiUP cannot do this kind of operation through normal means either.

| username: WalterWj | Original post link

It’s too difficult :thinking:

| username: TiDBer_G64jJ9u8 | Original post link

Then you need to provide a reliable way to restart! There isn't one? Surprising!
I'll try the other commands in that document (Restart a TiDB Cluster on Kubernetes | PingCAP Docs) today and see whether they are all equally unreliable!

| username: dba远航 | Original post link

Check whether there are any errors in the scale-out configuration.