3 PD Nodes Down: "Load from etcd Meet Error"

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 3 PD nodes down: "load from etcd meet error"

| username: h5n1

[TiDB Version] v5.2.3 ARM
This morning, I found that the cluster’s PD and TiDB were down. Upon checking the logs of the 151 PD node, issues started occurring around 22:16 on the night of April 15th, with errors similar to: [etcdutil.go:122] [“load from etcd meet error”] [key=/pd/7024798609142208243/leader] [error=“[PD:etcd:ErrEtcdKVGet]context deadline exceeded”]. Restarting the cluster, the 151 PD node fails to start.

| username: WalterWj | Original post link

Scale it out, or provide the complete pd.log so I can check whether there is any useful information.

| username: h5n1 | Original post link

pdlog.rar (25.4 MB)

| username: WalterWj | Original post link

Why are there two on 151?

| username: h5n1 | Original post link

After the restart, I temporarily scaled out an extra one. The original one will be scaled in once troubleshooting is complete.

| username: WalterWj | Original post link

Go into pd-ctl, run member, and paste the output as text so I can take a look.

| username: h5n1 | Original post link

{
  "header": {
    "cluster_id": 7024798609142208243
  },
  "members": [
    {
      "name": "pd-10.172.65.152-23792",
      "member_id": 9357444448847789453,
      "peer_urls": [
        "http://10.172.65.152:23802"
      ],
      "client_urls": [
        "http://10.172.65.152:23792"
      ],
      "deploy_path": "/tidb/pd2/bin",
      "binary_version": "v5.2.3",
      "git_hash": "02139dc2a160e24215f634a82b943b2157a2e8ed"
    },
    {
      "name": "pd-10.172.65.151-23795",
      "member_id": 11468259817903508096,
      "peer_urls": [
        "http://10.172.65.151:23805"
      ],
      "client_urls": [
        "http://10.172.65.151:23795"
      ],
      "deploy_path": "/tidb/pd2_a/bin",
      "binary_version": "v5.2.3",
      "git_hash": "02139dc2a160e24215f634a82b943b2157a2e8ed"
    },
    {
      "name": "pd-10.172.65.151-23792",
      "member_id": 12415486320424082229,
      "peer_urls": [
        "http://10.172.65.151:23802"
      ],
      "client_urls": [
        "http://10.172.65.151:23792"
      ],
      "deploy_path": "/tidb/pd2/bin",
      "binary_version": "v5.2.3",
      "git_hash": "02139dc2a160e24215f634a82b943b2157a2e8ed"
    },
    {
      "name": "pd-10.172.65.146-23792",
      "member_id": 14508203407204661733,
      "peer_urls": [
        "http://10.172.65.146:23802"
      ],
      "client_urls": [
        "http://10.172.65.146:23792"
      ],
      "deploy_path": "/tidb/pd2/bin",
      "binary_version": "v5.2.3",
      "git_hash": "02139dc2a160e24215f634a82b943b2157a2e8ed"
    }
  ],
  "leader": {
    "name": "pd-10.172.65.152-23792",
    "member_id": 9357444448847789453,
    "peer_urls": [
      "http://10.172.65.152:23802"
    ],
    "client_urls": [
      "http://10.172.65.152:23792"
    ],
    "deploy_path": "/tidb/pd2/bin",
    "binary_version": "v5.2.3",
    "git_hash": "02139dc2a160e24215f634a82b943b2157a2e8ed"
  },
  "etcd_leader": {
    "name": "pd-10.172.65.152-23792",
    "member_id": 9357444448847789453,
    "peer_urls": [
      "http://10.172.65.152:23802"
    ],
    "client_urls": [
      "http://10.172.65.152:23792"
    ],
    "deploy_path": "/tidb/pd2/bin",
    "binary_version": "v5.2.3",
    "git_hash": "02139dc2a160e24215f634a82b943b2157a2e8ed"
  }
}
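The duplicate that WalterWj noticed ("Why are there two on 151?") can be spotted programmatically. A minimal sketch, using an abbreviated copy of the member list above (the `members_per_host` helper is my own illustration, not a pd-ctl feature), that groups PD members by host:

```python
import json
from collections import defaultdict
from urllib.parse import urlparse

# Abbreviated member list copied from the `pd-ctl member` output above.
members_json = """
{
  "members": [
    {"name": "pd-10.172.65.152-23792", "client_urls": ["http://10.172.65.152:23792"]},
    {"name": "pd-10.172.65.151-23795", "client_urls": ["http://10.172.65.151:23795"]},
    {"name": "pd-10.172.65.151-23792", "client_urls": ["http://10.172.65.151:23792"]},
    {"name": "pd-10.172.65.146-23792", "client_urls": ["http://10.172.65.146:23792"]}
  ]
}
"""

def members_per_host(raw: str) -> dict:
    """Group PD member names by the host of their first client URL."""
    data = json.loads(raw)
    by_host = defaultdict(list)
    for m in data["members"]:
        host = urlparse(m["client_urls"][0]).hostname
        by_host[host].append(m["name"])
    return dict(by_host)

hosts = members_per_host(members_json)
for host, names in sorted(hosts.items()):
    flag = "  <-- more than one PD on this host" if len(names) > 1 else ""
    print(host, names, flag)
```

Running this flags 10.172.65.151 as carrying two members, which matches h5n1's explanation below: one is the temporarily scaled-out replacement, the other is the original PD that won't start.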

| username: QH琉璃 | Original post link

Waiting for the expert’s answer.

| username: TiDBer_jYQINSnf | Original post link

It means the etcd client failed to reach the leader; the request timed out.

[2024/04/16 05:17:04.317 +08:00] [WARN] [etcdutil.go:117] ["kv gets too slow"] [request-key=/pd/7024798609142208243/leader] [cost=10.000469751s] [error="context deadline exceeded"]
[2024/04/16 05:17:04.317 +08:00] [ERROR] [etcdutil.go:122] ["load from etcd meet error"] [key=/pd/7024798609142208243/leader] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded"]
[2024/04/16 05:17:04.317 +08:00] [ERROR] [member.go:166] ["getting pd leader meets error"] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded"]

You should have checked the network at the time; otherwise, you can capture packets to see what is happening.
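As a rough first pass before packet capture, one can at least confirm TCP reachability of each PD client and peer port with a bounded timeout. A minimal sketch (the hosts and ports are the ones from this thread; `tcp_reachable` and `scan` are my own helpers, not TiDB tooling):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# PD endpoints from this thread (client port 23792, peer port 23802, plus
# the extra member on 151 at 23795/23805).
ENDPOINTS = [
    ("10.172.65.151", 23792), ("10.172.65.151", 23802),
    ("10.172.65.151", 23795), ("10.172.65.151", 23805),
    ("10.172.65.152", 23792), ("10.172.65.152", 23802),
    ("10.172.65.146", 23792), ("10.172.65.146", 23802),
]

def scan(endpoints):
    """Print the reachability of each endpoint."""
    for host, port in endpoints:
        status = "ok" if tcp_reachable(host, port) else "UNREACHABLE"
        print(f"{host}:{port} {status}")

# To run the scan against the real cluster, uncomment:
# scan(ENDPOINTS)
```

Note that plain TCP connectivity passing does not rule out the failure mode in the logs: "kv gets too slow" with a 10s cost points at slow or stalled responses, not necessarily refused connections, so packet capture or etcd latency metrics are still the more conclusive check.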

| username: ffeenn | Original post link

I also encountered this problem, and my situation was worse than yours. Two out of three PDs went down, and I had to restore the PD data from a backup. The main error was that the etcd data became inconsistent.

| username: h5n1 | Original post link

At that time, I didn't check the network. But if a network issue caused the downtime, why couldn't the 151 node be brought back up afterward?

| username: WalterWj | Original post link

That shouldn't be the case :thinking:. I see from the logs that fetching the member list isn't working, yet it works manually. Can you try calling this API by hand to see whether it returns anything:

2024/04/16 10:04:56.799 log.go:85: [warning] etcdserver: [could not get cluster response from http://10.172.65.146:23802: Get "http://10.172.65.146:23802/members": dial tcp 10.172.65.146:23802: connect: connection refused]
[2024/04/16 10:04:56.799 +08:00] [ERROR] [etcdutil.go:70] ["failed to get cluster from remote"] [error="[PD:etcd:ErrEtcdGetCluster]could not retrieve cluster information from the given URLs"]
[2024/04/16 10:04:58.823 +08:00] [INFO] [stream.go:250] ["set message encoder"] [from=ac4cab0742995b35] [to=ac4cab0742995b35] [stream-type="stream Message"]

curl http://10.172.65.146:23802/members

| username: h5n1 | Original post link

curl http://10.172.65.146:23802/members
[{"id":9357444448847789453,"peerURLs":["http://10.172.65.152:23802"],"name":"pd-10.172.65.152-23792","clientURLs":["http://10.172.65.152:23792"]},{"id":11468259817903508096,"peerURLs":["http://10.172.65.151:23805"],"name":"pd-10.172.65.151-23795","clientURLs":["http://10.172.65.151:23795"]},{"id":12415486320424082229,"peerURLs":["http://10.172.65.151:23802"],"name":"pd-10.172.65.151-23792","clientURLs":["http://10.172.65.151:23792"]},{"id":14508203407204661733,"peerURLs":["http://10.172.65.146:23802"],"name":"pd-10.172.65.146-23792","clientURLs":["http://10.172.65.146:23792"]}]
| username: WalterWj | Original post link

:thinking: That's a bit confusing. Logically, this part of PD's code path is no different from calling curl manually.

| username: h5n1 | Original post link

That timestamp should be around when I manually started the cluster. Initially it wouldn't start; I checked the PD processes (but not the one on node 151), then stopped the cluster and started it again, and it came up. After that, I added a PD. The original PD on 151:23792 still couldn't start. At 22:16 last night the PD logs started showing errors; before that, node 151 was the leader.

| username: h5n1 | Original post link

diag-btjh-gnZWsm3Qd4F.tar.gz (23.7 MB)

| username: oceanzhang | Original post link

Has it been resolved?

| username: oceanzhang | Original post link

It seems that running TiDB still requires purchasing the vendor's official support services.

| username: 数据库真NB | Original post link

Waiting for the latest reply from user H5N1.

| username: shigp_TIDBER | Original post link

Waiting for the experts to offer their advice…