After a TiDB Node Restarts, the Entire TiDB System Fails

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB一个节点重启后,整个TiDB故障

| username: TiDBer_G64jJ9u8

[TiDB Usage Environment] Kylin V10 SP1 arm, three-node deployment: 3 PD, 3 TiKV, 3 TiDB
[Overview] TiDB 6.5.0 is deployed on Kubernetes (K8S); the failure of a single PD pod left TiDB unable to provide service
[Background] One physical node restarted
[Phenomenon] All microservices failed to connect to the database
[Issue] TiDB unable to provide service
[Business Impact] System crash
[TiDB Version] 6.5.0
[Application Software and Version]
[TiDB Operator] 1.4.3
[K8S] 1.20.7
[Attachment]
/pd-server --data-dir=/var/lib/pd --name=basic-pd-2 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://basic-pd-2.basic-pd-peer.my-namespace.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://basic-pd-2.basic-pd-peer.my-namespace.svc:2379 --config=/etc/pd/pd.toml

[root@pf-test-2 ~]# kubectl -n my-namespace logs basic-pd-2 pd -f
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: basic-pd-2.basic-pd-peer.my-namespace.svc
Address 1: 100.87.244.178 basic-pd-2.basic-pd-peer.my-namespace.svc.cluster.local
nslookup domain basic-pd-2.basic-pd-peer.my-namespace.svc.svc success
starting pd-server …
/pd-server --data-dir=/var/lib/pd --name=basic-pd-2 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://basic-pd-2.basic-pd-peer.my-namespace.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://basic-pd-2.basic-pd-peer.my-namespace.svc:2379 --config=/etc/pd/pd.toml --join=http://basic-pd-1.basic-pd-peer.my-namespace.svc:2380,http://basic-pd-2.basic-pd-peer.my-namespace.svc:2380,http://basic-pd-0.basic-pd-peer.my-namespace.svc:2380
[2023/07/22 07:30:03.600 +00:00] [INFO] [util.go:41] [“Welcome to Placement Driver (PD)”]
[2023/07/22 07:30:03.601 +00:00] [INFO] [util.go:42] [PD] [release-version=v6.5.0]
[2023/07/22 07:30:03.601 +00:00] [INFO] [util.go:43] [PD] [edition=Community]
[2023/07/22 07:30:03.601 +00:00] [INFO] [util.go:44] [PD] [git-hash=d1a4433c3126c77fb2d5bb5720eefa0f2e05c166]
[2023/07/22 07:30:03.601 +00:00] [INFO] [util.go:45] [PD] [git-branch=heads/refs/tags/v6.5.0]
[2023/07/22 07:30:03.601 +00:00] [INFO] [util.go:46] [PD] [utc-build-time=“2022-12-05 01:43:11”]
[2023/07/22 07:30:03.601 +00:00] [INFO] [metricutil.go:83] [“disable Prometheus push client”]
[2023/07/22 07:30:03.601 +00:00] [INFO] [server.go:247] [“PD Config”] [config=“{"client-urls":"http://0.0.0.0:2379","peer-urls":"http://0.0.0.0:2380","advertise-client-urls":"http://basic-pd-2.basic-pd-peer.my-namespace.svc:2379","advertise-peer-urls":"http://basic-pd-2.basic-pd-peer.my-namespace.svc:2380","name":"basic-pd-2","data-dir":"/var/lib/pd","force-new-cluster":false,"enable-grpc-gateway":true,"initial-cluster":"basic-pd-1=http://basic-pd-1.basic-pd-peer.my-namespace.svc:2380,basic-pd-2=http://basic-pd-2.basic-pd-peer.my-namespace.svc:2380,basic-pd-0=http://basic-pd-0.basic-pd-peer.my-namespace.svc:2380","initial-cluster-state":"existing","initial-cluster-token":"pd-cluster","join":"http://basic-pd-1.basic-pd-peer.my-namespace.svc:2380,http://basic-pd-2.basic-pd-peer.my-namespace.svc:2380,http://basic-pd-0.basic-pd-peer.my-namespace.svc:2380","lease":3,"log":{"level":"info","format":"text","disable-timestamp":false,"file":{"filename":"","max-size":0,"max-days":0,"max-backups":0},"development":false,"disable-caller":false,"disable-stacktrace":false,"disable-error-verbose":true,"sampling":null,"error-output-path":""},"tso-save-interval":"3s","tso-update-physical-interval":"50ms","enable-local-tso":false,"metric":{"job":"basic-pd-2","address":"","interval":"15s"},"schedule":{"max-snapshot-count":64,"max-pending-peer-count":64,"max-merge-region-size":20,"max-merge-region-keys":0,"split-merge-interval":"1h0m0s","swtich-witness-interval":"1h0m0s","enable-one-way-merge":"false","enable-cross-table-merge":"true","patrol-region-interval":"10ms","max-store-down-time":"30m0s","max-store-preparing-time":"48h0m0s","leader-schedule-limit":4,"leader-schedule-policy":"count","region-schedule-limit":2048,"replica-schedule-limit":64,"merge-schedule-limit":8,"hot-region-schedule-limit":4,"hot-region-cache-hits-threshold":3,"store-limit":{},"tolerant-size-ratio":0,"low-space-ratio":0.8,"high-space-ratio":0.7,"region-score-formula-version":"v2","scheduler-max-waiting-operator":
5,"enable-remove-down-replica":"true","enable-replace-offline-replica":"true","enable-make-up-replica":"true","enable-remove-extra-replica":"true","enable-location-replacement":"true","enable-debug-metrics":"false","enable-joint-consensus":"true","enable-tikv-split-region":"true","schedulers-v2":[{"type":"balance-region","args":null,"disable":false,"args-payload":""},{"type":"balance-leader","args":null,"disable":false,"args-payload":""},{"type":"hot-region","args":null,"disable":false,"args-payload":""},{"type":"split-bucket","args":null,"disable":false,"args-payload":""}],"schedulers-payload":null,"store-limit-mode":"manual","hot-regions-write-interval":"10m0s","hot-regions-reserved-days":7,"enable-diagnostic":"false","enable-witness":"false"},"replication":{"max-replicas":3,"location-labels":"","strictly-match-label":"false","enable-placement-rules":"true","enable-placement-rules-cache":"false","isolation-level":""},"pd-server":{"use-region-storage":"true","max-gap-reset-ts":"24h0m0s","key-type":"table","runtime-services":"","metric-storage":"","dashboard-address":"auto","trace-region-flow":"true","flow-round-by-digit":3,"min-resolved-ts-persistence-interval":"1s"},"cluster-version":"0.0.0","labels":{},"quota-backend-bytes":"8GiB","auto-compaction-mode":"periodic","auto-compaction-retention-v2":"1h","TickInterval":"500ms","ElectionInterval":"3s","PreVote":true,"max-request-bytes":157286400,"security":{"cacert-path":"","cert-path":"","key-path":"","cert-allowed-cn":null,"SSLCABytes":null,"SSLCertBytes":null,"SSLKEYBytes":null,"redact-info-log":false,"encryption":{"data-encryption-method":"plaintext","data-key-rotation-period":"168h0m0s","master-key":{"type":"plaintext","key-id":"","region":"","endpoint":"","path":""}}},"label-property":null,"WarningMsgs":null,"DisableStrictReconfigCheck":false,"HeartbeatStreamBindInterval":"1m0s","LeaderPriorityCheckInterval":"1m0s","dashboard":{"tidb-cacert-path":"","tidb-cert-path":"","tidb-key-path":"","public-path-prefix":"","
internal-proxy":false,"enable-telemetry":true,"enable-experimental":false},"replication-mode":{"replication-mode":"majority","dr-auto-sync":{"label-key":"","primary":"","dr":"","primary-replicas":0,"dr-replicas":0,"wait-store-timeout":"1m0s","pause-region-split":"false"}}}”]
[2023/07/22 07:30:03.608 +00:00] [INFO] [server.go:222] [“register REST path”] [path=/pd/api/v1]
[2023/07/22 07:30:03.608 +00:00] [INFO] [server.go:222] [“register REST path”] [path=/pd/api/v2/]
[2023/07/22 07:30:03.608 +00:00] [INFO] [server.go:222] [“register REST path”] [path=/swagger/]
[2023/07/22 07:30:03.608 +00:00] [INFO] [server.go:222] [“register REST path”] [path=/autoscaling]
[2023/07/22 07:30:03.608 +00:00] [INFO] [distro.go:51] [“Using distribution strings”] [strings={}]
[2023/07/22 07:30:03.609 +00:00] [INFO] [server.go:222] [“register REST path”] [path=/dashboard/api/]
[2023/07/22 07:30:03.609 +00:00] [INFO] [server.go:222] [“register REST path”] [path=/dashboard/]
[2023/07/22 07:30:03.610 +00:00] [INFO] [etcd.go:117] [“configuring peer listeners”] [listen-peer-urls=“[http://0.0.0.0:2380]”]
[2023/07/22 07:30:03.610 +00:00] [INFO] [systimemon.go:28] [“start system time monitor”]
[2023/07/22 07:30:03.610 +00:00] [INFO] [etcd.go:127] [“configuring client listeners”] [listen-client-urls=“[http://0.0.0.0:2379]”]
[2023/07/22 07:30:03.610 +00:00] [INFO] [etcd.go:611] [“pprof is enabled”] [path=/debug/pprof]
[2023/07/22 07:30:03.610 +00:00] [INFO] [etcd.go:305] [“starting an etcd server”] [etcd-version=3.4.21] [git-sha=“Not provided (use ./build instead of go build)”] [go-version=go1.19.3] [go-os=linux] [go-arch=arm64] [max-cpu-set=64] [max-cpu-available=64] [member-initialized=true] [name=basic-pd-2] [data-dir=/var/lib/pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/var/lib/pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls=“[http://basic-pd-2.basic-pd-peer.my-namespace.svc:2380]”] [listen-peer-urls=“[http://0.0.0.0:2380]”] [advertise-client-urls=“[http://basic-pd-2.basic-pd-peer.my-namespace.svc:2379]”] [listen-client-urls=“[http://0.0.0.0:2379]”] [listen-metrics-urls=“”] [cors=“[]“] [host-whitelist=”[]”] [initial-cluster=] [initial-cluster-state=existing] [initial-cluster-token=] [quota-backend-bytes=8589934592] [max-request-bytes=157286400] [max-concurrent-streams=4294967295] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2023/07/22 07:30:03.610 +00:00] [WARN] [server.go:297] [“exceeded recommended request limit”] [max-request-bytes=157286400] [max-request-size=“157 MB”] [recommended-request-bytes=10485760] [recommended-request-size=“10 MB”]
2023-07-22 07:30:03.610906 W | pkg/fileutil: check file permission: directory “/var/lib/pd” exist, but the permission is “drwxr-xr-x”. The recommended permission is “-rwx------” to prevent possible unprivileged access to the data.
[2023/07/22 07:30:03.633 +00:00] [INFO] [backend.go:80] [“opened backend db”] [path=/var/lib/pd/member/snap/db] [took=22.401135ms]
[2023/07/22 07:30:04.558 +00:00] [INFO] [server.go:462] [“recovered v2 store from snapshot”] [snapshot-index=500005] [snapshot-size=“9.8 kB”]
[2023/07/22 07:30:04.559 +00:00] [INFO] [kvstore.go:388] [“restored last compact revision”] [meta-bucket-name=meta] [meta-bucket-name-key=finishedCompactRev] [restored-compact-revision=472722]
[2023/07/22 07:30:04.603 +00:00] [INFO] [server.go:480] [“recovered v3 backend from snapshot”] [backend-size-bytes=9961472] [backend-size=“10 MB”] [backend-size-in-use-bytes=6422528] [backend-size-in-use=“6.4 MB”]
[2023/07/22 07:30:04.706 +00:00] [INFO] [raft.go:586] [“restarting local member”] [cluster-id=95d1ea70524eb4dc] [local-member-id=97f207d1729a6d18] [commit-index=500214]
[2023/07/22 07:30:04.706 +00:00] [INFO] [raft.go:1523] [“97f207d1729a6d18 switched to configuration voters=(6347619308011026086 10948822240243379480 12637012018295520321)”]
[2023/07/22 07:30:04.706 +00:00] [INFO] [raft.go:706] [“97f207d1729a6d18 became follower at term 4”]
[2023/07/22 07:30:04.706 +00:00] [INFO] [raft.go:389] [“newRaft 97f207d1729a6d18 [peers: [58174621276876a6,97f207d1729a6d18,af5faf7a14d8bc41], term: 4, commit: 500214, applied: 500005, lastindex: 500215, lastterm: 4]”]
[2023/07/22 07:30:04.707 +00:00] [INFO] [capability.go:76] [“enabled capabilities for version”] [cluster-version=3.4]
[2023/07/22 07:30:04.707 +00:00] [INFO] [cluster.go:256] [“recovered/added member from store”] [cluster-id=95d1ea70524eb4dc] [local-member-id=97f207d1729a6d18] [recovered-remote-peer-id=58174621276876a6] [recovered-remote-peer-urls=“[http://basic-pd-1.basic-pd-peer.my-namespace.svc:2380]”]
[2023/07/22 07:30:04.707 +00:00] [INFO] [cluster.go:256] [“recovered/added member from store”] [cluster-id=95d1ea70524eb4dc] [local-member-id=97f207d1729a6d18] [recovered-remote-peer-id=97f207d1729a6d18] [recovered-remote-peer-urls=“[http://basic-pd-2.basic-pd-peer.my-namespace.svc:2380]”]
[2023/07/22 07:30:04.707 +00:00] [INFO] [cluster.go:256] [“recovered/added member from store”] [cluster

| username: tidb菜鸟一只

Is it just one PD that is abnormal, or are all of them abnormal?

| username: TiDBer_G64jJ9u8

Only one PD is abnormal.

| username: ljluestc

Based on the logs provided, the PD server hit an error that it could not recover from on its own. The server reports that a member has been permanently removed from the cluster and that the data directory used by that member needs to be deleted. This suggests that the PD cluster cannot maintain consensus after losing that member, which is likely why the TiDB cluster is unavailable.
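
For context on the consensus point: PD embeds etcd (the logs above show etcd 3.4.21), which is Raft-based and needs a majority of members to stay available. The following is only a minimal sketch of that arithmetic, with hypothetical helper names `quorum_size` and `has_quorum` that are not part of any PD tooling. It suggests that a three-member PD cluster would normally be expected to tolerate the loss of one member, so a single failed PD pod should not by itself explain a full outage:

```shell
#!/bin/sh
# Illustrative only: majority arithmetic for a Raft/etcd-style member group.
# quorum_size and has_quorum are hypothetical helper names, not PD tooling.

# Print the number of members that must be healthy for a group of size $1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed (exit 0) if $2 healthy members out of $1 total still form a majority.
has_quorum() {
    [ "$2" -ge "$(quorum_size "$1")" ]
}

# Example: the 3-PD deployment from this thread with one member down.
if has_quorum 3 2; then
    echo "3 members, 2 healthy: quorum held"
else
    echo "3 members, 2 healthy: quorum lost"
fi
```

For three members the majority is two, so the example prints that quorum is held. If the cluster still became unavailable, the remaining two members are worth checking as well.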

Here are some steps that can be taken to investigate and resolve the issue:

  1. Check the status of the PD cluster: Verify that all PD members are running and healthy. You can use the following commands to list the PD members and their health:
     kubectl -n my-namespace exec basic-pd-2 -- pd-ctl -u http://127.0.0.1:2379 member
     kubectl -n my-namespace exec basic-pd-2 -- pd-ctl -u http://127.0.0.1:2379 health
  2. Check the logs: Examine the logs of all PD pods (basic-pd-0, basic-pd-1, and basic-pd-2) for more detailed information about errors or failures. You can use the following commands to view the logs:
     kubectl -n my-namespace logs basic-pd-0 -c pd
     kubectl -n my-namespace logs basic-pd-1 -c pd
     kubectl -n my-namespace logs basic-pd-2 -c pd
  3. Investigate node restarts: Look into events and logs related to the physical node restart. Check for any signs of hardware issues or errors encountered by the PD pod on that node during the restart.
  4. Check K8S events: Use the following command to view events related to the PD pods and TiDB Operator:
     kubectl -n my-namespace get events --sort-by='.metadata.creationTimestamp'
  5. Handle missing members: If a member is indeed missing and cannot be recovered automatically, you may need to manually remove the unavailable member from the PD cluster and restore cluster quorum. This process involves removing the unavailable member and starting a new cluster with the remaining healthy members. Do this with caution, and back up your data before proceeding.
  6. Verify configuration: Ensure that the PD cluster configuration is correct and that every PD instance points to the correct endpoints of the others. Pay particular attention to the join parameter and make sure it points to the correct URLs.
  7. Monitor hardware resources: Check the hardware resources (CPU, memory, disk, etc.) on the physical nodes hosting the PD pods and verify whether any resource-related issue might be causing the failure.
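
The read-only checks from the steps above can be gathered into one helper. This is a rough sketch, not an official procedure: `collect_pd_diagnostics` is a hypothetical function name, the namespace and pod names are simply those used in this thread, and the `/pd-ctl` path inside the `pd` container is an assumption (the logs show `/pd-server` at the image root), so override `PD_CTL` if your image differs.

```shell
#!/bin/sh
# Sketch of the read-only checks listed above; nothing here modifies the cluster.
# collect_pd_diagnostics is a hypothetical helper name.
# Assumption: pd-ctl is available at /pd-ctl inside the "pd" container.

NAMESPACE="${NAMESPACE:-my-namespace}"
PD_PODS="${PD_PODS:-basic-pd-0 basic-pd-1 basic-pd-2}"
PD_CTL="${PD_CTL:-/pd-ctl}"

collect_pd_diagnostics() {
    # Membership and health as seen by the first PD pod in the list.
    first_pod="${PD_PODS%% *}"
    kubectl -n "$NAMESPACE" exec "$first_pod" -c pd -- "$PD_CTL" -u http://127.0.0.1:2379 member
    kubectl -n "$NAMESPACE" exec "$first_pod" -c pd -- "$PD_CTL" -u http://127.0.0.1:2379 health

    # Last 200 log lines from each PD pod.
    for pod in $PD_PODS; do
        kubectl -n "$NAMESPACE" logs "$pod" -c pd --tail=200
    done

    # Events in the namespace, sorted by creation time.
    kubectl -n "$NAMESPACE" get events --sort-by='.metadata.creationTimestamp'
}
```

After sourcing the snippet, something like `collect_pd_diagnostics > pd-diag.txt 2>&1` would collect the output into a single file. If the first pod in `PD_PODS` is the faulty one, put a healthy pod first so the `member` and `health` queries can succeed.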

| username: redgame

The PD member was kicked out of the cluster; re-add it.

| username: TiDBer_G64jJ9u8

This is the PD log from another node.
The original environment is no longer intact, because we restored service by deleting the faulty PD's data and restarting it.
However, a cluster must stay highly reliable and keep running when any single node fails; that is a basic requirement. The root cause therefore still needs further analysis.
lgg_pd_log (5.1 MB)
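
For readers who hit the same situation, the kind of recovery described above (discarding the faulty PD's data and letting it rejoin) might look roughly like the following dry-run sketch. It only prints commands and runs nothing. The exact steps used in this thread were not posted, so everything here is an assumption: `print_pd_rebuild_plan` is a hypothetical helper, the PVC name `pd-<pod>` assumes StatefulSet naming with a volume claim template called `pd`, `/pd-ctl` is an assumed path inside the container, and the remaining PD members are assumed to still hold quorum.

```shell
#!/bin/sh
# Dry-run sketch: prints commands for review and executes nothing.
# print_pd_rebuild_plan is a hypothetical helper name.
# Assumptions: the PVC is named pd-<pod>, pd-ctl lives at /pd-ctl, and the
# remaining PD members still hold quorum.

# Usage: print_pd_rebuild_plan <namespace> <faulty-pod> <healthy-pod>
print_pd_rebuild_plan() {
    ns="$1"; faulty="$2"; healthy="$3"
    # 1. Remove the faulty member from PD membership via a healthy pod.
    echo "kubectl -n $ns exec $healthy -c pd -- /pd-ctl -u http://127.0.0.1:2379 member delete name $faulty"
    # 2. Delete its persistent volume claim so the stale data directory is discarded.
    echo "kubectl -n $ns delete pvc pd-$faulty --wait=false"
    # 3. Delete the pod so that it is recreated with empty data and rejoins.
    echo "kubectl -n $ns delete pod $faulty"
}

print_pd_rebuild_plan my-namespace basic-pd-2 basic-pd-0
```

Printing the plan rather than executing it leaves room to review each destructive step and to take a backup before running anything by hand.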