PD Cluster Crashed Again Within a Month

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 一月之内pd集群再次挂掉

| username: wfxxh

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.3
[Encountered Problem: Phenomenon and Impact] The PD cluster crashed twice within a month
[Resource Configuration] (128 GB RAM + 40 cores + 2 TB NVMe) × 5 nodes

Figure 1:

Figure 2:

Figure 3:

Figure 4:

Figure 5:

Figure 6:

pd_63.log (2.6 KB)
pd_62.log (732.5 KB)
pd_61.log (2.6 KB)

The PD crashed twice within a month, making the entire cluster inaccessible.

| username: guobayang176737fb09164af8 | Original post link

The original poster is a colleague of mine; I will add the timeline we have observed so far.

  1. 2023/01/14 00:09:54

PD followers 61 and 63 repeatedly logged the following warning, more than ten times in a row at roughly once per second:

[2023/01/14 00:09:54.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:55.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:56.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:57.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:58.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:09:59.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
[2023/01/14 00:10:00.890 +08:00] [WARN] [manager.go:102] ["failed to reload persist options"]
...
  2. 2023/01/14 00:14:00
    On the PD leader's monitoring, gRPC request wait time and TSO handling time exceeded 10 seconds, and the TSO rate suddenly jumped to 500 QPS (it was 0 before)


Partial logs are as follows:

[2023/01/14 00:11:50.857 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=ticdc] [expire-at=1673712710] [safepoint=438731094848176136]
[2023/01/14 00:12:50.860 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=ticdc] [expire-at=1673712770] [safepoint=438731110589661194]
[2023/01/14 00:13:51.010 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=ticdc] [expire-at=1673712830] [safepoint=438731125492547608]
[2023/01/14 00:13:54.535 +08:00] [INFO] [periodic.go:135] ["starting auto periodic compaction"] [revision=39749914] [compact-period=1h0m0s]
[2023/01/14 00:13:54.536 +08:00] [INFO] [index.go:189] ["compact tree index"] [revision=39749914]
[2023/01/14 00:13:54.537 +08:00] [INFO] [periodic.go:146] ["completed auto periodic compaction"] [revision=39749914] [compact-period=1h0m0s] [took=1h0m0.004980515s]
[2023/01/14 00:13:55.392 +08:00] [INFO] [kvstore_compaction.go:55] ["finished scheduled compaction"] [compact-revision=39749914] [took=855.063741ms]
[2023/01/14 00:14:51.209 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=ticdc] [expire-at=1673712891] [safepoint=438731142073417748]
[2023/01/14 00:14:55.959 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=438708494902493184]
[2023/01/14 00:16:38.913 +08:00] [INFO] [grpc_service.go:1287] ["updated gc safe point"] [safe-point=438708494902493184]
[2023/01/14 00:20:22.006 +08:00] [INFO] [operator_controller.go:440] ["add operator"] [region-id=46845377] [operator="\"transfer-hot-read-leader {transfer leader: store 5 to 4} (kind:hot-region,leader, region:46845377(20817,41987), createAt:2023-01-14 00:20:22.006599095 +0800 CST m=+2034391.895374284, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 5 to store 4])\""] [additional-info=]
[2023/01/14 00:20:22.006 +08:00] [INFO] [operator_controller.go:638] ["send schedule command"] [region-id=46845377] [step="transfer leader from store 5 to store 4"] [source=create]
[2023/01/14 00:20:22.008 +08:00] [INFO] [region.go:549] ["leader changed"] [region-id=46845377] [from=5] [to=4]
[2023/01/14 00:20:22.008 +08:00] [INFO] [operator_controller.go:555] ["operator finish"] [region-id=46845377] [takes=1.586244ms] [operator="\"transfer-hot-read-leader {transfer leader: store 5 to 4} (kind:hot-region,leader, region:46845377(20817,41987), createAt:2023-01-14 00:20:22.006599095 +0800 CST m=+2034391.895374284, startAt:2023-01-14 00:20:22.006833891 +0800 CST m=+2034391.895609077, currentStep:1, steps:[transfer leader from store 5 to store 4]) finished\""] [additional-info=]
[2023/01/14 00:21:46.234 +08:00] [INFO] [operator_controller.go:440] ["add operator"] [region-id=46975492] [operator="\"transfer-hot-write-leader {transfer leader: store 4 to 7} (kind:hot-region,leader, region:46975492(15735,3806), createAt:2023-01-14 00:21:46.234616792 +0800 CST m=+2034476.123391980, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 4 to store 7])\""] [additional-info=]
[2023/01/14 00:21:46.234 +08:00] [INFO] [operator_controller.go:638] ["send schedule command"] [region-id=46975492] [step="transfer leader from store 4 to store 7"] [source=create]
[2023/01/14 00:21:46.237 +08:00] [INFO] [region.go:549] ["leader changed"] [region-id=46975492] [from=4] [to=7]
[2023/01/14 00:21:46.237 +08:00] [INFO] [operator_controller.go:555] ["operator finish"] [region-id=46975492] [takes=2.933989ms] [operator="\"transfer-hot-write-leader {transfer leader: store 4 to 7} (kind:hot-region,leader, region:46975492(15735,3806), createAt:2023-01-14 00:21:46.234616792 +0800 CST m=+2034476.123391980, startAt:2023-01-14 00:21:46.234828062 +0800 CST m=+2034476.123603264, currentStep:1, steps:[transfer leader from store 4 to store 7]) finished\""] [additional-info=]
[2023/01/14 00:24:55.935 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=438708652189155328]
[2023/01/14 00:26:38.864 +08:00] [INFO] [grpc_service.go:1287] ["updated gc safe point"] [safe-point=438708652189155328]
[2023/01/14 00:34:55.959 +08:00] [INFO] [grpc_service.go:1353] ["update service GC safe point"] [service-id=gc_worker] [expire-at=9223372036854775807] [safepoint=438708809475293184]
[2023/01/14 00:36:38.860 +08:00] [INFO] [grpc_service.go:1287] ["updated gc safe point"] [safe-point=438708809475293184]
[2023/01/14 00:39:22.613 +08:00] [WARN] [allocator_manager.go:608] ["failed to update allocator's timestamp"] [dc-location=global] [error="[PD:etcd:ErrEtcdTxnConflict]etcd transaction failed, conflicted and rolled back"]
[2023/01/14 00:39:22.613 +08:00] [INFO] [tso.go:406] ["reset the timestamp in memory"]
[2023/01/14 00:39:22.614 +08:00] [INFO] [lease.go:153] ["stop lease keep alive worker"] [purpose="pd leader election"]
[2023/01/14 00:39:22.643 +08:00] [INFO] [server.go:1319] ["no longer a leader because lease has expired, pd leader will step down"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:110] ["patrol regions has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:799] ["scheduler has been stopped"] [scheduler-name=balance-hot-region-scheduler] [error="context canceled"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:285] ["drive push operator has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [cluster.go:331] ["metrics are reset"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [server.go:119] ["region syncer has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:799] ["scheduler has been stopped"] [scheduler-name=balance-leader-scheduler] [error="context canceled"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [cluster.go:333] ["background jobs has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:220] ["check suspect key ranges has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [cluster.go:353] ["statistics background jobs has been stopped"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [cluster.go:375] ["coordinator is stopping"]
[2023/01/14 00:39:22.644 +08:00] [INFO] [coordinator.go:799] ["scheduler has been stopped"] [scheduler-name=balance-region-scheduler] [error="context canceled"]
[2023/01/14 00:39:22.645 +08:00] [INFO] [cluster.go:371] ["coordinator has been stopped"]
[2023/01/14 00:39:22.645 +08:00] [INFO] [cluster.go:404] ["raftcluster is stopped"]
[2023/01/14 00:39:22.645 +08:00] [INFO] [tso.go:406] ["reset the timestamp in memory"]
[2023/01/14 00:39:22.646 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]
[2023/01/14 00:39:22.847 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]
[2023/01/14 00:39:23.048 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]
[2023/01/14 00:39:23.249 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]
[2023/01/14 00:39:23.449 +08:00] [ERROR] [member.go:167] ["getting pd leader meets error"] [error="[PD:proto:ErrProtoUnmarshal]proto: illegal wireType 7"]

  3. 2023/01/14 00:15:25
    TiCDC (which replicates data to the standby TiDB cluster) reported an error, as shown in the figure below
| username: xfworld | Original post link

From the submitted logs, the PD cluster has completely lost connectivity, which basically means the distributed system has failed…

For troubleshooting, there are a few directions to consider:

  1. Resource bottlenecks such as CPU, memory, and disk I/O (your charts already show no CPU or memory bottleneck, so check disk I/O)

  2. Network bandwidth bottlenecks (the hardware configuration is described above, but what is the network bandwidth between the cluster nodes?)

  3. Network latency between nodes (check the PD nodes' gRPC-related network metrics in Grafana; a quick external probe is also sketched right after this list)
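
To supplement the Grafana checks, a quick external probe of each PD member can confirm basic reachability and round-trip latency from outside the cluster. Below is a minimal Go sketch, assuming three PD client URLs on port 2379 (the addresses are placeholders; adjust them to your deployment). It calls PD's /pd/api/v1/health endpoint and prints how long each node takes to answer:

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Placeholder PD client URLs -- replace with your own deployment's addresses.
	endpoints := []string{
		"http://10.0.0.61:2379",
		"http://10.0.0.62:2379",
		"http://10.0.0.63:2379",
	}
	client := &http.Client{Timeout: 3 * time.Second}
	for _, ep := range endpoints {
		start := time.Now()
		resp, err := client.Get(ep + "/pd/api/v1/health")
		elapsed := time.Since(start)
		if err != nil {
			fmt.Printf("%s unreachable after %v: %v\n", ep, elapsed, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s responded in %v: %s\n", ep, elapsed, body)
	}
}

If one member consistently times out here while the others answer quickly, that narrows the problem down to that node's host or its network path.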

I would look at a network partition or disconnection first…
Judging from the provided logs, there has been no leader since 00:39…

[2023/01/14 00:39:24.694 +08:00] [ERROR] [middleware.go:102] ["redirect but server is not leader"] [from=pd_log63] [server=pd_log62] [error="[PD:apiutil:ErrRedirect]redirect failed"]
[2023/01/14 00:39:24.695 +08:00] [ERROR] [middleware.go:102] ["redirect but server is not leader"] [from=pd_log61] [server=pd_log62] [error="[PD:apiutil:ErrRedirect]redirect failed"]

From the above, it is clear that all PDs are completely disconnected…


Additionally, consider etcd's default configuration:
etcd's default request timeout is about 7 seconds (5 seconds of allowed disk I/O latency plus 2 × 1 second election timeout). If a request exceeds that without returning a result, you will run into the familiar "etcdserver: request timed out" error.
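
As a rough way to see whether PD's embedded etcd is responding within that budget, here is a minimal Go sketch using the go.etcd.io/etcd/client/v3 package (the endpoints are placeholders). It times a Status call against each member; responses that approach several seconds would point to disk or network pressure on that node:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoints -- PD embeds etcd and serves it on the client URL.
	endpoints := []string{"10.0.0.61:2379", "10.0.0.62:2379", "10.0.0.63:2379"}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 7*time.Second)
		start := time.Now()
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("%s: status failed after %v: %v\n", ep, time.Since(start), err)
			continue
		}
		fmt.Printf("%s: leader=%x raftTerm=%d took=%v\n", ep, st.Leader, st.RaftTerm, time.Since(start))
	}
}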

| username: wfxxh | Original post link

Disk I/O and network are shown in the system info panels of the second figure; neither stays high for any sustained period.

| username: wfxxh | Original post link

Figure 3 (or this image) shows that etcd did not experience high latency either.

| username: guobayang176737fb09164af8 | Original post link

Following Biaomei's article on the causes of slow TSO queries, we checked: there is no sign of slow TSO requests, only of clients being unable to connect to PD, which made the request count drop sharply to 0.
We also confirmed there were no network problems during this period, but at 00:15 the number of TCP connections dropped (did TiCDC crash first?).

However, looking at TiDB's requests to PD, TSO requests were still being processed continuously until 00:39.
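
For what it's worth, a client-side probe can help tell slow TSO handling on PD apart from a client that simply cannot reach PD. Here is a minimal Go sketch using the github.com/tikv/pd/client package (the PD addresses are placeholders) that requests a TSO once per second and prints the observed latency; running it from the TiDB or TiCDC hosts during an incident shows whether GetTS is slow, failing, or fine from the client's point of view:

package main

import (
	"context"
	"fmt"
	"time"

	pd "github.com/tikv/pd/client"
)

func main() {
	// Placeholder PD addresses -- replace with your cluster's PD client URLs.
	cli, err := pd.NewClient([]string{"10.0.0.61:2379", "10.0.0.62:2379", "10.0.0.63:2379"}, pd.SecurityOption{})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for i := 0; i < 10; i++ {
		start := time.Now()
		physical, logical, err := cli.GetTS(context.Background())
		if err != nil {
			fmt.Printf("GetTS failed after %v: %v\n", time.Since(start), err)
		} else {
			fmt.Printf("TSO %d/%d obtained in %v\n", physical, logical, time.Since(start))
		}
		time.Sleep(time.Second)
	}
}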

| username: xfworld | Original post link

Without the "brain" (PD) available, the other components are all likely to run into problems :rofl:

This is consistent with the final result…

TiCDC also depends on PD, for example for recording version information and obtaining TSOs…