PD takes too long during "coordinator is stopping," exceeding 24 hours, causing QPS to drop to zero

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: PD在进行coordinator is stopping时耗时过长,超过24小时,导致QPS跌零

| username: in-han

【TiDB Usage Environment】Production Environment
【TiDB Version】v5.3.0
【Encountered Problem】After the PD leader lease expired, the "coordinator is stopping" step took far too long: it took over 24 hours to reach "coordinator has been stopped", and only then did the cluster return to normal.
【Reproduction Path】
Cluster scale: 2000 tikv-server instances; 200k regions; 3 PD nodes.
【Problem Phenomenon and Impact】
The phenomenon appeared after the lease expired. During that period, CPU idle dropped from 80% to 30%, with no significant change in memory or disk I/O.

Logs:
[kvstore@ip-xxx log]$ grep --color coordinator pd.log
[2022/06/19 01:31:04.686 +08:00] [INFO] [cluster.go:372] ["coordinator is stopping"]
[2022/06/19 01:31:04.686 +08:00] [INFO] [coordinator.go:285] ["drive push operator has been stopped"]
[2022/06/19 01:31:04.686 +08:00] [INFO] [coordinator.go:220] ["check suspect key ranges has been stopped"]
[2022/06/19 01:39:51.198 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-leader-scheduler] [error="context canceled"]
[2022/06/19 02:41:52.496 +08:00] [INFO] [coordinator.go:110] ["patrol regions has been stopped"]
[2022/06/19 16:17:36.003 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-hot-region-scheduler] [error="context canceled"]
[2022/06/20 08:42:31.097 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-region-scheduler] [error="context canceled"]
[2022/06/20 08:42:31.097 +08:00] [INFO] [cluster.go:368] ["coordinator has been stopped"]
[2022/06/20 08:42:33.496 +08:00] [INFO] [coordinator.go:296] ["coordinator starts to collect cluster information"]
[2022/06/20 08:47:33.497 +08:00] [INFO] [coordinator.go:299] ["coordinator has finished cluster information preparation"]
[2022/06/20 08:47:33.497 +08:00] [INFO] [coordinator.go:309] ["coordinator starts to run schedulers"]
[2022/06/20 08:47:33.498 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-hot-region-scheduler]
[2022/06/20 08:47:33.499 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-leader-scheduler]
[2022/06/20 08:47:33.500 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-region-scheduler]
[2022/06/20 08:47:33.500 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-region-scheduler] [scheduler-args=""]
[2022/06/20 08:47:33.501 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-leader-scheduler] [scheduler-args=""]
[2022/06/20 08:47:33.502 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-hot-region-scheduler] [scheduler-args=""]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:279] ["coordinator begins to actively drive push operator"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:102] ["coordinator starts patrol regions"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:214] ["coordinator begins to check suspect key ranges"]

Update:
The network is normal, and the total number of regions in the cluster is 200k.
The cluster is deployed on HDDs, so the PD leader lease expires easily. After adjusting the lease to 5s, PD no longer switches leaders frequently.
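For reference, one way to make that lease change on a TiUP-managed cluster is sketched below; this is an illustrative example rather than something from the original thread, `<cluster-name>` is a placeholder, and `lease` is the PD leader-lease timeout in seconds (default 3):

```yaml
# tiup cluster edit-config <cluster-name>    (open the topology for editing)
server_configs:
  pd:
    lease: 5    # PD leader lease timeout, in seconds (default: 3)
# tiup cluster reload <cluster-name> -R pd   (restart the PD nodes to apply it)
```

Raising the lease only masks the slow-disk symptom; the replies below recommend fixing the storage instead.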

The main issue being reported is that stopping the coordinator takes too long.

Analysis reveals two main reasons:

  1. During the failure, the PD instance consumed a very large amount of CPU (about 50 cores), so the schedulers ran slowly and could not finish in time.
  2. The schedulers' computational complexity is too high: 2000 (TiKV nodes) * 10 (retries) * 2000 (TiKV nodes) * 5 (filters) is nearly 200 million evaluations. Even under normal machine load, a single scheduling round takes tens of minutes. That is far too long, and until the round ends PD remains leaderless, which affects client access.
| username: yilong | Original post link

  1. You can check if the network was stable during that time.
  2. Does a single TiKV have 200,000 regions? This number is a bit too high. You might want to consider expanding TiKV to distribute the load.
| username: in-han | Original post link

The network is normal, and the total number of regions in the whole cluster is 200,000. The cluster is deployed on HDDs, so the PD leader lease tends to expire easily. After adjusting the lease to 5 seconds, PD no longer switches leaders frequently.

The reported issue is mainly that stopping the coordinator takes too long.

Upon analysis, there are two main reasons:

  1. During the failure, the PD instance consumed a very large amount of CPU (about 50 cores), causing the schedulers to run slowly and preventing them from finishing in time.
  2. The computational complexity of the schedulers is too high, with nearly 200 million evaluations: 2000 (TiKV nodes) * 10 (retries) * 2000 (TiKV nodes) * 5 (filters). Even under normal machine load, a single scheduling round takes tens of minutes. That is far too long, and until the round ends PD remains leaderless, which affects client access (a simplified sketch of this loop follows below).
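To make the second point concrete, here is a deliberately simplified Go sketch of such a scheduling round. It is not PD's actual code (the names, structure, and filter logic are invented for illustration), but it shows where the 2000 * 10 * 2000 * 5 cost estimate comes from and why a cancellation issued while the coordinator is stopping only takes effect once the whole round has finished.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical model of one balance-scheduler round, sized like the cluster
// in this thread. Not PD's real implementation.
const (
	stores  = 2000 // tikv-server instances
	retries = 10   // retry attempts per source store
	filters = 5    // filters evaluated per candidate target
)

// passesFilter stands in for one placement/state check on a candidate target.
func passesFilter(source, target, filter int) bool {
	return (source+target+filter)%7 != 0 // dummy work
}

// runOneRound checks the context only once, at the start of the round, then
// walks every source store, retry, candidate target, and filter. A cancel
// issued mid-round (as when the coordinator stops) is therefore ignored until
// the whole round completes -- mirroring the hours-long stop in the logs.
func runOneRound(ctx context.Context) int {
	if ctx.Err() != nil { // the only cancellation check: the round boundary
		return 0
	}
	evaluated := 0
	for src := 0; src < stores; src++ {
		for r := 0; r < retries; r++ {
			for dst := 0; dst < stores; dst++ {
				for f := 0; f < filters; f++ {
					evaluated++
					_ = passesFilter(src, dst, f)
				}
			}
		}
	}
	return evaluated // 2000 * 10 * 2000 * 5 = 200,000,000 evaluations
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	// Simulate the coordinator asking the scheduler to stop shortly after
	// the round has started.
	go func() { time.Sleep(10 * time.Millisecond); cancel() }()

	start := time.Now()
	n := runOneRound(ctx)
	fmt.Printf("evaluated %d candidates in %v despite the early cancel\n",
		n, time.Since(start))
	// Checking ctx.Err() per source store (or per candidate) instead would
	// let the round exit promptly when the coordinator shuts down.
}
```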
| username: Min_Chen | Original post link

It is still recommended to put PD on SSDs, at least SATA SSDs. Raft is inherently a latency-sensitive protocol, and stretching its timeouts too far defeats its purpose.
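As a rough way to check whether a given disk can keep up with PD's embedded etcd, one option (borrowed from etcd's general disk-benchmarking guidance, not something mentioned in this thread) is to measure fdatasync latency with fio on the PD data disk and look at the sync latency percentiles; etcd's rule of thumb is that the 99th percentile should stay around 10 ms or lower:

```shell
# Hypothetical check, run on the filesystem that holds the PD data directory.
mkdir -p ./pd-disk-check && cd ./pd-disk-check
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=. --size=22m --bs=2300 --name=pd-disk-check
# Inspect the fsync/fdatasync latency percentiles in the output.
```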

| username: in-han | Original post link

Understood. This cluster is mainly used for archival data storage and has no read-performance requirements. Because the cluster has a very large number of tikv-server instances, we have observed that PD scheduling is very time-consuming, and it cannot exit midway when the coordinator is being shut down.

| username: system | Original post link

This topic was automatically closed 1 minute after the last reply. No new replies are allowed.