Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: "PD takes too long at 'coordinator is stopping' (over 24 hours), causing QPS to drop to zero"
【TiDB Usage Environment】Production Environment
【TiDB Version】v5.3.0
【Encountered Problem】The PD leader lease expired, and the "coordinator is stopping" step then took far too long: more than 24 hours (roughly 31 hours in the logs below) passed before "coordinator has been stopped" was logged, after which the cluster returned to normal.
【Reproduction Path】
Cluster scale: 2000 tikv-server instances; 200k regions; 3 PD nodes;
【Problem Phenomenon and Impact】
The phenomenon occurs after the PD leader lease expires;
During this period, CPU idle dropped from 80% to 30%, with no significant change in memory usage or disk I/O;
Logs:
[kvstore@ip-xxx log]$ grep --color coordinator pd.log
[2022/06/19 01:31:04.686 +08:00] [INFO] [cluster.go:372] ["coordinator is stopping"]
[2022/06/19 01:31:04.686 +08:00] [INFO] [coordinator.go:285] ["drive push operator has been stopped"]
[2022/06/19 01:31:04.686 +08:00] [INFO] [coordinator.go:220] ["check suspect key ranges has been stopped"]
[2022/06/19 01:39:51.198 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-leader-scheduler] [error="context canceled"]
[2022/06/19 02:41:52.496 +08:00] [INFO] [coordinator.go:110] ["patrol regions has been stopped"]
[2022/06/19 16:17:36.003 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-hot-region-scheduler] [error="context canceled"]
[2022/06/20 08:42:31.097 +08:00] [INFO] [coordinator.go:796] ["scheduler has been stopped"] [scheduler-name=balance-region-scheduler] [error="context canceled"]
[2022/06/20 08:42:31.097 +08:00] [INFO] [cluster.go:368] ["coordinator has been stopped"]
[2022/06/20 08:42:33.496 +08:00] [INFO] [coordinator.go:296] ["coordinator starts to collect cluster information"]
[2022/06/20 08:47:33.497 +08:00] [INFO] [coordinator.go:299] ["coordinator has finished cluster information preparation"]
[2022/06/20 08:47:33.497 +08:00] [INFO] [coordinator.go:309] ["coordinator starts to run schedulers"]
[2022/06/20 08:47:33.498 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-hot-region-scheduler]
[2022/06/20 08:47:33.499 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-leader-scheduler]
[2022/06/20 08:47:33.500 +08:00] [INFO] [coordinator.go:357] ["create scheduler with independent configuration"] [scheduler-name=balance-region-scheduler]
[2022/06/20 08:47:33.500 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-region-scheduler] [scheduler-args=""]
[2022/06/20 08:47:33.501 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-leader-scheduler] [scheduler-args=""]
[2022/06/20 08:47:33.502 +08:00] [INFO] [coordinator.go:379] ["create scheduler"] [scheduler-name=balance-hot-region-scheduler] [scheduler-args=""]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:279] ["coordinator begins to actively drive push operator"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:102] ["coordinator starts patrol regions"]
[2022/06/20 08:47:33.505 +08:00] [INFO] [coordinator.go:214] ["coordinator begins to check suspect key ranges"]
Update:
The network is normal; the total number of Regions in the cluster is 200k.
The cluster is deployed on HDDs, so the PD leader lease expires easily. After increasing the lease to 5s (see the config sketch below), PD no longer switches leaders frequently.
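For reference, that lease change is a single key in the PD configuration file. A minimal sketch (only the relevant key is shown; as far as I know the value is read at startup, so it generally takes effect only after the PD servers are restarted or reloaded):

```toml
# pd.toml -- minimal sketch, only the relevant key shown.
# `lease` is the timeout (in seconds) of the PD leader key lease; the default is 3.
# A larger value tolerates slow disks (HDD) better and reduces spurious
# leader switches, at the cost of slower failover when PD really is down.
lease = 5
```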
The main issue reported here, however, is that stopping the coordinator takes too long.
Analysis reveals two main reasons:
- During the failure, the PD instance was consuming a large amount of CPU (around 50 cores), so the schedulers ran slowly and could not exit in time;
- The schedulers' computation is too expensive: 2000 (TiKV stores) × 10 (retries) × 2000 (candidate stores) × 5 (filters) = 200 million evaluations per scheduling round. Under normal machine load a single round already takes tens of minutes, which is far too long; until the in-flight round finishes, PD remains leaderless and client access is affected (see the sketch after this list).
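To make the log ordering above easier to follow, here is a minimal, simplified Go sketch. It is NOT PD's actual code; every identifier and duration is made up for illustration. It only shows the cancel-then-wait shutdown pattern that matches the logs: the coordinator cancels a shared context and waits on a WaitGroup, but each scheduler only observes the cancellation between full scheduling passes, so a single slow pass delays "coordinator has been stopped" by that pass's duration.

```go
// Minimal sketch, not PD's real code: illustrates why a cancel-then-wait
// shutdown blocks on the slowest in-flight scheduling pass.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// schedulePass stands in for one full scheduling round; in the incident
// this was minutes to hours per round across 2000 stores.
func schedulePass(d time.Duration) {
	time.Sleep(d)
}

func runScheduler(ctx context.Context, wg *sync.WaitGroup, name string, passCost time.Duration) {
	defer wg.Done()
	for {
		select {
		case <-ctx.Done():
			// Cancellation is only observed here, between passes.
			fmt.Printf("scheduler has been stopped [scheduler-name=%s] [error=%v]\n", name, ctx.Err())
			return
		default:
		}
		schedulePass(passCost) // no ctx check inside the pass itself
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup

	// Hypothetical per-pass costs, scaled down from the incident.
	passCosts := map[string]time.Duration{
		"balance-leader-scheduler":     200 * time.Millisecond,
		"balance-hot-region-scheduler": 1 * time.Second,
		"balance-region-scheduler":     3 * time.Second,
	}
	for name, cost := range passCosts {
		wg.Add(1)
		go runScheduler(ctx, &wg, name, cost)
	}

	time.Sleep(100 * time.Millisecond) // let each scheduler start a pass
	fmt.Println("coordinator is stopping")
	cancel()
	wg.Wait() // blocks until the slowest in-flight pass returns
	fmt.Println("coordinator has been stopped")
}
```

With per-pass costs of tens of minutes to hours on 2000 stores instead of seconds, the same pattern keeps the coordinator in the "stopping" state for the many hours seen in the log above.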