High CPU Usage on a Single PD in the Cluster

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群中单个pd cpu高

| username: TiDBer_yangxi

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] What operations were performed that caused the issue
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration]
PD, TiDB, DM-worker, and TiCDC are all deployed on the same machines, and a DM task source also runs on this IP.
[Attachments: Screenshots/Logs/Monitoring]

The following warnings appear in the cluster logs:
[2023/04/11 08:43:52.845 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=163.587758ms]
[2023/04/11 15:49:35.916 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=175.792499ms]
[2023/04/11 16:22:15.079 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=167.28991ms]
[2023/04/11 16:31:47.103 +08:00] [WARN] [pd.go:234] ["get timestamp too slow"] ["cost time"=142.124023ms]
[2023/04/11 17:02:24.281 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=171.315344ms]
[2023/04/11 18:10:14.655 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=222.338775ms]

I suspect the PD with high CPU usage is causing this.
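To get a feel for how severe these warnings are, the cost times can be pulled out of the log and summarized. The following is a minimal sketch, assuming the log lines have exactly the shape pasted above and that the cost is always reported in milliseconds; the function name `slow_tso_costs` is made up for illustration.

```python
import re
from statistics import mean

# Matches the warning lines quoted in this post. Both straight and curly
# quotes are accepted, because pasted logs often carry the latter.
# Assumption: the cost is always printed with an "ms" suffix.
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\]\s+\[WARN\]\s+\[pd\.go:\d+\]\s+'
    r'\[["“]get timestamp too slow["”]\]\s+'
    r'\[["“]cost time["”]=(?P<ms>[0-9.]+)ms\]'
)

def slow_tso_costs(lines):
    """Return (timestamp, cost_ms) pairs for every slow-TSO warning found."""
    out = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            out.append((m.group("ts"), float(m.group("ms"))))
    return out

# The six warnings from this post.
SAMPLE = [
    '[2023/04/11 08:43:52.845 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=163.587758ms]',
    '[2023/04/11 15:49:35.916 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=175.792499ms]',
    '[2023/04/11 16:22:15.079 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=167.28991ms]',
    '[2023/04/11 16:31:47.103 +08:00] [WARN] [pd.go:234] ["get timestamp too slow"] ["cost time"=142.124023ms]',
    '[2023/04/11 17:02:24.281 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=171.315344ms]',
    '[2023/04/11 18:10:14.655 +08:00] [WARN] [pd.go:99] ["get timestamp too slow"] ["cost time"=222.338775ms]',
]

costs = slow_tso_costs(SAMPLE)
values = [ms for _, ms in costs]
print(f"warnings={len(values)} mean={mean(values):.1f}ms max={max(values):.1f}ms")
```

Six warnings spread over roughly ten hours points to occasional spikes rather than sustained slowness, which is worth keeping in mind when comparing against the PD CPU graphs.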

| username: dbaspace | Original post link

Check the following three aspects:

  1. PD should use at least 4 CPU cores.
  2. Check if there are any TiKV restarts.
  3. In a mixed deployment, check if there is network pressure.

If writes are affected, the cluster slows down, or error 9001 (PD server timeout) appears, you can try transferring the PD leader to another member.
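A leader transfer is usually done with pd-ctl (`member leader show` and `member leader transfer <name>`). The sketch below only builds and prints the `tiup ctl` command lines so they can be reviewed before running; the version, PD address, and member name are hypothetical placeholders, not values from this thread.

```python
import shlex

def pd_ctl(version, pd_addr, *args):
    """Build a `tiup ctl` argv list for pd-ctl; nothing is executed here."""
    return ["tiup", f"ctl:{version}", "pd", "-u", pd_addr, *args]

# Hypothetical values: substitute your cluster version, a reachable PD
# endpoint, and the name of the member that should become leader.
VERSION = "v6.5.0"
PD_ADDR = "http://10.0.1.4:2379"
TARGET = "pd-2"

show_leader = pd_ctl(VERSION, PD_ADDR, "member", "leader", "show")
transfer = pd_ctl(VERSION, PD_ADDR, "member", "leader", "transfer", TARGET)

for cmd in (show_leader, transfer):
    print(shlex.join(cmd))

# To actually run a command after reviewing it:
#   import subprocess; subprocess.run(cmd, check=True)
```

Checking the current leader first avoids transferring to a member that already holds the role.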

| username: tidb狂热爱好者 | Original post link

Mixed deployment can easily cost you your job.

| username: tidb菜鸟一只 | Original post link

Are the other PDs also on mixed-deployment machines? How is the load on those machines? If they are idle, try transferring the leader to another PD node and see whether the problem persists.

| username: xingzhenxiang | Original post link

If this PD is the leader, try transferring the leader. If not, scale out a new PD on another machine and then scale in this one.
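With TiUP, replacing a PD node means running `tiup cluster scale-out` with a small topology file and then `tiup cluster scale-in --node` for the old node. The sketch below generates that topology text and the two command lines; the cluster name, file name, and IP addresses are hypothetical, and a real topology may need more fields (ports, directories) than this minimal one.

```python
import shlex

def scale_out_topology(new_pd_host):
    """Return a minimal TiUP scale-out topology that adds one PD node."""
    return f"pd_servers:\n  - host: {new_pd_host}\n"

def scale_commands(cluster, topology_file, old_pd_node):
    """Return the scale-out and scale-in argv lists, in the order to run them."""
    return [
        ["tiup", "cluster", "scale-out", cluster, topology_file],
        ["tiup", "cluster", "scale-in", cluster, "--node", old_pd_node],
    ]

# Hypothetical values for illustration only.
topology = scale_out_topology("10.0.1.9")
commands = scale_commands("prod-cluster", "scale-out-pd.yaml", "10.0.1.4:2379")

print(topology)
for cmd in commands:
    print(shlex.join(cmd))
```

Scaling out first keeps three PD members available throughout, so the cluster never drops below quorum while the old node is removed.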

| username: TiDBer_yangxi | Original post link

My main goal is to track down the “get timestamp too slow” warnings. I noticed that CPU usage of the leader pd-server is somewhat high, while the other two pd-servers, which are also on mixed-deployment machines, use very little CPU. I’m not sure whether this is normal for a leader. Overall CPU usage on each of the three machines is low, under 10%.

| username: knull | Original post link

You can check the “PD TSO Wait/RPC Duration” panel on the Performance Overview dashboard at the corresponding time points. The original post included a screenshot of this panel.
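The same data can also be pulled from Prometheus directly through its `query_range` HTTP API. The sketch below builds the queries and URLs around the last warning in the log (18:10:14 +08:00). The two metric names are my assumption of what the panel uses and should be checked against the panel's own query in Grafana; the Prometheus address is a placeholder.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumed metric names behind the panel; verify them in Grafana
# ("Edit" on the panel) before relying on the results.
WAIT_METRIC = "pd_client_cmd_handle_cmds_duration_seconds_bucket"
RPC_METRIC = "pd_client_request_handle_requests_duration_seconds_bucket"

def p99(metric, type_label, window="1m"):
    """Build a PromQL expression for the 99th percentile of a histogram."""
    return (
        f'histogram_quantile(0.99, '
        f'sum(rate({metric}{{type="{type_label}"}}[{window}])) by (le))'
    )

def query_range_url(prom_addr, query, start, end, step="15s"):
    """Build a Prometheus /api/v1/query_range URL; no request is sent."""
    params = {
        "query": query,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step,
    }
    return f"{prom_addr}/api/v1/query_range?{urlencode(params)}"

# Ten-minute window around the 18:10:14 warning.
start = datetime.fromisoformat("2023-04-11T18:05:00+08:00")
end = datetime.fromisoformat("2023-04-11T18:15:00+08:00")

PROM_ADDR = "http://10.0.1.10:9090"  # hypothetical Prometheus address
wait_q = p99(WAIT_METRIC, "wait")
rpc_q = p99(RPC_METRIC, "tso")

for q in (wait_q, rpc_q):
    print(query_range_url(PROM_ADDR, q, start, end))
```

If the wait duration is high while the RPC duration stays low, the delay is more likely on the TiDB side than in the PD leader itself.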

| username: TiDBer_qijtMMBk | Original post link

Encountered the same problem, has it been resolved?

| username: zxgaa | Original post link

TiKV is consuming CPU.

| username: Kongdom | Original post link

It is recommended to start a new thread to seek help, as new threads can get more attention.

Are you also using a mixed deployment? It is advised not to use a mixed deployment, or to implement resource control to avoid resource contention.