High CPU Load on PD Leader

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: pd leader cpu负载高

| username: wfxxh

[TiDB Usage Environment] Production environment
[TiDB Version] v5.4.3
[Encountered Issue: Symptoms and Impact] The PD leader experiences high CPU load at irregular intervals every day.
[Resource Configuration] 48 cores, 256G, (3T NVMe) * 2

This is a cold-data storage node that normally receives no access traffic. Every time monitoring shows high load, it has been confirmed to be caused by PD.

pd.log (790.0 KB)

| username: h5n1 | Original post link

Check the monitoring on the PD page.

| username: 我是咖啡哥 | Original post link

From the logs, it appears that most of the activity is related to write hotspot region scheduling. Could it be that a certain table has a severe write hotspot, causing PD to schedule frequently?
[operator=""move-hot-write-peer {mv peer: store [15] to [274433474]}

| username: wfxxh | Original post link

The timing doesn’t match.

| username: yilong | Original post link

In the Dashboard under “Advanced Debugging – Performance Analysis,” is continuous profiling enabled? If so, what consumes the most CPU in the PD call graph? If not, you can manually collect a profile while CPU usage is high and check it.

| username: wfxxh | Original post link

The CPU does appear to be consumed by the scheduler, but the timing doesn’t match my write operations.
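
For anyone who wants to try the manual collection suggested above, here is a minimal sketch. It assumes PD exposes the standard Go pprof endpoint under `/debug/pprof/` on its client port (2379 by default); the address `127.0.0.1:2379`, the output file name, and both helper function names are placeholders of mine, not something from this thread.

```python
import urllib.request


def pprof_cpu_url(pd_addr: str, seconds: int = 30) -> str:
    """Build the URL of the (assumed) Go pprof CPU profile endpoint on PD."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return f"http://{pd_addr}/debug/pprof/profile?seconds={seconds}"


def collect_cpu_profile(pd_addr: str, out_path: str, seconds: int = 30) -> int:
    """Download a CPU profile from PD and return the number of bytes written.

    The request blocks for roughly `seconds` while the process samples itself,
    so the HTTP timeout is set somewhat longer than the sampling window.
    """
    url = pprof_cpu_url(pd_addr, seconds)
    with urllib.request.urlopen(url, timeout=seconds + 30) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

Run something like `collect_cpu_profile("127.0.0.1:2379", "pd_cpu.prof", 30)` against the PD leader while the CPU is high, then open the file with `go tool pprof -http=:8080 pd_cpu.prof` to see which functions dominate.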

| username: yilong | Original post link

You can check the PD monitoring to see if the changes in regions and leaders match the CPU usage. Check if there is a lot of scheduling during CPU peak periods. It could be caused by hotspots, or by adding or removing nodes.
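
If the Grafana panels are hard to line up with the CPU peaks, the same check can be roughly done from pd.log. The sketch below counts "add operator" entries per minute and per operator type. The log line shape is an assumption on my part (a leading `[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]` timestamp and an `[operator="...` field like the one quoted earlier in this thread), and `operators_per_minute` is just a name I picked.

```python
import re
from collections import Counter

# Assumed timestamp prefix, e.g. "[2023/02/20 03:15:02.123 +08:00]".
# The capture group keeps the value down to the minute.
TS_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}")

# Assumed operator field, e.g. [operator="\"move-hot-write-peer {mv peer: ...
# The escaped or doubled quote before the name is optional.
OP_RE = re.compile(r'\[operator="\\?"?([a-z][a-z0-9-]*) ')


def operators_per_minute(lines):
    """Count "add operator" log lines keyed by (minute, operator type)."""
    counts = Counter()
    for line in lines:
        if '"add operator"' not in line:
            continue
        ts = TS_RE.match(line)
        op = OP_RE.search(line)
        if ts and op:
            counts[(ts.group(1), op.group(1))] += 1
    return counts
```

Calling `operators_per_minute(open("pd.log", encoding="utf-8"))` and looking at `.most_common(20)` shows which minutes had the most scheduling and what kind. If those minutes do not overlap with the CPU peaks, scheduling volume is probably not the cause.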

| username: wfxxh | Original post link

It doesn’t match either.

| username: 裤衩儿飞上天 | Original post link

Are any other applications co-deployed on node 10.1.3.121? Is there a scheduled task that runs every 12 hours?

| username: wfxxh | Original post link

It doesn’t occur at a fixed 12-hour interval, and it is indeed caused by PD.

| username: 裤衩儿飞上天 | Original post link

Check the monitoring on the PD page.

Post the monitoring and logs for the corresponding time period.

| username: wfxxh | Original post link

The logs were uploaded at the very beginning.

| username: 裤衩儿飞上天 | Original post link

Are PD, TiKV, and TiFlash all deployed on this node? Is anything else running there?

| username: wfxxh | Original post link

Yes, but it is indeed PD that is causing the high load.

| username: wfxxh | Original post link

So far, I have tried shutting down the services suspected of heavy write activity and disabling auto analyze, but the issue persists.

| username: wfxxh | Original post link

Help needed, bumping the thread.