After upgrading to 5.4.2, the CPU of the PD Leader is always fully utilized

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 升级到5.4.2后,PD Leader的CPU一直占满

| username: Zealot

After upgrading from 5.2.2 to 5.4.2, the PD Leader has been in a state of full CPU usage for a week. I don’t know how to troubleshoot this.

| username: ti-tiger | Original post link

First, take a look at the logs.

| username: Zealot | Original post link

Are you referring to the log files in the PD installation directory?

| username: Zealot | Original post link

[2022/08/08 09:46:27.508 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=293.575883ms] [prev-physical=2022/08/08 09:46:27.214 +08:00] [now=2022/08/08 09:46:27.508 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:28.109 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=601.258977ms] [prev-physical=2022/08/08 09:46:27.508 +08:00] [now=2022/08/08 09:46:28.109 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:28.602 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=194.190784ms] [prev-physical=2022/08/08 09:46:28.408 +08:00] [now=2022/08/08 09:46:28.602 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:29.003 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=291.640443ms] [prev-physical=2022/08/08 09:46:28.712 +08:00] [now=2022/08/08 09:46:29.003 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:30.020 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=1.016291668s] [prev-physical=2022/08/08 09:46:29.003 +08:00] [now=2022/08/08 09:46:30.019 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:31.006 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=202.307569ms] [prev-physical=2022/08/08 09:46:30.804 +08:00] [now=2022/08/08 09:46:31.006 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:32.008 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=295.607662ms] [prev-physical=2022/08/08 09:46:31.712 +08:00] [now=2022/08/08 09:46:32.008 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:32.408 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=204.619156ms] [prev-physical=2022/08/08 09:46:32.204 +08:00] [now=2022/08/08 09:46:32.408 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:35.407 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=103.486777ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/ddl/all_schema_versions" range_end:"/tidb/ddl/all_schema_versiont" “] [response=“range_response_count:4 size:424”]
[2022/08/08 09:46:35.408 +08:00] [INFO] [trace.go:145] [“trace[559434866] range”] [detail=”{range_begin:/tidb/ddl/all_schema_versions; range_end:/tidb/ddl/all_schema_versiont; response_count:4; response_revision:681330326; }”] [duration=103.552858ms] [start=2022/08/08 09:46:35.304 +08:00] [end=2022/08/08 09:46:35.407 +08:00] [steps=”["trace[559434866] ‘agreement among raft nodes before linearized reading’ (duration: 103.395062ms)"]”]
[2022/08/08 09:46:35.408 +08:00] [INFO] [trace.go:145] [“trace[131189320] put”] [detail=“{key:/topology/tidb/tidb-0002:4000/ttl; req_size:66; response_revision:681330327; }”] [duration=102.197829ms] [start=2022/08/08 09:46:35.305 +08:00] [end=2022/08/08 09:46:35.408 +08:00] [steps=“["trace[131189320] ‘process raft request’ (duration: 102.081561ms)"]”]
[2022/08/08 09:46:35.805 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=199.957909ms] [prev-physical=2022/08/08 09:46:35.605 +08:00] [now=2022/08/08 09:46:35.805 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:38.309 +08:00] [INFO] [cluster_worker.go:128] [“alloc ids for region split”] [region-id=6959609] [peer-ids=“[6959610,6959611,6959612]”]
[2022/08/08 09:46:38.402 +08:00] [INFO] [cluster_worker.go:220] [“region batch split, generate new regions”] [region-id=6953001] [origin=“id:6959609 start_key:"7480000000000364FF3900000000000000F8" end_key:"7480000000000364FF3C00000000000000F8" region_epoch:<conf_ver:6753 version:85488 > peers:<id:6959610 store_id:2853742 > peers:<id:6959611 store_id:2853733 > peers:<id:6959612 store_id:2853787 >”] [total=1]
[2022/08/08 09:46:38.402 +08:00] [INFO] [region.go:528] [“region Version changed”] [region-id=6953001] [detail=“StartKey Changed:{7480000000000364FF3900000000000000F8} → {7480000000000364FF3C00000000000000F8}, EndKey:{}”] [old-version=85487] [new-version=85488]
[2022/08/08 09:46:38.405 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=292.83899ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/store/gcworker/saved_safe_point" “] [response=“range_response_count:1 size:84”]
[2022/08/08 09:46:38.405 +08:00] [INFO] [trace.go:145] [“trace[455350557] range”] [detail=”{range_begin:/tidb/store/gcworker/saved_safe_point; range_end:; response_count:1; response_revision:681330343; }”] [duration=292.902662ms] [start=2022/08/08 09:46:38.112 +08:00] [end=2022/08/08 09:46:38.405 +08:00] [steps=”["trace[455350557] ‘agreement among raft nodes before linearized reading’ (duration: 292.759514ms)"]”]
[2022/08/08 09:46:38.405 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=193.120277ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/topology/tidb/" range_end:"/topology/tidb0" “] [response=“range_response_count:8 size:1375”]
[2022/08/08 09:46:38.405 +08:00] [INFO] [trace.go:145] [“trace[305887116] range”] [detail=”{range_begin:/topology/tidb/; range_end:/topology/tidb0; response_count:8; response_revision:681330343; }”] [duration=193.17967ms] [start=2022/08/08 09:46:38.212 +08:00] [end=2022/08/08 09:46:38.405 +08:00] [steps=”["trace[305887116] ‘agreement among raft nodes before linearized reading’ (duration: 193.013382ms)"]”]
[2022/08/08 09:46:38.410 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=1.607190539s] [prev-physical=2022/08/08 09:46:36.803 +08:00] [now=2022/08/08 09:46:38.410 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:40.404 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=501.991535ms] [prev-physical=2022/08/08 09:46:39.902 +08:00] [now=2022/08/08 09:46:40.404 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:41.217 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=308.272273ms] [prev-physical=2022/08/08 09:46:40.909 +08:00] [now=2022/08/08 09:46:41.217 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:41.702 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=186.389151ms] [prev-physical=2022/08/08 09:46:41.516 +08:00] [now=2022/08/08 09:46:41.702 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:42.502 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=196.190637ms] [prev-physical=2022/08/08 09:46:42.305 +08:00] [now=2022/08/08 09:46:42.502 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:42.709 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=204.088922ms] [prev-physical=2022/08/08 09:46:42.505 +08:00] [now=2022/08/08 09:46:42.709 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:42.803 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=288.651938ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/pd/7044348576070183058/gc/safe_point" “] [response=“range_response_count:1 size:81”]
[2022/08/08 09:46:42.803 +08:00] [INFO] [trace.go:145] [“trace[1774603517] range”] [detail=”{range_begin:/pd/7044348576070183058/gc/safe_point; range_end:; response_count:1; response_revision:681330347; }”] [duration=288.743843ms] [start=2022/08/08 09:46:42.515 +08:00] [end=2022/08/08 09:46:42.803 +08:00] [steps=”["trace[1774603517] ‘range keys from in-memory index tree’ (duration: 287.902006ms)"]”]
[2022/08/08 09:46:42.804 +08:00] [INFO] [operator_controller.go:437] [“add operator”] [region-id=4611780] [operator=“"transfer-hot-read-leader {transfer leader: store 2853733 to 2853742} (kind:hot-region,leader, region:4611780(12683,809), createAt:2022-08-08 09:46:42.804286055 +0800 CST m=+58403.773526759, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 2853733 to store 2853742])"”] [additional-info=]
[2022/08/08 09:46:42.804 +08:00] [INFO] [operator_controller.go:635] [“send schedule command”] [region-id=4611780] [step=“transfer leader from store 2853733 to store 2853742”] [source=create]
[2022/08/08 09:46:42.809 +08:00] [INFO] [region.go:549] [“leader changed”] [region-id=4611780] [from=2853733] [to=2853742]
[2022/08/08 09:46:42.809 +08:00] [INFO] [operator_controller.go:552] [“operator finish”] [region-id=4611780] [takes=4.746357ms] [operator=“"transfer-hot-read-leader {transfer leader: store 2853733 to 2853742} (kind:hot-region,leader, region:4611780(12683,809), createAt:2022-08-08 09:46:42.804286055 +0800 CST m=+58403.773526759, startAt:2022-08-08 09:46:42.804678425 +0800 CST m=+58403.773919141, currentStep:1, steps:[transfer leader from store 2853733 to store 2853742]) finished"”] [additional-info=]
[2022/08/08 09:46:45.202 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=196.708225ms] [prev-physical=2022/08/08 09:46:45.006 +08:00] [now=2022/08/08 09:46:45.202 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:45.605 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=197.798723ms] [prev-physical=2022/08/08 09:46:45.407 +08:00] [now=2022/08/08 09:46:45.605 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:46.602 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=189.783443ms] [prev-physical=2022/08/08 09:46:46.412 +08:00] [now=2022/08/08 09:46:46.602 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:46.802 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=196.359192ms] [prev-physical=2022/08/08 09:46:46.605 +08:00] [now=2022/08/08 09:46:46.802 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:47.005 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=191.775348ms] [prev-physical=2022/08/08 09:46:46.813 +08:00] [now=2022/08/08 09:46:47.005 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:47.319 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=214.364387ms] [prev-physical=2022/08/08 09:46:47.104 +08:00] [now=2022/08/08 09:46:47.319 +08

| username: Zealot | Original post link

I don’t know if it will be useful.

| username: h5n1 | Original post link

Export the monitoring data for pd/tikv/tidb/overview as per the instructions above. Ensure that all panels are expanded before exporting.

| username: Zealot | Original post link

PD: tidb-prod-PD_2022-08-08T07_04_09.231Z.json (2.3 MB)

| username: Zealot | Original post link

TiDB: tidb-prod-TiDB_2022-08-08T07_00_32.815Z.json (3.4 MB)

| username: Zealot | Original post link

Overview: tidb-prod-Overview_2022-08-08T07_05_52.884Z.json (1.4 MB)

| username: Zealot | Original post link

TiKV: tidb-prod-TiKV-Details_2022-08-08T07_06_50.427Z.json (10.4 MB)

| username: h5n1 | Original post link

  1. There are still some tombstone store information residues, clean them up: pd-ctl -u http://pd_ip:2379 store remove-tombstone
    image
  2. There are a large number of table scatter schedules, it looks like each table has added this schedule. Use pd-ctl scheduler show/remove to check and clean up these scatter schedules. Generally, it is not necessary to add scatter schedules to each table, the balance scheduler can complete the balanced scheduling. Handle special cases separately.
| username: Zealot | Original post link

Thank you, indeed after removing the scatter scheduling, the CPU usage has decreased. I will continue to monitor it.

| username: HACK | Original post link

Is table scatter scheduling used for manually balancing regions?

| username: Zealot | Original post link

Yes, to solve the hotspot table issue before, I executed the operator add split-region command, which resulted in these table scatter tasks. But this hotspot issue should be a bug, and I haven’t encountered this problem since upgrading to version 5.4.2.

| username: system | Original post link

This topic will be automatically closed 60 days after the last reply. No new replies are allowed.