Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 升级到5.4.2后,PD Leader的CPU一直占满
After upgrading from 5.2.2 to 5.4.2, the PD Leader has been in a state of full CPU usage for a week. I don’t know how to troubleshoot this.
First, take a look at the logs.
Are you referring to the log files in the PD installation directory?
[2022/08/08 09:46:27.508 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=293.575883ms] [prev-physical=2022/08/08 09:46:27.214 +08:00] [now=2022/08/08 09:46:27.508 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:28.109 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=601.258977ms] [prev-physical=2022/08/08 09:46:27.508 +08:00] [now=2022/08/08 09:46:28.109 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:28.602 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=194.190784ms] [prev-physical=2022/08/08 09:46:28.408 +08:00] [now=2022/08/08 09:46:28.602 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:29.003 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=291.640443ms] [prev-physical=2022/08/08 09:46:28.712 +08:00] [now=2022/08/08 09:46:29.003 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:30.020 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=1.016291668s] [prev-physical=2022/08/08 09:46:29.003 +08:00] [now=2022/08/08 09:46:30.019 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:31.006 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=202.307569ms] [prev-physical=2022/08/08 09:46:30.804 +08:00] [now=2022/08/08 09:46:31.006 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:32.008 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=295.607662ms] [prev-physical=2022/08/08 09:46:31.712 +08:00] [now=2022/08/08 09:46:32.008 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:32.408 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=204.619156ms] [prev-physical=2022/08/08 09:46:32.204 +08:00] [now=2022/08/08 09:46:32.408 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:35.407 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=103.486777ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/ddl/all_schema_versions" range_end:"/tidb/ddl/all_schema_versiont" “] [response=“range_response_count:4 size:424”]
[2022/08/08 09:46:35.408 +08:00] [INFO] [trace.go:145] [“trace[559434866] range”] [detail=”{range_begin:/tidb/ddl/all_schema_versions; range_end:/tidb/ddl/all_schema_versiont; response_count:4; response_revision:681330326; }”] [duration=103.552858ms] [start=2022/08/08 09:46:35.304 +08:00] [end=2022/08/08 09:46:35.407 +08:00] [steps=”["trace[559434866] ‘agreement among raft nodes before linearized reading’ (duration: 103.395062ms)"]”]
[2022/08/08 09:46:35.408 +08:00] [INFO] [trace.go:145] [“trace[131189320] put”] [detail=“{key:/topology/tidb/tidb-0002:4000/ttl; req_size:66; response_revision:681330327; }”] [duration=102.197829ms] [start=2022/08/08 09:46:35.305 +08:00] [end=2022/08/08 09:46:35.408 +08:00] [steps=“["trace[131189320] ‘process raft request’ (duration: 102.081561ms)"]”]
[2022/08/08 09:46:35.805 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=199.957909ms] [prev-physical=2022/08/08 09:46:35.605 +08:00] [now=2022/08/08 09:46:35.805 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:38.309 +08:00] [INFO] [cluster_worker.go:128] [“alloc ids for region split”] [region-id=6959609] [peer-ids=“[6959610,6959611,6959612]”]
[2022/08/08 09:46:38.402 +08:00] [INFO] [cluster_worker.go:220] [“region batch split, generate new regions”] [region-id=6953001] [origin=“id:6959609 start_key:"7480000000000364FF3900000000000000F8" end_key:"7480000000000364FF3C00000000000000F8" region_epoch:<conf_ver:6753 version:85488 > peers:<id:6959610 store_id:2853742 > peers:<id:6959611 store_id:2853733 > peers:<id:6959612 store_id:2853787 >”] [total=1]
[2022/08/08 09:46:38.402 +08:00] [INFO] [region.go:528] [“region Version changed”] [region-id=6953001] [detail=“StartKey Changed:{7480000000000364FF3900000000000000F8} → {7480000000000364FF3C00000000000000F8}, EndKey:{}”] [old-version=85487] [new-version=85488]
[2022/08/08 09:46:38.405 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=292.83899ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/tidb/store/gcworker/saved_safe_point" “] [response=“range_response_count:1 size:84”]
[2022/08/08 09:46:38.405 +08:00] [INFO] [trace.go:145] [“trace[455350557] range”] [detail=”{range_begin:/tidb/store/gcworker/saved_safe_point; range_end:; response_count:1; response_revision:681330343; }”] [duration=292.902662ms] [start=2022/08/08 09:46:38.112 +08:00] [end=2022/08/08 09:46:38.405 +08:00] [steps=”["trace[455350557] ‘agreement among raft nodes before linearized reading’ (duration: 292.759514ms)"]”]
[2022/08/08 09:46:38.405 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=193.120277ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/topology/tidb/" range_end:"/topology/tidb0" “] [response=“range_response_count:8 size:1375”]
[2022/08/08 09:46:38.405 +08:00] [INFO] [trace.go:145] [“trace[305887116] range”] [detail=”{range_begin:/topology/tidb/; range_end:/topology/tidb0; response_count:8; response_revision:681330343; }”] [duration=193.17967ms] [start=2022/08/08 09:46:38.212 +08:00] [end=2022/08/08 09:46:38.405 +08:00] [steps=”["trace[305887116] ‘agreement among raft nodes before linearized reading’ (duration: 193.013382ms)"]”]
[2022/08/08 09:46:38.410 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=1.607190539s] [prev-physical=2022/08/08 09:46:36.803 +08:00] [now=2022/08/08 09:46:38.410 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:40.404 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=501.991535ms] [prev-physical=2022/08/08 09:46:39.902 +08:00] [now=2022/08/08 09:46:40.404 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:41.217 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=308.272273ms] [prev-physical=2022/08/08 09:46:40.909 +08:00] [now=2022/08/08 09:46:41.217 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:41.702 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=186.389151ms] [prev-physical=2022/08/08 09:46:41.516 +08:00] [now=2022/08/08 09:46:41.702 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:42.502 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=196.190637ms] [prev-physical=2022/08/08 09:46:42.305 +08:00] [now=2022/08/08 09:46:42.502 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:42.709 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=204.088922ms] [prev-physical=2022/08/08 09:46:42.505 +08:00] [now=2022/08/08 09:46:42.709 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:42.803 +08:00] [WARN] [util.go:144] [“apply request took too long”] [took=288.651938ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:"/pd/7044348576070183058/gc/safe_point" “] [response=“range_response_count:1 size:81”]
[2022/08/08 09:46:42.803 +08:00] [INFO] [trace.go:145] [“trace[1774603517] range”] [detail=”{range_begin:/pd/7044348576070183058/gc/safe_point; range_end:; response_count:1; response_revision:681330347; }”] [duration=288.743843ms] [start=2022/08/08 09:46:42.515 +08:00] [end=2022/08/08 09:46:42.803 +08:00] [steps=”["trace[1774603517] ‘range keys from in-memory index tree’ (duration: 287.902006ms)"]”]
[2022/08/08 09:46:42.804 +08:00] [INFO] [operator_controller.go:437] [“add operator”] [region-id=4611780] [operator=“"transfer-hot-read-leader {transfer leader: store 2853733 to 2853742} (kind:hot-region,leader, region:4611780(12683,809), createAt:2022-08-08 09:46:42.804286055 +0800 CST m=+58403.773526759, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 2853733 to store 2853742])"”] [additional-info=]
[2022/08/08 09:46:42.804 +08:00] [INFO] [operator_controller.go:635] [“send schedule command”] [region-id=4611780] [step=“transfer leader from store 2853733 to store 2853742”] [source=create]
[2022/08/08 09:46:42.809 +08:00] [INFO] [region.go:549] [“leader changed”] [region-id=4611780] [from=2853733] [to=2853742]
[2022/08/08 09:46:42.809 +08:00] [INFO] [operator_controller.go:552] [“operator finish”] [region-id=4611780] [takes=4.746357ms] [operator=“"transfer-hot-read-leader {transfer leader: store 2853733 to 2853742} (kind:hot-region,leader, region:4611780(12683,809), createAt:2022-08-08 09:46:42.804286055 +0800 CST m=+58403.773526759, startAt:2022-08-08 09:46:42.804678425 +0800 CST m=+58403.773919141, currentStep:1, steps:[transfer leader from store 2853733 to store 2853742]) finished"”] [additional-info=]
[2022/08/08 09:46:45.202 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=196.708225ms] [prev-physical=2022/08/08 09:46:45.006 +08:00] [now=2022/08/08 09:46:45.202 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:45.605 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=197.798723ms] [prev-physical=2022/08/08 09:46:45.407 +08:00] [now=2022/08/08 09:46:45.605 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:46.602 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=189.783443ms] [prev-physical=2022/08/08 09:46:46.412 +08:00] [now=2022/08/08 09:46:46.602 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:46.802 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=196.359192ms] [prev-physical=2022/08/08 09:46:46.605 +08:00] [now=2022/08/08 09:46:46.802 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:47.005 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=191.775348ms] [prev-physical=2022/08/08 09:46:46.813 +08:00] [now=2022/08/08 09:46:47.005 +08:00] [update-physical-interval=50ms]
[2022/08/08 09:46:47.319 +08:00] [WARN] [tso.go:321] [“clock offset”] [jet-lag=214.364387ms] [prev-physical=2022/08/08 09:46:47.104 +08:00] [now=2022/08/08 09:46:47.319 +08
I don’t know if it will be useful.
Export the monitoring data for pd/tikv/tidb/overview as per the instructions above. Ensure that all panels are expanded before exporting.
Thank you, indeed after removing the scatter scheduling, the CPU usage has decreased. I will continue to monitor it.
Is table scatter scheduling used for manually balancing regions?
Yes, to solve the hotspot table issue before, I executed the operator add split-region command, which resulted in these table scatter tasks. But this hotspot issue should be a bug, and I haven’t encountered this problem since upgrading to version 5.4.2.
This topic will be automatically closed 60 days after the last reply. No new replies are allowed.