TiDB Suddenly Causes Server CPU and Memory Usage to Reach 100%

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb 突然导致服务器CPU 内存 占用 100%

| username: yinyuncan

[TiDB Usage Environment] Production Environment
[TiDB Version] 7.5.0
[Reproduction Path] Some slow SQLs have high memory usage

[Encountered Problem: Problem Phenomenon and Impact]
TiDB suddenly caused the server’s CPU and memory usage to reach 100%. I hope the experts here can help me analyze the issue.

The server configuration is 16GB RAM and 4-core CPU.

I am a loyal fan of TiDB, but due to limited technical skills, I have no clue about performance analysis and tuning. I have to seek help from the experts here.

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
Slow SQL screenshot

storage.block-cache.capacity size

[Attachments: Screenshots/Logs/Monitoring]
This is a screenshot from Alibaba Cloud


These are some key logs (the full logs are in the attached file at the bottom). If other logs or screenshots are needed, I can provide them as long as I have them.

2024-01-13 09:07:50 (UTC+08:00)TiDB 10.0.0.76:4000[2pc.go:1783] ["schemaLeaseChecker is not set for this transaction"] [conn=2175271786] [session_alias=] [sessionID=2175271786] [startTS=446983849933537288] [checkTS=446983849933537289]
2024-01-13 09:08:15 (UTC+08:00)TiDB 10.0.0.76:4000[2pc.go:1783] ["schemaLeaseChecker is not set for this transaction"] [conn=2175271770] [session_alias=] [sessionID=2175271770] [startTS=446983856474292227] [checkTS=446983856474292231]
2024-01-13 09:09:02 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=181.679616ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/config\" "] [response="range_response_count:1 size:4921"] []
2024-01-13 09:09:02 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1823224743] range"] [detail="{range_begin:/pd/6963105917181191707/config; range_end:; response_count:1; response_revision:192676103; }"] [duration=261.92867ms] [start=2024/01/13 09:09:02.672 +08:00] [end=2024/01/13 09:09:02.934 +08:00] [steps="[\"trace[1823224743] 'range keys from in-memory index tree'  (duration: 107.11239ms)\",\"trace[1823224743] 'range keys from bolt db'  (duration: 71.409954ms)\"]"]
2024-01-13 09:09:03 (UTC+08:00)TiDB 10.0.0.76:4000[pd.go:152] ["get timestamp too slow"] ["cost time"=54.310922ms]
2024-01-13 09:09:04 (UTC+08:00)PD 10.0.0.76:2379[tso.go:334] ["clock offset"] [] [jet-lag=378.484356ms] [prev-physical=2024/01/13 09:09:03.820 +08:00] [now=2024/01/13 09:09:04.198 +08:00] [update-physical-interval=50ms]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=299.641743ms] [expected-duration=100ms] [prefix=] [request="header:<ID:13188664713108092049 > txn:<compare:<target:VALUE key:\"/pd/6963105917181191707/timestamp\" value_size:8 > success:<request_put:<key:\"/pd/6963105917181191707/timestamp\" value_size:8 >> failure:<>>"] [response=size:20] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[2018233445] linearizableReadLoop"] [detail="{readStateIndex:192720602; appliedIndex:192720601; }"] [duration=1.369837438s] [start=2024/01/13 09:09:04.336 +08:00] [end=2024/01/13 09:09:05.706 +08:00] [steps="[\"trace[2018233445] 'read index received'  (duration: 103.214821ms)\",\"trace[2018233445] 'applied index is now lower than readState.Index'  (duration: 1.266621552s)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=1.370092931s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[171133557] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676104; }"] [duration=1.370133565s] [start=2024/01/13 09:09:04.336 +08:00] [end=2024/01/13 09:09:05.707 +08:00] [steps="[\"trace[171133557] 'agreement among raft nodes before linearized reading'  (duration: 1.370042492s)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=1.370375971s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1220459767] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676104; }"] [duration=1.370461967s] [start=2024/01/13 09:09:04.337 +08:00] [end=2024/01/13 09:09:05.707 +08:00] [steps="[\"trace[1220459767] 'agreement among raft nodes before linearized reading'  (duration: 1.370341353s)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=997.964842ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1408137589] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676104; }"] [duration=997.988085ms] [start=2024/01/13 09:09:04.709 +08:00] [end=2024/01/13 09:09:05.707 +08:00] [steps="[\"trace[1408137589] 'agreement among raft nodes before linearized reading'  (duration: 997.941582ms)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=559.380158ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/gc/safe_point\" "] [response="range_response_count:1 size:79"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1513864565] range"] [detail="{range_begin:/pd/6963105917181191707/gc/safe_point; range_end:; response_count:1; response_revision:192676104; }"] [duration=559.454168ms] [start=2024/01/13 09:09:05.195 +08:00] [end=2024/01/13 09:09:05.755 +08:00] [steps="[\"trace[1513864565] 'agreement among raft nodes before linearized reading'  (duration: 511.642472ms)\",\"trace[1513864565] 'range keys from bolt db'  (duration: 47.677874ms)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=1.077325466s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/topology/tidb/\" range_end:\"/topology/tidb0\" "] [response="range_response_count:6 size:1002"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1651274640] range"] [detail="{range_begin:/topology/tidb/; range_end:/topology/tidb0; response_count:6; response_revision:192676104; }"] [duration=1.077543358s] [start=2024/01/13 09:09:04.677 +08:00] [end=2024/01/13 09:09:05.755 +08:00] [steps="[\"trace[1651274640] 'agreement among raft nodes before linearized reading'  (duration: 1.029783981s)\",\"trace[1651274640] 'range keys from bolt db'  (duration: 47.497177ms)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=198.724618ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/config\" "] [response="range_response_count:1 size:4921"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1889995612] range"] [detail="{range_begin:/pd/6963105917181191707/config; range_end:; response_count:1; response_revision:192676104; }"] [duration=198.813564ms] [start=2024/01/13 09:09:05.649 +08:00] [end=2024/01/13 09:09:05.847 +08:00] [steps="[\"trace[1889995612] 'agreement among raft nodes before linearized reading'  (duration: 58.764203ms)\",\"trace[1889995612] 'range keys from bolt db'  (duration: 139.893607ms)\"]"]
2024-01-13 09:09:06 (UTC+08:00)TiDB 10.0.0.76:4000[pd.go:152] ["get timestamp too slow"] ["cost time"=1.03219361s]
2024-01-13 09:09:06 (UTC+08:00)TiDB 10.0.0.76:4000[coprocessor.go:1330] ["[TIME_COP_WAIT] resp_time:489.624714ms txnStartTS:446983869764730881 region_id:228401 store_addr:10.0.0.74:20160 kv_process_ms:0 kv_wait_ms:0 kv_read_ms:0 processed_versions:0 total_versions:2 rocksdb_delete_skipped_count:0 rocksdb_key_skipped_count:0 rocksdb_cache_hit_count:8 rocksdb_read_count:0 rocksdb_read_byte:0"]
2024-01-13 09:09:09 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=123.162515ms] [expected-duration=100ms] [prefix=] [request="header:<ID:9606895634472091609 > put:<key:\"/tidb/server/minstartts/a69cb687-c1af-4963-aad7-5c4eb877835f\" value_size:18 lease:383523597613409156 >"] [response=size:7] []
2024-01-13 09:09:09 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=381.602581ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/config\" "] [response="range_response_count:1 size:4921"] []
2024-01-13 09:09:09 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1628403812] range"] [detail="{range_begin:/pd/6963105917181191707/config; range_end:; response_count:1; response_revision:192676109; }"] [duration=381.669999ms] [start=2024/01/13 09:09:09.345 +08:00] [end=2024/01/13 09:09:09.727 +08:00] [steps="[\"trace[1628403812] 'agreement among raft nodes before linearized reading'  (duration: 381.529783ms)\"]"]
2024-01-13 09:09:09 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[407178525] linearizableReadLoop"] [detail="{readStateIndex:192720606; appliedIndex:192720605; }"] [duration=381.396625ms] [start=2024/01/13 09:09:09.345 +08:00] [end=2024/01/13 09:09:09.727 +08:00] [steps="[\"trace[407178525] 'read index received'  (duration: 51.278827ms)\",\"trace[407178525] 'applied index is now lower than readState.Index'  (duration: 330.117183ms)\"]"]
2024-01-13 09:09:10 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[291352191] linearizableReadLoop"] [detail="{readStateIndex:192720608;

| username: zhanggame1 | Original post link

The configuration is too low; a 4-core CPU is definitely not enough.

| username: 春风十里 | Original post link

From the screenshot, tso_wait is 2.02s.
Then looking at the PD log:
[2024/01/13 09:09:13.519 +08:00] [Warn] [tso.go:334] ["clock offset"] [jet-lag=697.965522ms] [prev-physical=2024/01/13 09:09:12.821 +08:00] [now=2024/01/13 09:09:13.518 +08:00] [update-physical-interval=50ms]
[2024/01/13 09:09:14.449 +08:00] [Warn] [tso.go:334] ["clock offset"] [jet-lag=158.025296ms] [prev-physical=2024/01/13 09:09:14.291 +08:00] [now=2024/01/13 09:09:14.449 +08:00] [update-physical-interval=50ms]
[2024/01/13 09:17:28.917 +08:00] [Warn] [grpclog.go:60] ["grpc: Server.processUnaryRPC failed to write status connection error: desc = \"transport is closing\""]
[2024/01/13 09:17:28.923 +08:00] [Warn] [grpclog.go:60] ["grpc: Server.processUnaryRPC failed to write status connection error: desc = \"transport is closing\""]

The log jumps directly from 09:09 to 09:17, and the TiDB log does the same. Is the machine 10.0.0.76 a mixed deployment of PD, TiKV, and TiDB?
The suspicion is that the memory usage of some component or some SQL grew until memory hit 100%, which stalled PD and TiKV.
It is recommended to check the memory configuration:
TiDB Memory Control Documentation | PingCAP Documentation Center
TiKV Memory Parameter Performance Tuning | PingCAP Documentation Center
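
For reference, a minimal sketch of how the current memory-related settings can be checked from a SQL client on TiDB 7.5; the variable and config names below are the standard ones, but please verify them against the docs above:

-- TiDB-side memory controls (instance-wide limit, per-query quota, OOM behavior)
SHOW VARIABLES LIKE 'tidb_server_memory_limit';
SHOW VARIABLES LIKE 'tidb_mem_quota_query';
SHOW VARIABLES LIKE 'tidb_mem_oom_action';

-- TiKV block cache size on each store
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';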

| username: FutureDB | Original post link

Check the log directory under the TiDB server. There is a file that records OOM logs. See if there are any OOM SQL statements during that time period.
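
As a complementary check from a SQL client, the cluster slow-query table can be sorted by peak memory over the incident window seen in the logs (2024-01-13, roughly 09:00 to 09:20); this is only a sketch and assumes the default slow-log settings:

-- Memory-hungriest statements around the incident window
SELECT time, query_time, mem_max, disk_max, query
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2024-01-13 09:00:00' AND '2024-01-13 09:20:00'
ORDER BY mem_max DESC
LIMIT 10;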

| username: 江湖故人 | Original post link

Could you expand the green and blue SQL statements in the screenshot and post them?

| username: 江湖故人 | Original post link

Also, take a screenshot of the “Host Information” in the “Cluster Information” section.

| username: wangccsy | Original post link

Got it.

| username: 哈喽沃德 | Original post link

Is your CPU shared or dedicated?

| username: dba远航 | Original post link

Check the SQL statements and look for large transactions.

| username: Kongdom | Original post link

Following your own line of thinking, check the execution plans of the top SQL statements and post them here to see whether there is any room for optimization.
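
For example, a hedged sketch of pulling the heaviest statements and their recorded plans from the statements summary (enabled by default); once a specific statement is identified, EXPLAIN ANALYZE on it shows the actual runtime plan, though it re-executes the statement and adds load:

-- Top statements by total latency, with average memory and the recorded plan
SELECT digest_text, exec_count, sum_latency, avg_mem, plan
FROM information_schema.cluster_statements_summary
ORDER BY sum_latency DESC
LIMIT 5;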

| username: 小龙虾爱大龙虾 | Original post link

4 cores can easily be fully utilized.

| username: 像风一样的男子 | Original post link

The configuration is too low, and just a few slow queries will exhaust the resources.

| username: 路在何chu | Original post link

It feels like your issue is simply high concurrency, too many slow queries, and not enough CPU.

| username: 路在何chu | Original post link

Did you check the load average with top to see if it is very high?

| username: tidb菜鸟一只 | Original post link

Please share your topology diagram. It looks like PD, TiDB server, and TiKV server are deployed together, and the TiDB server is consuming all the physical memory, causing PD to freeze and become unresponsive; that is what took the whole cluster down from 9:09 to 9:17. If resources are limited, I suggest setting a memory limit on TiDB as well and, on OOM, cancelling the offending SQL directly. TiKV's storage.block-cache.capacity is currently 5 GB and can be reduced to 3 GB.
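
A minimal sketch of the limits being suggested here; the concrete values are only illustrative for a 16 GB mixed-deployment host and should be checked against the memory-tuning docs before being applied:

-- Cap the TiDB server's total memory usage (illustrative value)
SET GLOBAL tidb_server_memory_limit = '6GB';

-- Cancel statements that exceed the per-query quota instead of letting them run
SET GLOBAL tidb_mem_oom_action = 'CANCEL';
SET GLOBAL tidb_mem_quota_query = 1073741824;  -- 1 GiB per query, illustrative

-- Shrink the TiKV block cache from 5 GB to 3 GB, as suggested above
SET CONFIG tikv `storage.block-cache.capacity` = '3GB';

It may also be worth mirroring the block-cache change in the cluster topology configuration so a later config reload does not revert it.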

| username: 有猫万事足 | Original post link

A 2-second tso_wait is unacceptable.

This means that PD has almost no CPU time to provide TSO. The entire TiDB system heavily relies on PD’s TSO service. If this TSO service slows down, all services slow down.

At the very least, PD and TiKV should be separated. Although TiDB also consumes CPU, it is not as extreme as TiKV.
It is best to deploy PD separately.

| username: 麻烦是朋友 | Original post link

Take a look at the execution plan: check whether the sort can make use of the index's ordering. If not, consider creating a composite index (see the sketch after the SQL below).

SELECT
	record.license,
	record_type,
	record.park_name,
	image,
	in_time,
	out_time,
	record.money AS 'recordMoney',
	record.create_time AS 'createTime',
	record.update_time AS 'updateTime',
	record_status,
	record.park_id AS 'parkId',
	record.member_id,
	record.order_id,
	record.NO AS 'no',
	record.remark AS 'remark',
	record.`index`,
	record.id,
	pay_type,
	pay_no,
	pay_time,
	payment,
	porder.money,
	voucher_id,
	integral_id,
	invoice_id,
	discount_money,
	discount_voucher,
	discount_card,
	discount_integral,
	ppd.device_sn AS 'roadsSn',
	ppd.ip AS 'roadsIp',
	ppd.type AS 'roadsType',
	ppd.`name` AS 'roadsName',
	ps.`code`,
	ps.device_sn AS 'deviceSn',
	(
	SELECT
		group_concat(pi.url, '#', pi.type SEPARATOR ',') 
	FROM
		parking_image AS pi 
	WHERE
		pi.group_id = record.id 
		AND pi.is_deleted = 0 
	) AS img_url 
FROM
	parking_record record
	LEFT JOIN parking_pay_order porder ON record.order_id = porder.id 
	AND porder.`status` = 1
	LEFT JOIN parking_plc_device ppd ON record.plc_id = ppd.id
	LEFT JOIN parking_seats ps ON record.seats_id = ps.id 
WHERE
	record.is_deleted = 0 
	AND record.record_status = '1' 
	AND record.park_id = 1706920397874491394 
	AND record.tenant_id = '928703' 
	AND record.plc_id = 1706920676269858818 
	AND record.seats_id = 1743170094616018945 
ORDER BY
	record.in_time DESC 
	LIMIT 20
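
A hedged sketch of the kind of composite index meant above: the equality-filtered columns first and the sort column last, so ORDER BY record.in_time DESC can potentially be served in index order. The index name is made up, and the column order should be verified with EXPLAIN against real data:

-- Illustrative composite index for the filter + ORDER BY above (index name is hypothetical)
CREATE INDEX idx_record_tenant_park_plc_seats_intime
    ON parking_record (tenant_id, park_id, plc_id, seats_id, record_status, is_deleted, in_time);
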
| username: 这里介绍不了我 | Original post link

The configuration does seem to be insufficient.

| username: zhang_2023 | Original post link

The CPU might not be sufficient.

| username: andone | Original post link

Start with the TiDB Dashboard and use monitoring to identify and optimize slow SQL queries.