Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv 节点 unified read pool CPU轮流高问题
[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.3
[Reproduction Path]
[Encountered Problem: Phenomenon and Impact]
After upgrading from v5.1.4 to v6.5.3, the query jobs that run in the early morning have high latency. Both the slow query log and the monitoring show that the unified read pool CPU on TiKV is fully utilized. However, not all TiKV instances show high CPU at the same time; the spikes occasionally appear on different machines in turn.
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
Restart TiKV, ignoring resource usage for now.
[Attachments: Screenshots/Logs/Monitoring]
TiKV nodes take turns showing high CPU usage.
Coprocessor
The unified read pool is configured with 12 threads, all of which are fully utilized.
The execution plan confirms that the SQL statement uses the index. It returns quickly during working hours but gets stuck for a long time when the jobs run in the early morning. ANALYZE TABLE was executed on 12/13, and TiKV was reloaded on 12/13; we will see whether there is any improvement in the early morning of 12/14.
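For reference, the read pool settings in question can be confirmed from the TiDB side; a minimal sketch (the name filter is just an example):

-- Show the unified read pool configuration of every TiKV instance;
-- the 12-thread cap corresponds to readpool.unified.max-thread-count
SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'readpool.unified%';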
Open the dashboard and check the Top SQL for that period to see which statements dominate on this node. It looks like a read hotspot.
Was there any batch processing during that time period? Check the disk I/O.
That’s it, just optimize the SQL.
Check the TiKV leader distribution; it may be unbalanced, or there may be a query hotspot causing the issue.
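If you prefer SQL over Grafana for this check, a rough sketch using the standard information_schema views (the LIMIT is arbitrary):

-- Leader count per TiKV store; a skewed distribution suggests leader imbalance
SELECT store_id, COUNT(*) AS leader_count
FROM information_schema.tikv_region_peers
WHERE is_leader = 1
GROUP BY store_id;

-- Hot regions currently reported by PD, largest flow first
SELECT db_name, table_name, index_name, type, flow_bytes
FROM information_schema.tidb_hot_regions
ORDER BY flow_bytes DESC
LIMIT 10;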
PS: The original poster and I are in the same department. Last night we enabled the readpool.unified.auto-adjust-pool-size switch, and performance improved significantly. Our analysis suggests this is because the SQL queries are mostly concurrent and identical:
SELECT `t1`.`id`, `t1`.`ber`, `t1`.`type` FROM `ber` AS `t1` WHERE (`t1`.`id` > 6379887) ORDER BY `t1`.`id` LIMIT 1000;
Therefore, they should all be executed on the same KV, leading to a hotspot issue. Today, we are setting up Follower Read to see if it can alleviate the problem.
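For anyone following along, a minimal sketch of the two knobs mentioned here; whether readpool.unified.auto-adjust-pool-size can be changed online depends on the version, so treat the SET CONFIG line as an assumption and fall back to editing the TiKV config via tiup if it is rejected:

-- Allow reads to be served from followers as well as leaders, spreading the hotspot
SET GLOBAL tidb_replica_read = 'leader-and-follower';

-- Let TiKV size the unified read pool automatically (the switch enabled above);
-- if this item is not online-changeable on your version, set it in the TiKV config file instead
SET CONFIG tikv `readpool.unified.auto-adjust-pool-size` = true;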
On TiDB 7.5, I tried the following on a partitioned table with 180 million rows:
WHERE (`t1`.`id` > 6379887) ORDER BY `t1`.`id` LIMIT 1000
The execution plan is as shown above, and the speed is quite fast. It shouldn’t cause a large-scale scan on TiKV, which would lead to the unified read pool CPU being fully utilized.
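To go beyond the plan shape and see the actual work done on TiKV, EXPLAIN ANALYZE can be run on the statement quoted earlier (a sketch; it executes the query, so run it in a quiet window):

-- Executes the statement and reports actual rows, coprocessor tasks and time per operator
EXPLAIN ANALYZE
SELECT `t1`.`id`, `t1`.`ber`, `t1`.`type`
FROM `ber` AS `t1`
WHERE `t1`.`id` > 6379887
ORDER BY `t1`.`id`
LIMIT 1000;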
I feel that if your SQL execution is slow or scans a lot of data, it might still be an issue with the execution plan.
The execution plan itself looks fine. During the period when the issue occurred, we saw the same execution plan.

You can’t just guess. Check the slow query log for the SQL statements with a high number of scanned keys around the time of the issue.
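For example, a rough sketch of pulling the heaviest scanners from the slow log for the problem window (the timestamps are placeholders for the early-morning window mentioned above):

-- Slow queries ordered by the number of keys processed; cluster_slow_query covers all TiDB instances
SELECT time, query_time, process_keys, total_keys, query
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2023-12-14 00:00:00' AND '2023-12-14 06:00:00'
ORDER BY process_keys DESC
LIMIT 10;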
Check the TPS and QPS for that time period.
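And a rough way to gauge per-statement QPS for that window from SQL, assuming the statements summary history retention still covers it (timestamps are placeholders again):

-- Execution counts per statement digest during the window
SELECT summary_begin_time, exec_count, avg_latency, digest_text
FROM information_schema.cluster_statements_summary_history
WHERE summary_begin_time >= '2023-12-14 00:00:00'
  AND summary_end_time <= '2023-12-14 06:00:00'
ORDER BY exec_count DESC
LIMIT 10;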