Help Analyzing a Sudden Increase in TiKV Node CPU Usage That Slowed SQL Execution, Caused Business Timeouts, and Reduced QPS

translator_bot · June 23, 2024, 2:20am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点cpu使用率突然变高，sql耗时明显变长造成业务超时，同时qps下降此问题原因分析协助

| username: heloong

[TiDB Usage Environment]
Production Environment
[TiDB Version]
v5.0.6
[Encountered Problem]
Between 2022/10/08 15:56:00 and 16:28:00, online applications reported a large number of timeout errors.
[Reproduction Path] Operations performed that caused the problem
Not reproduced; the cluster recovered on its own.
[Problem Phenomenon and Impact]
During this period, the TiDB dashboard showed that all SQL statements, not just individual ones, were executing several times slower than usual, and QPS dropped significantly. Grafana showed CPU usage on all TiKV nodes close to 80%, with some nodes exceeding 80% and triggering alerts.
[Attachment]
The cluster runs on Alibaba Cloud. Each TiKV node has 8 cores and 64 GB of memory with local disks. Deployment information is as follows:

All parameter configurations use default values without adjustments.

Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.

| username: 张雨齐0720

Check the monitoring to find slow SQL queries and see if they are causing the issue.
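
One way to act on this suggestion, as a rough sketch: TiDB exposes slow-log entries from all TiDB instances through the `INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY` table, so the incident window can be queried directly. The helper name `build_slow_query_sql` and the default `limit` are hypothetical and only for illustration; the time window comes from the original post. Whether a statement appears here depends on the configured slow-log threshold.

```python
from datetime import datetime


def build_slow_query_sql(start: datetime, end: datetime, limit: int = 20) -> str:
    """Build a SQL statement listing the slowest statements logged in [start, end].

    Hypothetical helper: it only formats the query text; run the result with
    any MySQL-compatible client connected to TiDB.
    """
    if end <= start:
        raise ValueError("end must be after start")
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        "SELECT time, query_time, process_time, total_keys, process_keys, digest, query\n"
        "FROM information_schema.cluster_slow_query\n"
        f"WHERE time BETWEEN '{start.strftime(fmt)}' AND '{end.strftime(fmt)}'\n"
        "ORDER BY query_time DESC\n"
        f"LIMIT {int(limit)};"
    )


if __name__ == "__main__":
    # Incident window reported in the original post.
    print(build_slow_query_sql(datetime(2022, 10, 8, 15, 56), datetime(2022, 10, 8, 16, 28)))
```

A large gap between `total_keys` and `process_keys`, or a few digests dominating the result, would hint that specific statements are driving the load rather than being victims of it.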

| username: heloong

Our analysis shows that disk latency and network latency on the TiKV nodes are normal. Some other monitoring information is as follows:

| username: heloong

First, the monitoring shows that all SQL statements slowed down, which points to a system-wide issue.

| username: 大鱼海棠

Judging by the monitoring, this looks like a slow query issue, since unified read pool CPU usage rose significantly. The overall machine load should also be checked.
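
To check the unified read pool observation outside Grafana, the same data can be pulled from the Prometheus HTTP API (`/api/v1/query_range`). This is a minimal sketch under assumptions: the metric and label pattern below are what the TiKV Grafana dashboards commonly use and should be verified against your own Prometheus; the host name, the helper `build_query_range_url`, and the UTC+8 time zone are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumption: metric name and label pattern as used by the TiKV Grafana
# dashboards; confirm them in your own Prometheus before relying on this.
UNIFIED_READ_POOL_CPU = (
    'sum(rate(tikv_thread_cpu_seconds_total{name=~"unified_read_po.*"}[1m])) by (instance)'
)


def build_query_range_url(base_url: str, start: datetime, end: datetime,
                          step_seconds: int = 30,
                          query: str = UNIFIED_READ_POOL_CPU) -> str:
    """Build a Prometheus range-query URL for timezone-aware start/end times."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    params = {
        "query": query,
        "start": start.timestamp(),  # Prometheus accepts Unix timestamps
        "end": end.timestamp(),
        "step": step_seconds,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    tz = timezone(timedelta(hours=8))  # assumption: timestamps in the post are UTC+8
    print(build_query_range_url(
        "http://prometheus.example:9090",  # hypothetical address
        datetime(2022, 10, 8, 15, 56, tzinfo=tz),
        datetime(2022, 10, 8, 16, 28, tzinfo=tz),
    ))
```

Comparing the per-instance series against the same window on a normal day helps tell whether read load rose on every TiKV node or only on a few.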

| username: OnTheRoad

Write Stall?

| username: heloong

There was no write stall. Neither the logs nor the monitoring from that period show one.

Write latency did increase significantly, and the screenshots posted earlier also show a significant increase in read latency.

| username: OnTheRoad

Are there any clues at the system level?
How is the health of the table?
Are there any scheduled tasks running that are affecting TiDB?
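
If "health of the table" refers to statistics health (an assumption about the question's intent), TiDB reports it through the `SHOW STATS_HEALTHY` statement, whose rows contain the database name, table name, partition name, and a health score from 0 to 100. The sketch below filters such rows for tables under a threshold; the helper `unhealthy_tables`, the threshold of 80, and the sample rows are hypothetical and not taken from the poster's cluster.

```python
from typing import Iterable, List, NamedTuple, Tuple

# Statement to run against TiDB with any MySQL-compatible client.
STATS_HEALTHY_SQL = "SHOW STATS_HEALTHY;"


class TableHealth(NamedTuple):
    db_name: str
    table_name: str
    partition_name: str
    healthy: int  # 0-100, higher is better


def unhealthy_tables(rows: Iterable[Tuple[str, str, str, int]],
                     threshold: int = 80) -> List[TableHealth]:
    """Return rows whose health score is below threshold, worst first."""
    parsed = [TableHealth(*row) for row in rows]
    return sorted((r for r in parsed if r.healthy < threshold), key=lambda r: r.healthy)


if __name__ == "__main__":
    # Made-up rows shaped like SHOW STATS_HEALTHY output, for illustration only.
    sample = [
        ("app", "orders", "", 45),
        ("app", "users", "", 98),
        ("app", "events", "p202210", 70),
    ]
    for row in unhealthy_tables(sample):
        print(f"{row.db_name}.{row.table_name} healthy={row.healthy}")
```

Tables flagged this way are candidates for `ANALYZE TABLE`, since stale statistics can push the optimizer toward poor plans.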