Intermittent High CPU Usage in TiFlash

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash cpu使用率间歇性高

| username: qiuxb

【TiDB Usage Environment】Production Environment or Test Environment or POC
v.4.0.8 Production Environment
【TiDB Version】
【Encountered Problem】
A large query was executed on TiFlash, causing the TiFlash node to restart. After restarting, the CPU usage remains very high.
Found some warnings in the error log: WARNING 2022.08.04 17:30:52.218460 [884327] CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): LockException: region 466440, message:

Currently unsure of the reason for the high TiFlash CPU usage??
【Reproduction Path】What operations were performed to encounter the problem
【Problem Phenomenon and Impact】

【Attachments】Related logs and monitoring (https://metricstool.pingcap.com/)


If the question is related to performance optimization or fault troubleshooting, please download the script and run it. Please select all and copy-paste the terminal output results for upload.

| username: forever | Original post link

Check which process is consuming the most CPU in the top command on the Tiflash node.

| username: qiuxb | Original post link

| username: Min_Chen | Original post link

Hello, could you check if there are continuous requests coming to TiFlash? You can monitor Clinic-Cluster Coprocessor Request QPS and Handling Request Number.

| username: alfred | Original post link

Check which processes are consuming high CPU and further trace it to the SQL.

| username: qiuxb | Original post link

Are you using perf top -p to see which internal processes consume the most CPU?

Using top, the %CPU indicator for the tiflash process is as high as 3000%.

Currently, some time-consuming SQL executions have been identified:

SELECT report_time, SUM(x1), SUM(xx2), ... (about 70 sum indicators)
WHERE report_time >= xxx AND report_time <= xxx
GROUP BY report_time;

Looking at the time range, each query covers one day, and the execution plan shows that the number of scanned rows is about 3.8 million to 4 million. The execution time is quite unstable, with some queries completing in around 4 seconds, while others take up to 100 seconds. It is unclear whether this is related to the issue.

| username: qiuxb | Original post link

After a period of time, there will be 800-1000 requests, and the two metrics correspond very well.

| username: qiuxb | Original post link

The image is not visible. Please provide the text you need translated.

| username: flow-PingCAP | Original post link

So the increase in CPU should be expected. The large difference in execution time needs further investigation; you can check the TiFlash Grafana monitoring and logs for any anomalies. However, your version 4.0.8 is quite outdated. You might consider upgrading to the latest version 5.4 or 6.1.

| username: qiuxb | Original post link

Is a request QPS of 1000 considered high? There are some warnings in the logs: WARNING 2022.08.04 17:30:52.218460 [ 884327 ] CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): LockException: region 466440, message:

| username: flow-PingCAP | Original post link

Each coprocessor query targets a single region. For AP requests, a request QPS of up to 1000 is indeed quite high. The optimization in version 5.4 will try to use batch requests (super_batch or MPP) as much as possible, significantly improving query efficiency.

| username: flow-PingCAP | Original post link

This LockException is expected. It occurs when reading and encountering an uncommitted transaction, resulting in this error. At this point, TiFlash will wait for the transaction to commit before starting the query, which is indeed a possible reason for the query slowdown. You can check the TiFlash summary monitoring of Coprocessor / Error QPS to see if there are many occurrences.

| username: neolithic | Original post link

First, you should start from the OS layer and identify the specific process before breaking it down.

| username: qiuxb | Original post link

I checked using top, and it shows that TiFlash is occupying a relatively high amount. Do you mean which TiFlash subprocess is using more CPU?

| username: qiuxb | Original post link

The highest Error QPS is 15, normally it’s 1-2, or even none, overall it’s very low.