How to Locate the SQL When NODE_cpu_used_more_than_80% Is Triggered

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: NODE_cpu_used_more_than_80%, how to locate the SQL

| username: jeff

[TiDB Usage Environment] Production Environment
[TiDB Version] 4.0
[Encountered Problem: Phenomenon and Impact]
One TiKV node in the production environment triggered the CPU alarm NODE_cpu_used_more_than_80%; the Grafana screenshots are as follows. I am a novice, so could the experts guide me on how to troubleshoot the cause of the CPU alarm and how to pinpoint the SQL responsible? The issue occurred during a time period when there are usually no problems, and no new deployments were made.
[Resource Configuration] Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]



| username: Kongdom | Original post link

You can see the TOP SQL in the Dashboard.
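As a side note, in 4.0 the closest Dashboard page is SQL Statements, which reads from the statement summary tables, so statement summary collection has to be enabled. A minimal check, assuming the tidb_enable_stmt_summary variable is available in this deployment:

-- Statement summary must be on for the Dashboard's SQL Statements page
-- (and the INFORMATION_SCHEMA summary tables) to contain data.
SHOW VARIABLES LIKE 'tidb_enable_stmt_summary';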

| username: MrSylar | Original post link

For version 4.0.2, you can grep -i "expensive" tidb.log, then manually check the logs around the time of the issue to see which SQL statement looks suspicious.
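If the CPU spike is still going on while you investigate, the cluster processlist can complement the expensive log. A minimal sketch, assuming INFORMATION_SCHEMA.CLUSTER_PROCESSLIST is available in this 4.0 cluster; the 10-second cutoff is only an illustrative value:

-- List statements that have been running for a while on any TiDB instance;
-- long runners are candidates for the CPU consumer.
SELECT instance, user, db, time AS seconds_running, LEFT(info, 120) AS sql_text
FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
WHERE info IS NOT NULL
  AND time > 10
ORDER BY time DESC;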

| username: MrSylar | Original post link

Just curious, why does Grafana show only one tikv-server in the topology? Is it deployed as a single instance?

| username: jeff | Original post link

It was filtered in the Grafana panel above.

| username: MrSylar | Original post link

:grin: :grin:

| username: jeff | Original post link

There are too many entries in the top list, and no obviously slow SQL stands out.

| username: MrSylar | Original post link

Expensive records are not slow SQL but expensive SQL. A statement is classified as expensive when its execution time or the number of rows it scans exceeds a defined threshold. For the purpose of narrowing down SQL, it highlights which statements are likely to be the more CPU-intensive ones.
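For reference, both thresholds are configurable. A hedged sketch, assuming this deployment exposes the tidb_expensive_query_time_threshold system variable (the execution-time threshold; the row-count threshold is the log.expensive-threshold item in the TiDB config file):

-- Check the execution-time threshold (in seconds) above which a running
-- statement is written to the log as an expensive query.
SHOW VARIABLES LIKE 'tidb_expensive_query_time_threshold';

-- Temporarily lowering it makes borderline statements show up in the log;
-- the value 30 below is only an example.
SET GLOBAL tidb_expensive_query_time_threshold = 30;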

| username: jeff | Original post link

In the expensive log, only 3 ANALYZE operations were recorded. According to the Dashboard, these 3 operations completed in at most 30 seconds.

| username: jeff | Original post link

The key is that the load on the other 2 TiKV nodes is not high.

| username: MrSylar | Original post link

Unfortunately, version 4.0 does not have the Top SQL feature, so there is no perfect way to directly locate the problem statement.

| username: tidb狂热爱好者 | Original post link

You can see it in the TOP SQL section of the Dashboard.

| username: tidb狂热爱好者 | Original post link

Clearly the reads for this single SQL query are not spread out, so all the read pressure is concentrated on a single TiKV node.
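One way to sanity-check that is to look at the hot Regions. A minimal sketch, assuming INFORMATION_SCHEMA.TIDB_HOT_REGIONS is available in this version:

-- Show the hottest Regions and the tables/indexes they belong to; if they
-- all map to one table served by the busy TiKV store, this is a read
-- hotspot rather than a generally expensive SQL.
SELECT db_name, table_name, index_name, type, max_hot_degree, flow_bytes
FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS
ORDER BY flow_bytes DESC
LIMIT 10;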

| username: tidb狂热爱好者 | Original post link

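-- The query below scans the cluster-wide statement summary history for
-- statements that used no index and returns the one with the highest
-- total latency, a likely candidate for sustained CPU usage.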
SELECT FLOOR(UNIX_TIMESTAMP(MIN(summary_begin_time))) AS agg_begin_time, 
       FLOOR(UNIX_TIMESTAMP(MAX(summary_end_time))) AS agg_end_time, 
       ANY_VALUE(digest_text) AS agg_digest_text, 
       ANY_VALUE(digest) AS agg_digest, 
       SUM(exec_count) AS agg_exec_count, 
       SUM(sum_latency) AS agg_sum_latency, 
       MAX(max_latency) AS agg_max_latency, 
       MIN(min_latency) AS agg_min_latency, 
       CAST(SUM(exec_count * avg_latency) / SUM(exec_count) AS SIGNED) AS agg_avg_latency, 
       CAST(SUM(exec_count * avg_mem) / SUM(exec_count) AS SIGNED) AS agg_avg_mem, 
       MAX(max_mem) AS agg_max_mem, 
       ANY_VALUE(schema_name) AS agg_schema_name, 
       ANY_VALUE(plan_digest) AS agg_plan_digest,
       query_sample_text,
       index_names 
FROM `INFORMATION_SCHEMA`.`CLUSTER_STATEMENTS_SUMMARY_HISTORY` 
WHERE index_names IS NULL 
  AND query_sample_text > '' 
GROUP BY schema_name, digest 
ORDER BY agg_sum_latency DESC 
LIMIT 1;

| username: redgame | Original post link

Confirm the alarm time period: first, pin down the exact time window of the alarm and compare it with the relevant system logs and monitoring data to understand the system load at the moment the alarm occurred.

| username: zhanggame1 | Original post link

Check the Dashboard's top SQL to see what is consuming the CPU.

| username: 像风一样的男子 | Original post link

In version 4.0, you can only check the slow query log. Upgrading to a newer version makes this much more convenient.
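Besides reading the log file directly, 4.0 also exposes the slow log as a system table. A minimal sketch, assuming INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY is available; the time window is only a placeholder for the actual alarm period:

-- Aggregate slow-query records inside the alarm window and rank statement
-- digests by total query time.
SELECT digest,
       COUNT(*) AS exec_count,
       SUM(query_time) AS total_query_time,
       ANY_VALUE(query) AS sample_query
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE time BETWEEN '2023-08-01 10:00:00' AND '2023-08-01 11:00:00'
GROUP BY digest
ORDER BY total_query_time DESC
LIMIT 10;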

| username: ealam_小羽 | Original post link

Version 4.0 should also have the Dashboard; you can check it there:

Or look at the slow logs and expensive query logs:
Slow logs: Slow Query Log | PingCAP Archived Documentation
Expensive query logs: Identify Expensive Queries | PingCAP Archived Documentation

| username: jeff | Original post link

Do you mean that the table's Region leaders are not evenly distributed?
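If that is the suspicion, leader distribution can be checked from the system tables. A hedged sketch, assuming INFORMATION_SCHEMA.TIKV_REGION_STATUS and TIKV_REGION_PEERS are available; db_name and table_name are placeholders for the suspected hot table:

-- Count how many Region leaders of the suspected table sit on each TiKV
-- store; a heavily skewed count would explain one node carrying the load.
SELECT p.store_id, COUNT(*) AS leader_count
FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS s
JOIN INFORMATION_SCHEMA.TIKV_REGION_PEERS p
  ON s.region_id = p.region_id
WHERE p.is_leader = 1
  AND s.db_name = 'db_name'        -- placeholder
  AND s.table_name = 'table_name'  -- placeholder
GROUP BY p.store_id
ORDER BY leader_count DESC;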