[TiDB Usage Environment] Production Environment
[TiDB Version] 4.0
[Encountered Problem: Phenomenon and Impact]
One TiKV node in the production environment triggered a CPU alarm NODE_cpu_used_more_than_80%. The Grafana screenshots are as follows. I am a novice; could the experts guide me on how to troubleshoot the cause of the CPU alarm and how to pinpoint the SQL responsible? The issue occurred during a time period that is usually problem-free, and no new deployments had been made.
[Resource Configuration] Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
For version 4.0.2, you can run grep -i "expensive" tidb.log, then manually check the logs around the time of the issue to see which SQL statement looks suspicious.
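As a minimal sketch of that grep step, you could wrap the filter in a small helper. The log path and the timestamp prefix below are placeholders; point them at your deployment's actual tidb.log and the hour the alarm fired.

```shell
# Helper to pull expensive-query entries for a given hour from a TiDB log.
# Both arguments are illustrative; substitute your real log path and
# the timestamp prefix of the alarm window.
filter_expensive() {
  logfile=$1     # e.g. tidb.log (adjust to your deployment's log path)
  hour_prefix=$2 # e.g. "2024/01/15 14:" -- the hour the alarm fired
  grep -i "expensive_query" "$logfile" | grep "$hour_prefix"
}
```

Usage: filter_expensive tidb.log "2024/01/15 14:", then inspect the sql= field of each matching line for the suspect statement.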
Expensive records are not slow SQL, but rather expensive SQL: a query whose execution time or number of scanned rows exceeds a configured threshold. From the perspective of filtering SQL, this surfaces the statements most likely to be CPU-intensive.
SELECT FLOOR(UNIX_TIMESTAMP(MIN(summary_begin_time))) AS agg_begin_time,
       FLOOR(UNIX_TIMESTAMP(MAX(summary_end_time))) AS agg_end_time,
       ANY_VALUE(digest_text) AS agg_digest_text,
       ANY_VALUE(digest) AS agg_digest,
       SUM(exec_count) AS agg_exec_count,
       SUM(sum_latency) AS agg_sum_latency,
       MAX(max_latency) AS agg_max_latency,
       MIN(min_latency) AS agg_min_latency,
       CAST(SUM(exec_count * avg_latency) / SUM(exec_count) AS SIGNED) AS agg_avg_latency,
       CAST(SUM(exec_count * avg_mem) / SUM(exec_count) AS SIGNED) AS agg_avg_mem,
       MAX(max_mem) AS agg_max_mem,
       ANY_VALUE(schema_name) AS agg_schema_name,
       ANY_VALUE(plan_digest) AS agg_plan_digest,
       ANY_VALUE(query_sample_text) AS agg_query_sample_text,
       ANY_VALUE(index_names) AS agg_index_names
FROM `INFORMATION_SCHEMA`.`CLUSTER_STATEMENTS_SUMMARY_HISTORY`
WHERE index_names IS NULL
  AND query_sample_text > ''
GROUP BY schema_name, digest
ORDER BY agg_sum_latency DESC
LIMIT 1;
Confirm the alarm time period: first pin down the exact window of the alarm, then compare it against the relevant system logs and monitoring data to understand the system load when the alarm occurred.
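As one way to act on that step (a sketch; the BETWEEN bounds below are placeholders for your actual alarm window), you can pull the slowest statements recorded during the window from the cluster slow-query table:

```sql
-- Sketch: slowest statements inside the alarm window.
-- Replace the BETWEEN bounds with the actual alarm start/end times.
SELECT time, query_time, query
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE time BETWEEN '2024-01-15 14:00:00' AND '2024-01-15 15:00:00'
ORDER BY query_time DESC
LIMIT 10;
```

Cross-checking these results against the expensive-query log entries from the same window usually narrows the suspects quickly.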