How to Find the SQL Statements Consuming the Most CPU Resources in TiFlash?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何找到tiflash里最占cpu资源的sql?

| username: 华健-梦诚科技

Phenomenon Description:
In the online cluster, several TiFlash nodes suddenly have their CPUs fully utilized.
Based on experience, it should be a batch of bad SQL queries.

Question:
How can we find the SQL queries that consume the most CPU in TiFlash?
In the dashboard’s top SQL, only TiDB and TiKV nodes are visible.

| username: TiDBer_vfJBUcxl | Original post link

In the slow SQL log
Find the log file path using show variables like 'tidb_slow_query_file';

| username: TiDBer_vfJBUcxl | Original post link

【【High CPU Usage in TiDB Part 1】Scheduler Worker CPU】您的访问出错了

| username: tidb菜鸟一只 | Original post link

SELECT QUERY, query_time, stats
FROM information_schema.slow_query
WHERE Plan LIKE '%tiflash%'
ORDER BY Query_time DESC;

Take a look.

| username: 华健-梦诚科技 | Original post link

Thank you both for your help.
It’s difficult to identify slow SQL queries. Once the CPU usage spikes, all SQL queries become slow.
In a production environment with a QPS of over 100, it’s hard to manually identify issues. We can only rely on statistical information like Top SQL.

| username: 有猫万事足 | Original post link

It’s very difficult to find.
You can try using the SQL mentioned above to list all suspicious objects.
Then look into the execution plan field and find the actRows column with very large numbers.
The larger this number, the more resources it consumes, including CPU and memory.
However, I found that memory statistics might not be accurate; sometimes the memory is exhausted, but the statistics are still low.
But the common point is that both actRows and estRows numbers are very large. The larger they are, the more likely they are to cause issues.

| username: redgame | Original post link

Look for those with many return values and frequent executions.

| username: Jasper | Original post link

In fact, high CPU usage in TiFlash corresponds to scanning a large amount of data, which can be divided into two situations:

  1. Each scan does not involve much data, but the same type of SQL is executed many times, resulting in a large total amount of scanned data.
  2. Each execution involves scanning a large amount of data.

You can find the corresponding SQL through process_key and total_key in cluster_slow_query, and then use the execution plan to determine whether it was executed in TiKV or TiFlash.

You can refer to the following SQL (you need to modify the time range yourself):

SQL with the highest average scanned keys

select sum(total_keys)/count(digest) avg_keys, digest, min(query)
from CLUSTER_SLOW_QUERY
WHERE Time > '2022/02/12 12:00:00'
AND time < '2022/02/12 13:00:00'
group by digest
ORDER BY avg_keys DESC
LIMIT 10;

SQL with the most coprocessed keys within a period of time

select sum(Process_keys), digest, min(query) from CLUSTER_SLOW_QUERY
WHERE Time > '2022/02/12 12:00:00'
AND time < '2022/02/12 13:00:00'
group by digest
ORDER BY sum(Process_keys) DESC
LIMIT 10;
| username: WalterWj | Original post link

The new version of the dashboard has top SQL. Selecting a TiKV node will automatically pop up top SQL.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.