SQL queries that are exceptionally slow during off-peak hours run fast when executed again

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 业务闲时出现异常慢查询sql语句拎出来查询又很快

| username: zhimadi

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.2
[Reproduction Path] Operations performed that led to the issue
There is a scheduled task at 3:41 AM that generates a statistical report and writes the results into table t1, clearing t1 first. The task runs daily and has worked without issue for two to three years. I checked today's logs: the number of inserted records is small, only 2,800.

[Encountered Problem: Problem Phenomenon and Impact]
There are unusually slow queries during off-peak business hours, yet the same SQL statements return quickly when executed during peak business hours. How should I troubleshoot the cause, and which dashboard metrics and parameters should I check to pinpoint the issue?

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

Slow Query:


| username: tidb菜鸟一只 | Original post link

Check the dashboard to see which part is slow.

| username: zhanggame1 | Original post link

Just because the statements look the same doesn't mean the data volume is the same. Check how much data is retrieved at the different points in time.

| username: zhimadi | Original post link

Which part is slow? I don’t quite understand, can you be more specific?

| username: zhimadi | Original post link

The same 3 or 4 SQL statements for the same business are slow. The amount of data retrieved at different times is about the same: they query the inventory table (about 5 million rows) joined to the product table (about 800,000 rows), and return anywhere from a few rows to about 2,000 (when querying only half a year).

| username: Kongdom | Original post link

There is a situation we ran into before where the slow queries were not the cause but the result: some data processing made the whole cluster respond slowly, which in turn turned these normally fast queries into slow queries. I recommend investigating by time period, not just those few slow query statements.
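To investigate by time period in TiDB, one sketch is to pull everything the cluster logged as slow around the scheduled task from `information_schema.cluster_slow_query`, rather than only the few known statements. The time window below is a placeholder; substitute the night the problem occurred:

```sql
-- List all slow queries the cluster logged around the 3:41 AM task window.
-- CLUSTER_SLOW_QUERY aggregates the slow logs of all TiDB instances.
SELECT time, instance, query_time, process_time, wait_time, query
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2023-08-01 03:30:00' AND '2023-08-01 04:00:00'  -- illustrative window
ORDER BY query_time DESC
LIMIT 20;
```

Anything heavy that overlaps the victim queries' time range is a candidate cause.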

| username: zhimadi | Original post link

Our system runs into this often :joy: It occasionally gets stuck, and then even SQL queries filtering on the primary key show up as very slow. We don't know how to solve it. But this time that shouldn't be the case; we've checked the periods both before and after.

| username: tidb菜鸟一只 | Original post link

Click on it, then select the execution time option to see exactly where it is slow.

| username: zhimadi | Original post link



| username: 人如其名 | Original post link

The cop_task requests are backlogged; optimize the other slow queries running in the same period.
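To see what else was competing for the TiKV coprocessor in that window, a hedged variation on the slow-query lookup ranks statements by coprocessor work. Column names follow the TiDB slow log schema, and the window is again a placeholder:

```sql
-- Rank statements in the window by coprocessor work: process_time is time
-- spent executing on TiKV, total_keys is how many keys were scanned.
SELECT query, query_time, process_time, wait_time, total_keys
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2023-08-01 03:30:00' AND '2023-08-01 04:00:00'  -- illustrative window
ORDER BY process_time DESC
LIMIT 20;
```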

| username: zhimadi | Original post link

But when I looked at the dashboard earlier, there were no slow query statements appearing.

| username: tidb菜鸟一只 | Original post link

So your first screenshot shows the SQL query took 15 seconds to run? Check how this SQL performs at other times.
The second SQL has a long TiKV coprocessor time, which could be due to high TiKV load or network pressure at that moment.
You can manually execute the SQL query with EXPLAIN ANALYZE to see which part is currently consuming the time.
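A minimal sketch of that check; the table and column names below are made up to mirror the inventory/product join described above, not the poster's actual SQL. EXPLAIN ANALYZE really executes the statement and reports per-operator timings and cop_task statistics in the execution info column:

```sql
-- EXPLAIN ANALYZE executes the query and reports actual per-operator
-- timings; compare a run during the slow window with a daytime run.
EXPLAIN ANALYZE
SELECT i.id, i.qty, p.name           -- illustrative schema, not the real one
FROM inventory i
JOIN product p ON p.id = i.product_id
WHERE i.created_at >= DATE_SUB(NOW(), INTERVAL 6 MONTH);
```

If the operator tree is the same in both runs but cop_task wait times balloon at night, the bottleneck is the cluster, not the plan.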

| username: zhimadi | Original post link

The first one takes about 0.14 seconds at other times, and these SQL statements are almost identical. At that moment a task was batch-inserting 2,800 report records. The table only has a few million rows; it shouldn't be that slow even without an index under normal circumstances.

| username: 裤衩儿飞上天 | Original post link

Please share the heatmap from that time point.

| username: zhimadi | Original post link

Heatmap, is this it?

| username: 裤衩儿飞上天 | Original post link

1. Can you show the execution plan? Just find an execution that takes around 15.1 seconds.

2. Is the yellow line in the red box your business table? Hovering the mouse over it displays the table information. (You can also check the read hotspot separately.)

| username: ealam_小羽 | Original post link

Check the IO together with the heatmap. I previously ran into large-table IO blocking other normal queries, which also showed up as long coprocessor times or network issues.

| username: zhanggame1 | Original post link

With Grafana installed, you can check the CPU and IO load of TiKV nodes during peak and off-peak times.

| username: redgame | Original post link

We encountered an issue with uneven data distribution here once, somewhat similar.

| username: kavenab | Original post link

Print out the execution plans at different times and check whether they are the same.
My guess is that locks appear intermittently. Are there any maintenance tasks configured to run during idle hours?
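One way to compare plans across time without re-running the statement, assuming the statement summary tables are enabled (they are by default): the `plan_digest` column in `information_schema.cluster_statements_summary_history` changes whenever the optimizer picks a different plan for the same normalized SQL. The LIKE filter and the nanosecond unit for latency columns are stated per the TiDB statement summary schema; the table name in the filter is illustrative.

```sql
-- A different plan_digest for the same normalized statement across summary
-- windows means the execution plan changed between those periods.
SELECT summary_begin_time, plan_digest, exec_count,
       avg_latency / 1000000000 AS avg_latency_seconds
FROM information_schema.cluster_statements_summary_history
WHERE digest_text LIKE '%inventory%'   -- illustrative filter
ORDER BY summary_begin_time;
```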