Long-Tail Latency in TiDB SQL Queries: Over 99% of Requests Finish in Milliseconds, but Some Take 5-10 Seconds

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb sql查询出现长尾请求,超过百分之九十九的请求都是毫秒执行结束,但有部分需要5-10秒才执行结束

| username: TiDBer_AxdzSBqt

When querying TiDB, some requests are particularly slow. I compared the execution plans, and the difference is in the cop task details below:
{num: 8, max: 7.05s, min: 1.26ms, avg: 885ms, p95: 7.05s, max_proc_keys: 9184, p95_proc_keys: 9184, tot_proc: 7.07s, rpc_num: 8, rpc_time: 7.08s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 80}

{num: 4, max: 3.28ms, min: 1.25ms, avg: 1.89ms, p95: 3.28ms, max_proc_keys: 2016, p95_proc_keys: 2016, tot_proc: 3ms, rpc_num: 4, rpc_time: 7.47ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 80}

After comparing the fast and slow executions of the same SQL, I found that the fast runs generally send 4 requests when reading from TiKV, while the slow ones send 8 (rpc_num: 4 vs. 8 above). Is there some speculative execution or retry mechanism at work?
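
To pin down where the extra seconds go, one common approach (not from the original thread) is to run the statement with EXPLAIN ANALYZE, which prints the same cop task block quoted above, and to check the scan concurrency setting that the `distsql_concurrency: 80` field most likely reflects. A minimal sketch, assuming a placeholder table `t` and predicate:

```sql
-- Minimal sketch; table `t` and the predicate are placeholders, not from the thread.
-- EXPLAIN ANALYZE executes the query and reports per-operator runtime stats,
-- including the cop_task line quoted above (num, max, min, avg, p95, rpc_num, ...).
EXPLAIN ANALYZE SELECT * FROM t WHERE k = 1;

-- The distsql_concurrency value shown in the cop task stats is governed by
-- this variable (80 in both of the runs above).
SHOW VARIABLES LIKE 'tidb_distsql_scan_concurrency';
```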

| username: h5n1 | Original post link

The different cop task counts indicate that the amount of data requested also differs. It would be best to post the complete execution plans for both SQL queries. Additionally, you can use trace to see which region is causing the slowdown.
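
A hedged sketch of the trace approach suggested above, using TiDB's TRACE statement and SHOW TABLE ... REGIONS; the table name and predicate are placeholders:

```sql
-- Sketch only; table `t` and the predicate are placeholders.
-- TRACE runs the statement and returns a span tree showing how long each step
-- took, including the individual coprocessor requests sent per region.
TRACE FORMAT = 'row' SELECT * FROM t WHERE k = 1;

-- List the regions backing the table to see which region IDs and TiKV stores
-- the slow key range maps to.
SHOW TABLE t REGIONS;
```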

| username: yulei7633 | Original post link

You can check whether the slowness is caused by statistics collection. Take a look at the TiKV-Details > Coprocessor Overview monitoring panel. Your table is a partitioned table with quite a few partitions, and in most cases this kind of issue is related to statistics: collecting them consumes resources, which leads to temporary contention and waiting. I have encountered a similar phenomenon; my monitoring graph looked like this:
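
One way to verify this suspicion (not shown in the original post) is to check whether an auto-analyze job overlapped with the slow window. A sketch using TiDB's analyze status views, keeping to columns that exist across versions:

```sql
-- Sketch: check whether statistics collection was running when the slow
-- queries happened. Recently started analyze jobs appear here.
SHOW ANALYZE STATUS;

-- The same data is exposed as a table, which is easier to filter and sort.
SELECT table_schema, table_name, partition_name, job_info, start_time, state
FROM information_schema.analyze_status
ORDER BY start_time DESC
LIMIT 20;
```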

| username: yulei7633 | Original post link

Here is another monitoring graph; you can check whether yours looks similar. Mine looks like this:

| username: TiDBer_AxdzSBqt | Original post link

Thank you for the response. My situation looks different from yours: in my case the SELECT itself takes a long time, while statistics collection does not take much time.

| username: TiDBer_AxdzSBqt | Original post link

The execution plans are almost identical; the difference is only in what I posted above. The issue only appears under a large volume of requests from the application, which makes it difficult to reproduce manually.

| username: 人如其名 | Original post link

Is it really that hard to post a complete execution plan? The problem itself is already complex, and with so little information it is extremely difficult to make a judgment. At the very least, provide the complete execution plan for the slow run; ideally, also post the complete plan for the fast run.

| username: Kongdom | Original post link

Is it the same SQL?

If the same SQL is sometimes fast and sometimes slow, you might want to consider whether there are hotspots or if the cluster is busy during slow queries.

If it’s not the same SQL, it is recommended to upload the complete execution plan of the slow SQL. Sensitive information can be redacted.
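
To check the hotspot possibility mentioned above, one option (not from the thread) is TiDB's hot-region view; the database and table names below are placeholders:

```sql
-- Sketch: regions PD currently considers hot (read or write), filtered to the
-- table in question. The db_name / table_name values are placeholders.
SELECT db_name, table_name, index_name, region_id, type, max_hot_degree, flow_bytes
FROM information_schema.tidb_hot_regions
WHERE db_name = 'mydb' AND table_name = 't';
```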

| username: TiDBer_AxdzSBqt | Original post link

The SQL is the same, and the execution plans are exactly the same; if they were different, I wouldn't be asking. I also suspect that TiKV might be busy, but the monitoring doesn't look much different from usual.
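
Since the slow runs are hard to reproduce manually, the slow-query log can show whether the time went to coprocessor processing or to waiting. A sketch against INFORMATION_SCHEMA.SLOW_QUERY, with a placeholder time window and threshold:

```sql
-- Sketch: pull the slow executions and compare process vs. wait time on the
-- coprocessor side. The time window and the 1-second threshold are
-- placeholders; adjust them to the affected period.
SELECT time, query_time, process_time, wait_time, process_keys, query
FROM information_schema.slow_query
WHERE time BETWEEN '2024-01-01 00:00:00' AND '2024-01-01 01:00:00'
  AND query_time > 1
ORDER BY query_time DESC
LIMIT 10;
```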

| username: Kongdom | Original post link

Does that mean there is a hotspot?

| username: TiDBer_AxdzSBqt | Original post link

From the traffic graph it does look like a hotspot, likely caused by batch tasks on our end. I also suspect I/O pressure is a factor.
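
If the batch job is concentrating load on one region, a common mitigation (not mentioned in the thread) is to split the hot table or index into more regions; a hedged sketch with placeholder names and key ranges:

```sql
-- Sketch only: table `t`, index `idx_k`, and the key ranges are placeholders.
-- Splitting spreads the hot key range across more regions, so PD can balance
-- leaders across TiKV stores instead of funneling all requests to one store.
SPLIT TABLE t BETWEEN (0) AND (1000000) REGIONS 16;
SPLIT TABLE t INDEX idx_k BETWEEN ("a") AND ("z") REGIONS 16;
```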

| username: heiwandou | Original post link

Following along to learn.

| username: Kongdom | Original post link

:thinking: Personally, I think a hotspot is the more likely cause.