Index Query Jitter Issue

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 索引查询抖动问题

| username: Hacker_5KEgzcj2

【TiDB Usage Environment】Production Environment
【TiDB Version】V7.1.1
【Encountered Issue: Phenomenon and Impact】

  1. The query hits the index, scanning between 14 and 700 rows, but execution time occasionally fluctuates from tens of milliseconds to several hundred milliseconds.



Execution Plan

| id | task | estRows | operator info | actRows | execution info | memory | disk |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IndexLookUp_10 | root | 16.03 |  | 411 | time:217.4ms, loops:2, index_task: {total_time: 126.2ms, fetch_handle: 126.2ms, build: 980ns, wait: 3.75µs}, table_task: {total_time: 91ms, num: 1, concurrency: 5}, next: {wait_index: 126.3ms, wait_table_lookup_build: 497.5µs, wait_table_lookup_resp: 90.4ms} | 94.5 KB | N/A |
| ├─IndexRangeScan_8(Build) | cop[tikv] | 16.03 | table:qa_wanliu_order_finished, index:idx_driver_id_finish_time(driver_id, finish_time), range:[71933 2023-12-05 00:00:00,71933 2023-12-15 23:00:00], keep order:false | 411 | time:126.2ms, loops:3, cop_task: {num: 2, max: 65.1ms, min: 60.9ms, avg: 63ms, p95: 65.1ms, max_proc_keys: 224, p95_proc_keys: 224, tot_proc: 820.9µs, tot_wait: 124.5ms, rpc_num: 2, rpc_time: 126ms, copr_cache_hit_ratio: 0.00, build_task_duration: 26.3µs, max_distsql_concurrency: 1}, tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:6, tasks:2}, scan_detail: {total_process_keys: 411, total_process_keys_size: 48498, total_keys: 413, get_snapshot_time: 70.2ms, rocksdb: {key_skipped_count: 411, block: {cache_hit_count: 27}}} | N/A | N/A |
| └─TableRowIDScan_9(Probe) | cop[tikv] | 16.03 | table:qa_wanliu_order_finished, keep order:false | 411 | time:90.4ms, loops:2, cop_task: {num: 34, max: 85.9ms, min: 2.14ms, avg: 15.5ms, p95: 56ms, max_proc_keys: 19, p95_proc_keys: 18, tot_proc: 22.5ms, tot_wait: 456.7ms, rpc_num: 34, rpc_time: 524.9ms, copr_cache_hit_ratio: 0.00, build_task_duration: 142.6µs, max_distsql_concurrency: 1, max_extra_concurrency: 14}, tikv_task:{proc max:31ms, min:0s, avg: 1.65ms, p80:1ms, p95:3ms, iters:34, tasks:34}, scan_detail: {total_process_keys: 411, total_process_keys_size: 120631, total_keys: 425, get_snapshot_time: 274ms, rocksdb: {key_skipped_count: 42, block: {cache_hit_count: 4352}}} | N/A | N/A |
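
For reference, based on the access object and range in the plan, the statement presumably looks something like the sketch below (the column list and exact predicate are assumptions); wrapping it in `EXPLAIN ANALYZE` produces the timing breakdown above:

```sql
-- Hypothetical reconstruction of the query behind the plan above;
-- the selected columns and the exact time bounds are assumptions.
EXPLAIN ANALYZE
SELECT *
FROM qa_wanliu_order_finished
WHERE driver_id = 71933
  AND finish_time BETWEEN '2023-12-05 00:00:00' AND '2023-12-15 23:00:00';
```

In this particular sample, most of the 217 ms is wait time rather than processing time (wait_index: 126.3ms on the index side; tot_wait: 456.7ms and get_snapshot_time: 274ms across 34 cop tasks on the probe side, versus tot_proc: 22.5ms for the 411 keys), which already hints at queueing on the TiKV side rather than scan volume.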

Overall QPS at the time

TiKV Monitoring

【Resource Configuration】
7 TB NVMe, 48 hyper-threads, 125 GB memory

| username: tidb狂热爱好者 | Original post link

It looks like the statistics are outdated.
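
For example, statistics freshness for this table can be checked (and refreshed) with something along these lines; the table name is taken from the execution plan above:

```sql
-- Check statistics health and modification counts for the table in question.
SHOW STATS_HEALTHY WHERE Table_name = 'qa_wanliu_order_finished';
SHOW STATS_META    WHERE Table_name = 'qa_wanliu_order_finished';

-- If Healthy is low or Modify_count is large, re-collect statistics.
ANALYZE TABLE qa_wanliu_order_finished;
```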

| username: Hacker_5KEgzcj2 | Original post link

But I just checked the basic information and found that the statistics are accurate.


| username: andone | Original post link

Is it possible that during the jitter, the TiKV node is scheduling regions? You can check the monitoring to see.
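
Besides the PD panels in Grafana, a rough way to check this from SQL is to sample the region and leader distribution of the table and its index during the jitter window and see whether leaders move between samples (names taken from the plan above):

```sql
-- Regions backing the table and the index used by the query.
-- If LEADER_STORE_ID values change between two runs taken during the
-- jitter window, leader/region scheduling is happening at that time.
SHOW TABLE qa_wanliu_order_finished REGIONS;
SHOW TABLE qa_wanliu_order_finished INDEX idx_driver_id_finish_time REGIONS;
```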

| username: Hacker_5KEgzcj2 | Original post link

Once the load reaches a certain scale, latency spikes to hundreds of milliseconds across the board. Yesterday, when this SQL was running at 1500 QPS, the latency increased many times over.

| username: tidb狂热爱好者 | Original post link

You are just wasting time waiting. You don’t even know how to read the monitoring.

| username: tidb狂热爱好者 | Original post link

The database is not just yours; there are other SQL queries running as well.

| username: Hacker_5KEgzcj2 | Original post link

Indeed, I’m not very good at it and have always been figuring it out on my own. Which monitoring metrics should I look at?

| username: Hacker_5KEgzcj2 | Original post link

This is the result of yesterday’s stress test. The SQL is controlled by us and the business side; we only increased the traffic for this particular SQL while the traffic for other SQLs remained unchanged, and then widespread latency occurred.

| username: 像风一样的男子 | Original post link

The increase in SQL latency is just the symptom: when QPS rises, latency across the entire server rises with it. Check the SQL statement analysis to see which statements consume the most resources.
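
The statement summary tables behind that analysis page can also be queried directly; a minimal sketch (latency columns in `statements_summary` are reported in nanoseconds, hence the conversions):

```sql
-- Top statements by total latency in the current summary window,
-- including average coprocessor process and wait time per execution.
SELECT digest_text,
       exec_count,
       sum_latency / 1e9 AS sum_latency_s,
       avg_latency / 1e6 AS avg_latency_ms,
       avg_process_time / 1e6 AS avg_cop_process_ms,
       avg_wait_time / 1e6 AS avg_cop_wait_ms
FROM information_schema.cluster_statements_summary
ORDER BY sum_latency DESC
LIMIT 10;
```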

| username: Hacker_5KEgzcj2 | Original post link

Latency changes

Statement analysis

| username: 小龙虾爱大龙虾 | Original post link

Check the TiKV read thread pool monitoring.

| username: Hacker_5KEgzcj2 | Original post link

Are these the ones?

| username: 小龙虾爱大龙虾 | Original post link

The read thread pool pressure is a bit high.
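
If the unified read pool is the bottleneck, it is worth comparing its thread budget against the 48 hyper-threads on the machine; the current TiKV setting can be read from SQL, for example:

```sql
-- Current unified read pool settings on every TiKV instance.
-- By default, max-thread-count is roughly 80% of the CPU cores;
-- compare this against the Unified Read Pool CPU panel in Grafana.
SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'readpool.unified%';
```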

| username: forever | Original post link

Is there a problem with the network?

| username: Jellybean | Original post link

The image clearly shows that this SQL statement spends most of its execution time in the Coprocessor stage.

Coprocessor refers to the operation of reading data from the TiKV storage layer during the execution of an SQL statement. The fact that this operation takes the longest time indicates that the most critical bottleneck is the prolonged time taken to read data from TiKV.

Focus on which TiKV nodes are primarily involved in the Coprocessor data reading. This can be viewed on the Dashboard. Then check the corresponding TiKV’s IO and Region scheduling status on the TiKV-Details Grafana monitoring panel to see if there are any obvious anomalies.
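
The slow query table also records which TiKV instance contributed the most coprocessor processing and wait time per statement, which is another way to find the hot stores; a sketch, filtered on the table from the plan:

```sql
-- Recent slow executions against this table, with the TiKV address that
-- spent the most coprocessor processing time (Cop_proc_addr) and the one
-- with the most coprocessor wait time (Cop_wait_addr).
SELECT time,
       query_time,
       cop_proc_avg,
       cop_proc_addr,
       cop_wait_avg,
       cop_wait_addr
FROM information_schema.cluster_slow_query
WHERE time > NOW() - INTERVAL 1 HOUR
  AND query LIKE '%qa_wanliu_order_finished%'
ORDER BY query_time DESC
LIMIT 20;
```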

| username: Jellybean | Original post link

Your chart shows that the Unified Read Pool CPU is very high, exceeding the warning line, which also indicates that there is significant pressure on data reading in TiKV.

Please confirm whether there were other large SQL queries executing and reading a large amount of data during this period, causing the cluster to slow down.
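
One quick way to check that at the moment the slowdown happens is the cluster process list; a sketch:

```sql
-- Statements currently running for more than 1 second on any TiDB instance,
-- with their memory usage; run this while the jitter is occurring.
SELECT instance, id, db, time, mem, state, info
FROM information_schema.cluster_processlist
WHERE command != 'Sleep'
  AND time > 1
ORDER BY time DESC;
```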

| username: Jellybean | Original post link

You can refer to the ideas in these two articles and specifically check the TiKV nodes with high read pressure to see if there are any read hotspot issues.
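
Besides the Key Visualizer in the Dashboard, the hot region table can be queried for this specific table and index; a sketch:

```sql
-- Regions of this table that PD currently considers read-hot.
SELECT db_name, table_name, index_name, region_id, max_hot_degree, flow_bytes
FROM information_schema.tidb_hot_regions
WHERE type = 'read'
  AND table_name = 'qa_wanliu_order_finished';
```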

| username: dba远航 | Original post link

When jitter occurs, are there any other slow SQL queries in the database? Unless you are testing in an isolated environment without any other interference.