Analysis of Slow Query Causes in TiDB v6

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb-v6 慢查询原因分析

| username: 信仰在空中飘扬

[TiDB Usage Environment] Production Environment
[TiDB Version] tidb-v6.1.3
[Reproduction Path] Upgraded from 5.4.2 to 6.1.3
[Encountered Problem: Problem Phenomenon and Impact]
The request volume of one business SQL fluctuates and causes sudden spikes in TiKV CPU. Running the SQL manually and checking its execution plan shows that the correct index is used and that it finishes in about 0.05s. However, the slow query log records a large number of executions of this statement taking more than 0.5s, with a large wait_time value. How can I pinpoint the actual performance bottleneck?
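To compare the plan from a manual EXPLAIN with what actually happens under load, one option is to run EXPLAIN ANALYZE on the same statement and read the actual execution info of the cop tasks. This is only a sketch: the WHERE clause and the column names (device_id, code) are guessed from the index name idx_deviceid_code, so substitute the real business predicate.

-- Sketch only: replace the predicate with the one used by the business SQL.
EXPLAIN ANALYZE
SELECT * FROM tbl_device_phone_log
WHERE device_id = 'xxx' AND code = 'yyy';

The execution info column shows per-operator time, cop task counts, and scan details (total vs. processed keys), which helps tell whether the time is spent processing data or waiting in the coprocessor queue.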

[Attachments: Screenshots/Logs/Monitoring]

  1. OverView Monitoring



  2. Previously, the GC retention time was relatively long at 48h. After changing it to 20min, the jitter still occurs.

  3. Slow Query Information

MySQL [INFORMATION_SCHEMA]> select * from CLUSTER_SLOW_QUERY where time >'2023-01-16 14:50:01' and query like 'SELECT * FROM `tbl_device_phone_log`%' limit 1\G
*************************** 1. row ***************************
                     INSTANCE: 10.105.128.254:10094
                         Time: 2023-01-16 14:50:01.004791
                 Txn_start_ts: 438790206288560132
                         User: kunpengdm_app
                         Host: 10.108.83.77
                      Conn_ID: 7677287164388118587
             Exec_retry_count: 0
              Exec_retry_time: 0
                   Query_time: 0.883878422
                   Parse_time: 0
                 Compile_time: 0.000522455
                 Rewrite_time: 0.000089441
           Preproc_subqueries: 0
      Preproc_subqueries_time: 0
                Optimize_time: 0.000326578
                      Wait_TS: 0.000014092
                Prewrite_time: 0
    Wait_prewrite_binlog_time: 0
                  Commit_time: 0
           Get_commit_ts_time: 0
          Commit_backoff_time: 0
                Backoff_types: 
            Resolve_lock_time: 0
        Local_latch_wait_time: 0
                   Write_keys: 0
                   Write_size: 0
              Prewrite_region: 0
                    Txn_retry: 0
                     Cop_time: 1.6118884690000002
                 Process_time: 0.407
                    Wait_time: 10.181
                 Backoff_time: 0
                LockKeys_time: 0
                Request_count: 310
                   Total_keys: 12843
                 Process_keys: 12292
 Rocksdb_delete_skipped_count: 0
    Rocksdb_key_skipped_count: 1714
Rocksdb_block_cache_hit_count: 99926
     Rocksdb_block_read_count: 0
      Rocksdb_block_read_byte: 0
                           DB: kunpeng_dms_call
                  Index_names: [tbl_device_phone_log:idx_deviceid_code]
                  Is_internal: 0
                       Digest: 3c45957328e5e5c016cb360b20cbab5d94376adcd0bac898eff708c220567f2a
                        Stats: tbl_device_phone_log:438789586999574542
                 Cop_proc_avg: 0.001312903
                 Cop_proc_p90: 0.002
                 Cop_proc_max: 0.004
                Cop_proc_addr: 10.105.128.115:20175
                 Cop_wait_avg: 0.032841935
                 Cop_wait_p90: 0.116
                 Cop_wait_max: 0.622
                Cop_wait_addr: 10.105.128.115:20175
                      Mem_max: 194849
                     Disk_max: 0
                     KV_total: 10.704811784
                     PD_total: 0.000005371
                Backoff_total: 0
     Write_sql_response_total: 0.000003124
                  Result_rows: 1
               Backoff_Detail: 
                     Prepared: 1
                         Succ: 1
                IsExplicitTxn: 0
            IsWriteCacheTable: 0
              Plan_from_cache: 0
            Plan_from_binding: 0
             Has_more_results: 0
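Note that Wait_time (10.181s) in this record is accumulated across all 310 cop tasks (Request_count), which is why it can be far larger than Query_time. To see whether the wait is concentrated on a single TiKV instance, the same CLUSTER_SLOW_QUERY table can be aggregated by Cop_wait_addr; a sketch using the columns shown in the record above (adjust the time window and digest as needed):

SELECT Cop_wait_addr,
       COUNT(*)          AS cnt,
       AVG(Query_time)   AS avg_query_time,
       AVG(Wait_time)    AS avg_wait_time,
       MAX(Cop_wait_max) AS max_cop_wait
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Time > '2023-01-16 14:50:01'
  AND Digest = '3c45957328e5e5c016cb360b20cbab5d94376adcd0bac898eff708c220567f2a'
GROUP BY Cop_wait_addr
ORDER BY cnt DESC;

If one address dominates, the wait points to a single hot TiKV; if every address shows up, the pressure is cluster-wide.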

Plan Screenshot

TiKV-Details monitoring: the coprocessor section looks abnormal
tidb-kunpeng_dms-TiKV-Details_2023-01-16T08_03_13.545Z.json (30.5 MB)

dashboard Screenshot




How should this phenomenon be interpreted, and how should it be handled?

| username: xfworld | Original post link

You can check the health of this table; the values in the execution plan here seem a bit off.
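A minimal sketch of checking the table’s statistics health and refreshing it if needed (standard TiDB statements; database and table names taken from the slow query record above):

-- Health of the statistics for this table (100 means the stats are fresh).
SHOW STATS_HEALTHY WHERE Db_name = 'kunpeng_dms_call' AND Table_name = 'tbl_device_phone_log';

-- If the health score is low, re-collect the statistics.
ANALYZE TABLE kunpeng_dms_call.tbl_device_phone_log;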

What was running at that time? Does it really need to read this much data…

| username: 信仰在空中飘扬 | Original post link

The manually checked execution plan and table structure are as follows:

The health of the table is as follows:

Manual execution

At that time, I saw over 2000 requests within a few minutes in the slow log, corresponding to the Top SQL on the dashboard, as shown below:


My judgment is that the large number of SQL requests against tbl_device_phone_log kept the coprocessor busy, but I can’t work out the exact mechanism.

I don’t know if it’s related to the 48h GC retention I set earlier, which might have caused too many historical versions to be scanned. I’ll let it run overnight and check the monitoring again tomorrow.
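One way to check whether old MVCC versions are the issue is to compare Total_keys with Process_keys in CLUSTER_SLOW_QUERY (the same fields shown in the record above); a large gap, or high Rocksdb_delete_skipped_count / Rocksdb_key_skipped_count values, means many stale versions or tombstones are being skipped. A sketch, with the time window and digest to be adjusted:

SELECT Time,
       Total_keys,
       Process_keys,
       Total_keys - Process_keys AS skipped_versions,
       Rocksdb_delete_skipped_count,
       Rocksdb_key_skipped_count
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Time > '2023-01-16 14:50:01'
  AND Digest = '3c45957328e5e5c016cb360b20cbab5d94376adcd0bac898eff708c220567f2a'
ORDER BY Total_keys - Process_keys DESC
LIMIT 10;

In the record above, Total_keys (12843) and Process_keys (12292) are fairly close, so version buildup alone does not obviously explain the wait time.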

| username: xfworld | Original post link

This table basically isn’t getting its data from the block cache; the miss rate is…


Also, individual keys have many versions, so multiple reads are needed to get the result…

Refer to the cumulative number of cop requests~
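Besides the Grafana panel, the cumulative request counts can also be checked from SQL through the statements summary. This is only a sketch; it assumes the CLUSTER_STATEMENTS_SUMMARY columns of TiDB 6.x (EXEC_COUNT, SUM_COP_TASK_NUM, AVG_COP_WAIT_TIME), so adjust if your version differs.

SELECT DIGEST, EXEC_COUNT, SUM_COP_TASK_NUM, AVG_LATENCY, AVG_COP_WAIT_TIME
FROM INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY
WHERE SCHEMA_NAME = 'kunpeng_dms_call'
  AND TABLE_NAMES LIKE '%tbl_device_phone_log%'
ORDER BY EXEC_COUNT DESC
LIMIT 10;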

It is recommended to continue observing…

| username: 信仰在空中飘扬 | Original post link

Okay, thank you very much. I’ll take another look tomorrow. I see there haven’t been any major fluctuations in the last three hours.

| username: 胡杨树旁 | Original post link

Could you please explain what the maximum visible versions and the maximum encountered versions represent?

| username: 信仰在空中飘扬 | Original post link

From the monitoring, there were very few fluctuations after 7 PM last night. Around 11 PM, there was a sudden surge in requests, and the jitter started again.

There are still jitter phenomena observed this morning.

A slow query situation this morning

This time it seems to have read from the cache, but it’s still slow

Also, regarding 胡杨树旁’s question:
What do the maximum visible versions and the maximum encountered versions represent, respectively?

I have set gc_life_time to 20m; why are there still so many versions?

| username: xfworld | Original post link

Whether a version is visible or invisible depends on whether it is still valid.

Visible versions are valid data; invisible versions are stale data waiting to be cleaned up.

Together they make up all the version information for a key.

When actually reading, TiKV first tries the block cache, which is the fastest path; if the data is not in the cache it has to be read from disk. Under the MVCC mechanism (key + version), versions are stored in sorted order.

So a read does not fetch everything at once; it scans by range. The data it needs is not necessarily the newest version: it has to locate the valid version within the range, and every version it passes over on the way, including those pending cleanup, is counted as an encountered version.

Separately, I suggest you check for hotspot issues: is the data not dispersed well enough, so that a single node instance ends up handling more of the requests?
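A sketch of checking the region distribution and the current hot regions from SQL (standard TiDB statements and system tables; how to read the output depends on the cluster):

-- How the table's regions are distributed across stores.
SHOW TABLE kunpeng_dms_call.tbl_device_phone_log REGIONS;

-- Hot read/write regions currently reported by PD.
SELECT * FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS
WHERE DB_NAME = 'kunpeng_dms_call' AND TABLE_NAME = 'tbl_device_phone_log';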


Adjusting the GC retention time and the actual cleanup of old data versions are not synchronized: old versions are only removed after GC advances the safe point and compaction later reclaims the space.

If you are in a hurry to clean up old versions, you can trigger cleanup manually with the tikv-ctl tool. It is recommended to do this during off-peak hours to avoid causing jitter.
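To confirm whether GC has actually advanced after shortening the life time, the GC status rows in mysql.tidb can be checked (standard rows maintained by TiDB):

-- Current GC configuration and progress: life time, run interval,
-- last run time, and the current safe point.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_run_interval',
                        'tikv_gc_last_run_time', 'tikv_gc_safe_point');

Even after the safe point advances, disk space is only reclaimed once compaction removes the old versions, so the effect of shortening gc_life_time is not immediate.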

| username: 信仰在空中飘扬 | Original post link

For the hotspot issue, you can see from the monitoring graph I posted above that during the jitter, the CPU usage of all TiKV nodes spiked.


We won’t make any more changes before the end of the year; we’ll look into it after the holidays. Thanks!

| username: 胡杨树旁 | Original post link

Could it be that the data distribution is uneven, so that under some query conditions the matching data volume is large and consumes more resources, while under others the matching volume is small and consumes less?

| username: dbaspace | Original post link

The issue was caused by unsuitable indexing for the SQL query. After adding proper indexes, it stabilized. The author’s monitoring graph:

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.