Slow Concurrent Queries in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb并发查询慢

| username: TiDBer_AcAczQ2E

[TiDB Usage Environment] Production Environment / Testing / PCO
[TiDB Version] 5.7.25-TiDB-v6.1.0
[Encountered Problem]
SQL statement:

select
  T_58CD4F.`branch_name` as `__fcol_0`,
  sum(T_58CD4F.`number`) as `__fcol_1`
from (SELECT
    DISTINCT
        d.lines_name,
        d.branch_name,
        d.shop_name,
    di.delivery_id,
    di.sku,
        di.number
from (
        SELECT
                d.delivery_id,
                d.lines_name,
                b.`name` AS branch_name,
                s.`name` AS shop_name,
        d.erpsource
        FROM
                erp_sdb_wms_delivery d
                LEFT JOIN erp_sdb_ome_branch b ON b.branch_id = d.branch_id AND b.erpsource = d.erpsource
                LEFT JOIN erp_sdb_ome_shop s ON s.shop_id = d.shop_id AND s.erpsource = d.erpsource
        WHERE
                delivery_time BETWEEN UNIX_TIMESTAMP('2021-01-01') and  UNIX_TIMESTAMP(DATE_ADD('2022-08-01',INTERVAL 1 DAY ))
        ) d join ( SELECT delivery_id, bn AS sku, erpsource, number FROM erp_sdb_wms_delivery_items
     WHERE 1=1
     and bn in ('SOUFEEL_22','BZD008') ) di
          ON di.delivery_id = d.delivery_id and di.erpsource=d.erpsource
) as `T_58CD4F`
where 1 = 1
group by T_58CD4F.`branch_name`

[Problem Phenomenon and Impact]
Executing the SQL statement alone takes 7.3 seconds, but when multiple concurrent executions of this SQL statement occur, the time taken reaches 32 seconds. Observing the dashboard page, the CPU usage increases.


Please advise on how to improve query performance when facing multiple concurrent executions.

Slow log information:

# Time: 2022-09-07T11:50:36.185994767+08:00
# Txn_start_ts: 435820324708679681
# User@Host: root[root] @ 125.211.72.217 [125.211.72.217]
# Conn_ID: 3894492175815148323
# Query_time: 35.214570341
# Parse_time: 0.000344403
# Compile_time: 0.003411279
# Rewrite_time: 0.001141417
# Optimize_time: 0.001980015
# Wait_TS: 0.0000225
# Cop_time: 176.960387282 Process_time: 415.196 Wait_time: 269.759 Request_count: 2527 Process_keys: 4548617 Total_keys: 6660728 Rocksdb_delete_skipped_count: 932 Rocksdb_key_skipped_count: 5938109 Rocksdb_block_cache_hit_count: 30499722
# DB: bdata
# Index_names: [erp_sdb_wms_delivery:ind_delivery_time,erp_sdb_wms_delivery_items:ind_bn_status_code]
# Is_internal: false
# Digest: 3c42814a7dd1f443ae3da83f3924bc91f62b04865bacd82f2f3438be6f3c09c6
# Stats: erp_sdb_wms_delivery:435525279461408769,erp_sdb_ome_branch:pseudo,erp_sdb_wms_delivery_items:435803349122547715,erp_sdb_ome_shop:pseudo
# Num_cop_tasks: 2527
# Cop_proc_avg: 0.164303917 Cop_proc_p90: 0.607 Cop_proc_max: 2.341 Cop_proc_addr: 172.0.0.239:20160
# Cop_wait_avg: 0.106750692 Cop_wait_p90: 0.221 Cop_wait_max: 0.443 Cop_wait_addr: 172.0.0.239:20160
# Mem_max: 63699047
# Prepared: false
# Plan_from_cache: false
# Plan_from_binding: false
# Has_more_results: false
# KV_total: 699.761262391
# PD_total: 0.000010866
# Backoff_total: 0
# Write_sql_response_total: 0.000005886
# Result_rows: 3
# Succ: true
# IsExplicitTxn: false
| username: xiaohetao | Original post link

Where is the main slowdown during concurrency?

During concurrency, is the CPU, IO, and network traffic of each machine high?

| username: TiDBer_AcAczQ2E | Original post link

Only the CPU is high, currently TiDB is using a single node.

| username: xiaohetao | Original post link

When the CPU usage is high, is it TiDB that is high or KV that is high?

| username: TiDBer_AcAczQ2E | Original post link

Where can I see this? I only found the overall CPU chart.

| username: xfworld | Original post link

The query took a total of 35 seconds. Do you want to optimize this SQL or find out the reason why TiDB is slow?

  1. If you want to optimize the SQL, you need to look at the execution plan through explain sql and optimize it.

  2. Looking at the above record information, the slowness seems to be on the TiKV node; you can enable problem diagnosis methods to obtain information about the slow SQL.

  3. Use Prometheus to observe the resource execution status of the TiKV nodes in the cluster. Check if there are high CPU, high IO, or high memory situations at the same time.

Resource identification:

| username: TiDBer_AcAczQ2E | Original post link

The execution speed of a single SQL is normal, but when multiple clients execute this SQL simultaneously, the speed increases exponentially. Is there any way to solve this problem?

| username: xfworld | Original post link

How much slower will it become?

If the data volume is the same, will changing the structure make this process more stable?

There are many directions for optimization.

| username: alfred | Original post link

If it only runs slowly when the concurrency is high, it is likely due to resource shortages causing delays. You should further identify which type of resource (CPU, IO, memory) is insufficient, and also refer to the execution plan to optimize the SQL.

| username: TiDBer_AcAczQ2E | Original post link

With the increase in concurrency, the time grows exponentially. For example, executing a single SQL takes 12 seconds, executing two SQLs concurrently takes 16 seconds, and executing seven SQLs concurrently takes 46 seconds.

| username: TiDBer_AcAczQ2E | Original post link

[quote=“alfred, post:9, topic:933046”]
Resources

| username: xfworld | Original post link

What configuration and how much data?

| username: TiDBer_AcAczQ2E | Original post link

Plan

| username: xfworld | Original post link

A single node, mixed deployment? :joy:

| username: TiDBer_AcAczQ2E | Original post link

Hmm, a single node, but it shouldn’t be this slow. The query speed drops drastically just by executing a few SQL statements simultaneously.

| username: xfworld | Original post link

You should allocate more resources…
Not following the official documentation requirements at all, and then complaining about it being slow… You really know how to play…

| username: TiDBer_AcAczQ2E | Original post link

Isn’t a single machine enough? This is a test environment. By the way, the main table for the SQL query has millions of records, and the sub-table has tens of millions of records. The data volume isn’t particularly large.

| username: xfworld | Original post link

No significance :custard:

| username: TiDBer_AcAczQ2E | Original post link

The company’s budget isn’t that high :confounded: Can’t a single node handle it? :cold_sweat: It just needs to withstand a few dozen concurrent requests.

| username: OnTheRoad | Original post link

He has only one machine, running 1 TiDB, 1 PD, and 1 TiKV. It’s fortunate that he didn’t use TiFlash, otherwise, he might have to deploy a TiFlash as well. :sweat_smile: