TiDB SQL Execution Time Is Relatively Long Under High Concurrency

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB执行在并发情况下sql执行时间比较长

| username: TiDBer_MCW4A4Ie

【TiDB Usage Environment】Production Environment
【TiDB Version】7.2
【Reproduction Path】SQL execution time exceeds 10 seconds under approximately 1000 concurrent requests
【Encountered Problem: Phenomenon and Impact】
Because SQL execution takes so long, the concurrency level does not reach the expected value, so I am asking the experts for help. Details are shown in the screenshots below.

【Resource Configuration】

【Attachments: Screenshots/Logs/Monitoring】


| username: 裤衩儿飞上天 | Original post link

  1. I don’t know when your peak concurrency period is. Is it the peak shown in the monitoring graph? If so, looking at individual nodes, TiDB and TiKV CPU usage reached 400% and 800% respectively, and some TiKV nodes show higher IO. Consider the possibility of data hotspots, and watch whether CPU and disk IO become bottlenecks during high-concurrency periods.
  2. What is the high-concurrency operation, a read or a write? Is that SQL statement the one being run at high concurrency?

Additionally, 7.2 is not an LTS version. It is recommended to use an LTS version.

| username: 小龙虾爱大龙虾 | Original post link

1000 concurrent requests? How is your TiDB configured, what type of SQL are you running, what is the cluster resource usage like, and are the read thread pools of TiKV currently maxed out?

| username: TiDBer_MCW4A4Ie | Original post link

Thank you very much for your response.

  1. The high concurrency corresponds to the peak in the monitoring graph. According to the monitoring, our servers have not hit a resource bottleneck. If it does turn out to be a hardware bottleneck, we plan to prepare a capacity report based on these screenshots and request more server resources from the higher department. But since no server bottleneck has been reached yet, we are not sure how to optimize this, and are wondering whether there are any configuration options to tune, such as thread pools.

  2. The high concurrency is a read operation:

     ```sql
     SELECT xxx_id
     FROM tb_xxxx
     WHERE xx_id = 1658553
       AND xxx_type = 'st'
       AND xx_state = 0
       AND valid_time > '2024-01-18 10:49:51.242';
     ```

We will be upgrading to version 7.5 soon.

| username: TiDBer_MCW4A4Ie | Original post link

Where can the read thread pool you mentioned be monitored or configured?

| username: 裤衩儿飞上天 | Original post link

  1. There might be a hotspot issue, which you can corroborate with the heatmap.
  2. Check the TiKV unified read pool CPU in the monitoring (see the sketch after this list).
  3. Check the execution plan of the concurrent SQL.
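
For reference, a minimal sketch of how both of these can be inspected from a SQL client. The statements below are standard TiDB admin SQL; the thread-count value at the end is purely an illustrative assumption, not a recommendation:

```sql
-- Inspect the TiKV unified read pool configuration across the cluster.
SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'readpool.unified%';

-- List the current hot regions, to corroborate what the heatmap shows.
SELECT * FROM information_schema.tidb_hot_regions;

-- readpool.unified.max-thread-count can be changed online if the pool
-- is saturated; the value 16 here is only an example.
SET CONFIG tikv `readpool.unified.max-thread-count` = 16;
```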

| username: FutureDB | Original post link

The cluster layout is a bit strange. Why are the TiDB and TiFlash nodes placed on the same machine, and why does that machine have less memory? TiFlash itself is quite memory-hungry, so wouldn’t this make TiDB more likely to OOM?

| username: FutureDB | Original post link

You can check whether the SQL’s execution plan has changed, and whether the plan is the same under high concurrency and low concurrency.
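
A minimal sketch of one way to check that from SQL, assuming the statement summary tables are enabled (they are by default); the LIKE pattern is just a placeholder for the statement in question:

```sql
-- If the same statement digest maps to more than one plan digest,
-- the execution plan has changed at some point.
SELECT digest_text, plan_digest, exec_count, avg_latency
FROM information_schema.statements_summary
WHERE digest_text LIKE '%tb_xxxx%';
```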

| username: TiDBer_MCW4A4Ie | Original post link

  1. Heatmap
  2. Indeed, the monitoring metrics show significant imbalance in CPU usage across nodes. What could be causing this issue?

| username: TiDBer_MCW4A4Ie | Original post link

Oh, yes, it’s my first time using TiDB and there are many things I haven’t figured out yet. I’m still learning. Later, we will re-plan the deployment based on the test report.

| username: TiDBer_MCW4A4Ie | Original post link

The execution plans for high concurrency and low concurrency are the same.
Execution plan under low concurrency:


Execution plan under high concurrency:
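
For reference, a text version of such a plan, including the per-operator execution info discussed in the next reply, can be obtained with EXPLAIN ANALYZE; this sketch simply reuses the query posted earlier:

```sql
-- EXPLAIN ANALYZE actually executes the statement and reports, per
-- operator, the real execution time and Coprocessor task statistics.
EXPLAIN ANALYZE
SELECT xxx_id
FROM tb_xxxx
WHERE xx_id = 1658553
  AND xxx_type = 'st'
  AND xx_state = 0
  AND valid_time > '2024-01-18 10:49:51.242';
```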

| username: FutureDB | Original post link

  1. It seems the Coprocessor execution time is roughly the same in both plans, but the Coprocessor wait time has increased significantly. Compare the individual time metrics in the execution info of the two plans to see which one shows the biggest discrepancy.
  2. Also check the TiKV panel in Grafana to see what is primarily driving the increase in CPU usage.

| username: wangccsy | Original post link

Specific situations require specific analysis.

| username: 裤衩儿飞上天 | Original post link

  1. That is a full table scan; try adding an appropriate index based on the query conditions (see the sketch after this list).
  2. Uneven CPU usage between nodes suggests that most of the table’s data sits on that node. Try to distribute the data more evenly.
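
As a sketch of what such an index could look like for the query posted earlier; the index name idx_query and the exact column order are assumptions for illustration:

```sql
-- Equality-filter columns first, then the range column last, so the
-- whole WHERE clause can be answered by a single index range scan.
CREATE INDEX idx_query ON tb_xxxx (xx_id, xxx_type, xx_state, valid_time);
```

Appending the selected column xxx_id to the index definition would make it covering, so the query could be answered without reading the table rows at all.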

| username: dba远航 | Original post link

It is normal for execution to take longer under concurrency. Check for conflicts between concurrent statements, whether the SQL uses appropriate indexes, and whether there are any unreasonably large transactions.
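
As a starting point for the conflict check, a minimal sketch using a TiDB system table; note that it only reports lock waits for pessimistic transactions:

```sql
-- Transactions currently blocked waiting on a lock held by another
-- transaction, including the SQL digests involved.
SELECT * FROM information_schema.data_lock_waits;
```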

| username: zhang_2023 | Original post link

It is recommended to check the index.

| username: 哈喽沃德 | Original post link

Take a look at the SQL execution plan to see whether it uses an index. You could also try lowering the concurrency to 500 and see the effect.

| username: Soysauce520 | Original post link

Check the read thread configuration of TiKV.

| username: zhaokede | Original post link

It still looks like a matter of insufficient resources. It would be better to deploy PD, TiDB server, and TiFlash on separate machines.