Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 高并发场景下,在机器CPU和内存还没到瓶颈的下,有极少部分SQL出现Parse time和generate plan time时间很长
[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.9
[Reproduction Path] Appears intermittently after stress testing for a period
20+ node cluster, 32c128g cloud host, TiDB and TiKV mixed deployment. A small number of SQL requests have parse time and generate plan time between 1-5 seconds, which is much higher than normal values, while the 99th percentile latency is normal.
- Monitoring shows that each node’s TiDB+TiKV uses about 1000% CPU, and the memory usage is around 80G in total, both not reaching the limit.
- For the TiDB instance with delayed SQL, at the time, the CPU and memory usage were indeed higher than other nodes, but still far from the machine’s limit.
The IO is within the normal range, and the execution plan of the delayed SQL statements is also normal. The time analysis in the slow query log shows that all the time is spent on parse time and generate plan time.
The main operations are batch updates and point queries, but it doesn’t seem to be specifically related to SQL because the execution is very fast after the execution plan is generated. The delay occurs during parsing and generating the plan. Additionally, high-latency SQLs during the same time period all appear on the same TiDB server, regardless of the specific SQL type.
Profiling found that gcbgmarkworker and other GC-related processes occupied 30% of the CPU of the tidb-server.
High concurrency, TiDB and TiKV mixed deployment. I think it’s normal for the CPU to become high. Concurrent data and SQL data need to be compared to draw a conclusion. SQL parsing and plan generation, as well as GC processing, if there is a situation of resource contention, data becoming high is also normal in my opinion. High anomalies also need data analysis.
However, monitoring shows that the CPU usage of TiDB and TiKV hasn’t even reached half of the machine’s capacity, but there is already a significant delay in SQL parsing and execution plan generation. It feels like the CPU resources are not being fully utilized.
Don’t mix deployments in the cloud.
Give priority to using the execution plan cache to skip the step of generating an execution plan.