In high concurrency scenarios, with CPU and memory not yet at their limits, a very small number of SQL queries experience long parse time and generate plan time

translator_bot · July 11, 2024, 7:28am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 高并发场景下，在机器CPU和内存还没到瓶颈的下，有极少部分SQL出现Parse time和generate plan time时间很长

| username: TiDBer_H1ifFPFW

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.9
[Reproduction Path] Appears intermittently after stress testing for a period

20+ node cluster, 32c128g cloud host, TiDB and TiKV mixed deployment. A small number of SQL requests have parse time and generate plan time between 1-5 seconds, which is much higher than normal values, while the 99th percentile latency is normal.

Monitoring shows that each node’s TiDB+TiKV uses about 1000% CPU, and the memory usage is around 80G in total, both not reaching the limit.
For the TiDB instance with delayed SQL, at the time, the CPU and memory usage were indeed higher than other nodes, but still far from the machine’s limit.

translator_bot · July 11, 2024, 9:27am

| username: 呢莫不爱吃鱼 | Original post link

Check TiDB logs: Review the detailed TiDB logs, especially during the times when SQL requests experience delays. The logs contain clues about query performance issues.
Analyze slow queries: Use TiDB’s slow query log feature to identify SQL statements that cause delays. Check if these queries have performance bottlenecks, such as complex query logic or large data processing.
Query execution plan: For delayed SQL statements, review their execution plans. Sometimes, even if the parsing time and plan generation time are normal, the execution plan itself may cause performance issues. Check for problems like full table scans, excessive joins, or subqueries.
Monitor TiDB and TiKV load: Although CPU and memory usage have not reached their limits, this does not mean there is no load. Check for I/O bottlenecks or issues in the TiKV storage layer.

translator_bot · July 11, 2024, 9:27am

| username: TiDBer_H1ifFPFW | Original post link

The IO is within the normal range, and the execution plan of the delayed SQL statements is also normal. The time analysis in the slow query log shows that all the time is spent on parse time and generate plan time.

translator_bot · July 11, 2024, 9:27am

| username: h5n1 | Original post link

What kind of SQL is it?

translator_bot · July 11, 2024, 9:27am

| username: TiDBer_H1ifFPFW | Original post link

The main operations are batch updates and point queries, but it doesn’t seem to be specifically related to SQL because the execution is very fast after the execution plan is generated. The delay occurs during parsing and generating the plan. Additionally, high-latency SQLs during the same time period all appear on the same TiDB server, regardless of the specific SQL type.

translator_bot · July 11, 2024, 9:59am

| username: TiDBer_H1ifFPFW | Original post link

Profiling found that gcbgmarkworker and other GC-related processes occupied 30% of the CPU of the tidb-server.

translator_bot · July 11, 2024, 10:03am

| username: TIDB-Learner | Original post link

High concurrency, TiDB and TiKV mixed deployment. I think it’s normal for the CPU to become high. Concurrent data and SQL data need to be compared to draw a conclusion. SQL parsing and plan generation, as well as GC processing, if there is a situation of resource contention, data becoming high is also normal in my opinion. High anomalies also need data analysis.

translator_bot · July 12, 2024, 2:49am

| username: TiDBer_H1ifFPFW | Original post link

However, monitoring shows that the CPU usage of TiDB and TiKV hasn’t even reached half of the machine’s capacity, but there is already a significant delay in SQL parsing and execution plan generation. It feels like the CPU resources are not being fully utilized.

translator_bot · July 12, 2024, 2:50am

| username: yiduoyunQ | Original post link

Don’t mix deployments in the cloud.

translator_bot · July 12, 2024, 7:52am

| username: 有猫万事足 | Original post link

Give priority to using the execution plan cache to skip the step of generating an execution plan.