Poor performance when executing IN condition queries on a table with 40 million rows and a size of 10GB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 表格数量在 4000 万,大小为 10G ,执行 IN 条件查询,性能差

| username: withseid

Last Friday around 11 AM, I noticed high latency in the cluster on the dashboard: the 99.9th percentile latency was 6 seconds and the 99th percentile latency was 3 seconds.

Upon checking the slow SQL, I found that queries on one of the tables were causing the issue. The SQL is very simple, just an ordinary IN condition query, and EXPLAIN ANALYZE shows that it uses an index.

Detailed explain analyze information
explain.txt (144.9 KB)
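
For reference, the statement was roughly of this shape (a hypothetical reconstruction; the table and column names below are placeholders, not the actual schema):

```sql
-- Hypothetical reconstruction only: `orders` and `order_id` are placeholder
-- names, not the real schema. The real statement reportedly carried several
-- hundred to about two thousand constants in the IN list.
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE order_id IN (1001, 1002, 1003 /* ... up to ~2000 IDs ... */);
```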

The cluster has 3 TiKV, 3 TiDB, and 3 PD nodes, co-deployed with NUMA binding: TiKV is bound to NUMA node 0 (32 vCores, 64 GB), and TiDB and PD are deployed together on NUMA node 1 (32 vCores, 64 GB).

Similar issues have occurred before: when there is a large number of IN condition queries against large tables (hundreds of millions of rows), with 500 to 2,000 IDs per IN list, system latency tends to be high.

| username: h5n1 | Original post link

Has the QPS or the concurrency of this SQL increased? Check the resource usage while it is slow. How fast does this SQL run when executed on its own after the system returns to normal?
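
One way to check the concurrency of this statement around the incident (a sketch; it assumes the statement summary tables are enabled, which is the default in v5.x, and the time filter is a placeholder to replace with the incident window):

```sql
-- Sketch: rank statement digests by execution count around the incident.
-- SUMMARY_BEGIN_TIME value below is a placeholder, not the real window.
SELECT SUMMARY_BEGIN_TIME, INSTANCE, EXEC_COUNT, AVG_LATENCY, DIGEST_TEXT
FROM INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY_HISTORY
WHERE LOWER(DIGEST_TEXT) LIKE '%in (%'
  AND SUMMARY_BEGIN_TIME >= '2022-01-01 10:30:00'
ORDER BY EXEC_COUNT DESC
LIMIT 20;
```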

| username: withseid | Original post link

When executed individually, it is relatively fast and doesn’t take that long.

| username: wuxiangdong | Original post link

This statement had high concurrency at that point; the CPU was probably under heavy pressure, which caused the delays.

| username: withseid | Original post link

The latency of individual executions is normal now.

Last Friday during this time period, the cluster’s QPS was also relatively low, less than 1k, but there were quite a few IN condition queries, all similar SQLs.

| username: withseid | Original post link

During that period, the TiKV CPU was quite busy.

| username: h5n1 | Original post link

See "Use PingCAP Clinic Diag to collect SQL query plan information" in the PingCAP documentation center. Install Clinic and collect data covering the sudden latency spike and the preceding half hour.

| username: Jiawei | Original post link

Check the TiDB Dashboard to see which part of the slow SQL is taking the most time, and then look at the corresponding module’s monitoring based on that point to locate the issue.
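
The same breakdown can also be pulled with SQL from the slow query table (a sketch using v5.x slow-log field names; the time range is a placeholder to replace with the actual incident window):

```sql
-- Sketch: compare total query time vs. coprocessor processing and wait
-- (queue) time for the slow statements in the incident window.
SELECT Time, Query_time, Process_time, Wait_time, Cop_proc_avg, Cop_wait_avg, Query
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Time BETWEEN '2022-01-01 10:30:00' AND '2022-01-01 11:30:00'  -- placeholder
ORDER BY Query_time DESC
LIMIT 20;
```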

| username: withseid | Original post link

The diag component here can only be updated to v0.3.2, but the documentation requires v0.7.3. TiUP was installed offline.

| username: withseid | Original post link

The TiDB cluster here is v5.3.0. Do I need to upgrade it to use the diag client?

| username: Jiawei | Original post link

Check for anomalies in the Coprocessor panels of TiKV-Details, but it's still best to follow the earlier suggestion and collect Clinic data to make troubleshooting easier.

| username: buddyyuan | Original post link

Check the Thread CPU in tikv-details. Is it reaching a bottleneck?

| username: h5n1 | Original post link

You can first try whether this version can collect: `tiup diag collect <cluster-name> -f <start-time> -t <end-time>`. If that doesn't work, download a newer TiDB offline package, run `tiup mirror set /path-to-offline-package`, and then update diag (you may also need to update tiup).

| username: withseid | Original post link

Okay, I’ll give it a try.

| username: withseid | Original post link

Here are two files, which contain information on Thread CPU and Coprocessor Detail for the respective time periods.

| username: buddyyuan | Original post link

The unified thread pool has reached a bottleneck.

| username: withseid | Original post link

So, should this be done through parameter tuning or do we need to add more machines?
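
For reference, a minimal sketch of what the parameter-tuning route could look like, assuming `readpool.unified.max-thread-count` is the relevant TiKV knob and that it is online-configurable in this version (verify against the v5.3 docs before changing anything; raising the pool only helps if the TiKV NUMA node actually has spare CPU):

```sql
-- Sketch only: inspect the current unified read pool size and, if CPU
-- headroom exists on the TiKV nodes, raise it. The value 20 is a
-- placeholder for illustration, not a recommendation.
SHOW CONFIG WHERE type = 'tikv' AND name = 'readpool.unified.max-thread-count';
SET CONFIG tikv `readpool.unified.max-thread-count` = 20;
```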

| username: withseid | Original post link

This is the CPU usage of the Unified Read Pool over the last 30 minutes. It is very unbalanced: one TiKV node is very high while the other two are very low. The cluster uses HAProxy for load balancing.
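
Worth noting: HAProxy only balances client connections across the TiDB servers; which TiKV store does the coprocessor work depends on where the region leaders for the accessed data live, so uneven read-pool CPU usually points at a read hotspot rather than at the proxy. A quick check (a sketch; `TIDB_HOT_REGIONS` is available in v5.x):

```sql
-- Sketch: list current read hotspots; hot regions/leaders concentrated on one
-- store would explain a single TiKV node's read pool running hot.
SELECT *
FROM INFORMATION_SCHEMA.TIDB_HOT_REGIONS
WHERE TYPE = 'read'
ORDER BY FLOW_BYTES DESC
LIMIT 20;
```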

| username: 人如其名 | Original post link

I agree. Looking at the execution information of that slow SQL, most of the time appears to be spent waiting in the queue; the requests are being queued at the unified read pool.