SQL execution is very slow, abnormally slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: sql执行很慢,不正常的慢

| username: jiangh

The SQL execution is very slow, taking 20 minutes to produce results, but the CLUSTER_TIDB_TRX table shows the state as idle. What could be the reason?

| username: tidb菜鸟一只 | Original post link

Check the dashboard page to see where the SQL is slow…

| username: zhanggame1 | Original post link

Is the SQL slow across the board or just for specific queries?

| username: jiangh | Original post link

Could you please show the execution plan?

| username: Kongdom | Original post link

Post the execution plan and let’s take a look.

| username: jiangh | Original post link

sqlplan.txt (87.1 KB)

| username: Kongdom | Original post link

What is this step doing? According to the documentation, this step cannot be pushed down to TiKV yet. :joy:

| username: 有猫万事足 | Original post link

The data volume is only 40,000, but it looped 800 million times. There must be a problem with the SQL here. Is it possible that there is no table join, so it resulted in a Cartesian product? If no alias is used, the table name is ac_value_mapping_config. Pay special attention to this.

| username: cassblanca | Original post link

Describe the business requirement scenario and post the anonymized SQL.

| username: tidb菜鸟一只 | Original post link

It is obviously slow on the table ps_accounting.ac_value_mapping_config. Check the filter conditions and join conditions of this table.

| username: redgame | Original post link

There is no filtering, Cartesian product.

| username: 人如其名 | Original post link

  1. From the execution plan, we can see that Apply_63 indicates a subquery, so it is not a Cartesian product issue.
  2. Looking at IndexRangeScan_340, decided by [eq(ps_accounting.ac_value_mapping_config.mapping_group, czhaoshang) eq(ps_accounting.ac_value_mapping_config.mapping_code, czhaoshang_district_mapping) eq(ps_accounting.ac_value_mapping_config.in_col1_value, ps_accounting.ac_std_order_address.city_code)], we can see this is a correlated subquery. Combined with point 1, we know it is not a de-correlated subquery.
  3. From the unique index uniq_code in IndexRangeScan_340 and MaxOneRow_336, we can see this is a unique index. This is further confirmed by the actRows=41354 in Projection_64 and the max_proc_keys:1 and total_process_keys: 41354 in IndexRangeScan_340.
  4. Therefore, although this correlated subquery uses a CARTESIAN left outer join in Apply_63, the efficiency should not be excessively slow because the Build side (Projection_64) has only 41354 rows. Thus, I understand that this query should not take 20m33s to execute.
  5. From the num: 41354 in the cop_task of IndexRangeScan_340, we can see there is no issue when requesting data from TiKV. The TiKV task execution also appears normal. However, the execution time of this operator is extremely abnormal: time: 255h44m54.2s, loops: 835492652. Why are there so many loops? Normally, each Next request in the database corresponds to one loop operation, and one Next request is a chunk of data (representing one or more rows). The main issue here is why the subquery is running empty Next requests and consuming significant time. I feel this is a product issue.
| username: 有猫万事足 | Original post link

You are still impressive. :+1:

Although the final conclusion is not yet certain. It might be generated during the subquery processing, not necessarily manually. But the existence of the Cartesian product is certain.

For Joins that generate Cartesian products, they will be marked as CARTESIAN in the execution plan.

| username: Fly-bird | Original post link

Is it resolved?

| username: 人如其名 | Original post link

I understand that for correlated subqueries, the CARTESIAN label is added because the actual execution uses a correlation algorithm (similar to lookup join), but it is a subquery rather than a join (only after decorrelation does it take the form of a join), so the on condition is empty, and the execution plan adds the CARTESIAN label. A normal join with CARTESIAN indicates a true Cartesian product has occurred (for two reasons: 1. missing on condition, 2. decorrelated correlated subqueries of anti-semi joins without a not null definition also result in a Cartesian product, which has been optimized in version 6.3 with the addition of Null-Aware Semi Join capability, but it is not enabled by default because TiFlash does not support it. It should be supported in version 7.1. 用 EXPLAIN 查看子查询的执行计划 | PingCAP 文档中心, related issue post: 一个 tidb 无索引查询性能问题请教 - TiDB 的问答社区, refer to the best answer by Amao).

There is also a “partial” Cartesian product that is not labeled as CARTESIAN in the execution plan. For example, if some data matches are severely many-to-many, it may not be apparent in the execution plan, but a Cartesian product will occur during internal execution.

| username: Kongdom | Original post link

We can only wait for the original poster to post the sentence and take a look.