The same type of SQL follows different execution plans, which is very strange

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 同一种sql类型走不同执行计划,很奇怪

| username: 路在何chu

[TiDB Usage Environment] Production Environment
[TiDB Version]
v4.0.13
[Reproduction Path] What operations were performed when the issue occurred



After adding ORDER BY id DESC LIMIT 100, the execution plan for this sort turns into a full table scan and no longer uses the index.

| username: 路在何chu | Original post link

However, if count(*) is not 0, then everything is normal.



Adding ORDER BY id DESC LIMIT 100 is also normal in that case.

| username: TiDBer_小阿飞 | Original post link

It might be caused by the WHERE, ORDER BY, or LIMIT. Sometimes ORDER BY + LIMIT changes the execution order of the original SQL: when the LIMIT is a very small fraction of the rows, the optimizer may choose to read rows in order first and then evaluate the WHERE condition. The optimizer compares the LIMIT against the estimated size of the result set the index returns for the WHERE condition to decide whether to scan the table or use the index. You can use FORCE INDEX to specify the index you want, or change ORDER BY id to ORDER BY id + 0 (see the sketch below).
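A minimal sketch of both workarounds, using a simplified single-table form of the query; the table name, index name, and filter values are hypothetical, since the original screenshots with the real schema are not included:

```sql
-- Workaround 1: force the optimizer onto the index despite ORDER BY ... LIMIT
-- (tor and idx_customer_account_symbol are placeholder names).
SELECT *
FROM tor FORCE INDEX (idx_customer_account_symbol)
WHERE customer_id = 1 AND account_id = 1 AND symbol = 'X'
ORDER BY id DESC
LIMIT 100;

-- Workaround 2: wrap the ORDER BY column in an expression so the optimizer
-- can no longer satisfy the sort by reading the primary key in order.
SELECT *
FROM tor
WHERE customer_id = 1 AND account_id = 1 AND symbol = 'X'
ORDER BY id + 0 DESC
LIMIT 100;
```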

| username: 路在何chu | Original post link

The only difference is that one can retrieve data, while the other cannot.

| username: 路在何chu | Original post link

Moreover, the statistical information was not invalid at that time.

| username: 路在何chu | Original post link

This SQL has only run a dozen times, and everything else is normal. However, this type of SQL runs thousands of times in half an hour, with a very high execution frequency.

| username: 有猫万事足 | Original post link

It seems to be a bug.

You can refer to the following documentation to capture two execution contexts (PLAN REPLAYER dumps), one normal and one abnormal, and then open an issue on GitHub.
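For reference, on versions that support it (v5.3.0 and later, as noted below), the capture looks roughly like this; the SELECT is only a placeholder standing in for the real query:

```sql
-- Dump the schema, statistics, session variables, and plan for the statement
-- into a zip file that can be attached to a GitHub issue (placeholder SQL):
PLAN REPLAYER DUMP EXPLAIN SELECT * FROM t WHERE a = 1 ORDER BY id DESC LIMIT 100;
```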

| username: h5n1 | Original post link

Just a guess: try swapping the order of tor and tod in the SQL.

| username: 路在何chu | Original post link

TiDB only introduced the PLAN REPLAYER command in v5.3.0, so that's a bit awkward on our version.

| username: 路在何chu | Original post link

It’s impossible to change the SQL, as this is a core SQL. The one that follows the wrong execution plan was executed a dozen times just yesterday.

| username: h5n1 | Original post link

You need to manually test it. As a temporary solution, you can use SPM to bind the correct execution plan.
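A rough sketch of such a binding, with placeholder table, column, and index names standing in for the real query:

```sql
-- Bind the good plan so this statement shape always uses the index
-- (all names are placeholders for the actual query):
CREATE GLOBAL BINDING FOR
    SELECT * FROM tor WHERE customer_id = 1 ORDER BY id DESC LIMIT 100
USING
    SELECT * FROM tor USE INDEX (idx_customer_account_symbol)
    WHERE customer_id = 1 ORDER BY id DESC LIMIT 100;

-- Check existing bindings, or remove it once the optimizer behaves again:
SHOW GLOBAL BINDINGS;
DROP GLOBAL BINDING FOR
    SELECT * FROM tor WHERE customer_id = 1 ORDER BY id DESC LIMIT 100;
```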

| username: 路在何chu | Original post link

It should not be a bug. I tested it in the test environment, and count(*) is 0. The execution plan is normal.

| username: TiDBer_小阿飞 | Original post link

Could it be an issue with ANALYZE? Try dropping the statistics and re-analyzing?
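If you do try it, the statements would be roughly the following (table names are placeholders for the two tables in the query):

```sql
-- Drop the possibly skewed statistics and collect them again:
DROP STATS tor;
DROP STATS tod;
ANALYZE TABLE tor;
ANALYZE TABLE tod;

-- Check how healthy the statistics look afterwards:
SHOW STATS_HEALTHY;
```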

| username: 路在何chu | Original post link

It’s too risky to do this in production. The SQL is normal now, and this situation occurred for the first time yesterday.

| username: 小龙虾爱大龙虾 | Original post link

I suspect that the optimizer estimates the costs of both options to be similar.

When there is no “ORDER BY”: The optimizer determines that using the index is optimal.

When “ORDER BY … LIMIT 100” is added: the optimizer has two choices. The first is to proceed as before and then use a TopN operator to sort and take the first 100 rows. The second is to notice that you are ordering by the “id” of the “tod” table, which can exploit the primary key’s order (assuming “tod” is a clustered table and the full scan has keep order: true; the screenshot is missing, so the original poster can confirm this). With that optimization plus the limit, it does not need to scan the whole table; it only has to keep scanning until it has fetched 100 matching rows and then stop.

During actual execution, however, it scans a lot of data from the “tod” table without finding any matches, and even after scanning the entire table it still finds none. The optimizer evaluated the two options as close in cost, so this is really a limitation of the optimizer’s model (not necessarily a bug, since the optimizer works from statistical estimates and can sometimes be inaccurate). To make sure it always uses the index, you can add a more selective condition on the “tor” table and test (see the sketch below).
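One way to see which of the two shapes the optimizer picked is to compare the estimated and actual row counts for both variants; everything below (join condition, column values) is a guess at the query’s shape, not the original poster’s SQL:

```sql
-- Look for TopN + an index range scan in one plan, versus
-- Limit + TableFullScan with keep order:true on tod in the other.
-- Note that EXPLAIN ANALYZE actually executes the statement, so the bad
-- variant will take as long as the bad plan does.
EXPLAIN ANALYZE
SELECT tod.*
FROM tod
JOIN tor ON tod.order_id = tor.id   -- hypothetical join condition
WHERE tor.customer_id = 1
  AND tor.account_id = 1
  AND tor.symbol = 'X'
ORDER BY tod.id DESC
LIMIT 100;
```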

| username: forever | Original post link

Could you please check if the production environment and the test environment have the same version, and if the method of collecting statistical information is the same?

| username: 人如其名 | Original post link

In the first execution plan, after filtering the tor table on tor.customer_id, tor.account_id, and tor.symbol, the optimizer estimated 218250.82 rows, but actual execution returned 0 rows! That same misestimate leads the optimizer, in the second execution plan, to believe the result set is large and to choose to read the tod table along its clustered primary key id to avoid sorting, joining back to the tor table so it can return results quickly. In reality, the rows matching the tor conditions will never reach 100, because the actual number of matches is 0. So the entire tod table ends up being read, and the index join against the tor table makes the execution time far too long.

Here, the optimizer’s estimate based on the statistics is seriously inconsistent with the actual result: either the statistics are too old, Count-Min Sketch hash collisions are too severe, or the combined selectivity estimate across the multiple filter columns is off.

It is recommended to re-collect the statistics first; beyond that, using hints is the most reliable choice (see the sketch below).
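A sketch of the two suggestions, again with hypothetical table, index, and join names since the real SQL isn’t shown:

```sql
-- Refresh the statistics that fed the bad estimate:
ANALYZE TABLE tor;
ANALYZE TABLE tod;

-- Pin the index on tor with an optimizer hint; because the hint lives inside
-- the statement text, it can also serve as the USING statement of a binding:
SELECT /*+ USE_INDEX(tor, idx_customer_account_symbol) */ tod.*
FROM tod
JOIN tor ON tod.order_id = tor.id   -- hypothetical join condition
WHERE tor.customer_id = 1
  AND tor.account_id = 1
  AND tor.symbol = 'X'
ORDER BY tod.id DESC
LIMIT 100;
```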

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.