Significant Performance Differences for the Same Query Between v6 and v5 Clusters

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 相同语句在v6和v5集群里性能差异巨大

| username: 扬仔_tidb

[TiDB Usage Environment] Production Environment
[TiDB Version]
Old Cluster v5.3.0
New Cluster v6.5.5
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact]
We are currently migrating data from the old v5 cluster to the new v6 cluster, using Lightning + TiCDC.
Issue: SQL statements generally run 2-5 times slower on the v6 cluster than on the v5 cluster.

  1. SQL Statement
    This SQL is just an example; the more complex SQLs are not shown here, but the new cluster is slower on all of them.
select
  *
from
  cart.cart
where
  gmt_create >= DATE(NOW()) - INTERVAL 7 DAY
  and is_delete = 0
  and sub_biz_type in (301, 101, 341, 348)
  2. Execution Plan in the old v5 cluster
  3. Execution Plan in the new v6 cluster

    [Resource Configuration]
    tidb-server: 2 nodes, 16C/32G
    tikv: 4 nodes, 16C/32G
    tiflash: 3 nodes, 16C/32G
    [Attachments: Screenshots/Logs/Monitoring]
    Which parameters should be adjusted after upgrading to v6? Otherwise the upgrade leaves us in an awkward position.
| username: 像风一样的男子 | Original post link

The execution plan in v6 has chosen to use TiFlash. Try running an analyze on these tables first.
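For example, a minimal sketch of refreshing statistics on the table from this thread and re-checking which plan the optimizer picks:

```sql
-- Refresh statistics so the optimizer has up-to-date row counts.
ANALYZE TABLE cart.cart;

-- Re-check which engine and access path the optimizer now chooses.
EXPLAIN
SELECT *
FROM cart.cart
WHERE gmt_create >= DATE(NOW()) - INTERVAL 7 DAY
  AND is_delete = 0
  AND sub_biz_type IN (301, 101, 341, 348);
```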

| username: 有猫万事足 | Original post link

In 6.5 the optimizer switched to TiFlash MPP. Although the v5 cluster read from TiKV, it used an index, which kept its performance up. If the longer execution time is simply due to switching from TiKV to TiFlash, that is relatively easy to resolve.

There are several ways to restore the original execution plan.

  1. You can try turning off MPP (see the sketch after this list).
    Use MPP Mode | PingCAP Docs

  2. Set engine isolation.
    Use TiDB to Read TiFlash | PingCAP Docs
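A minimal sketch of both options, using the documented session variables tidb_allow_mpp and tidb_isolation_read_engines:

```sql
-- Option 1: turn off MPP for this session so the optimizer can
-- fall back to TiKV index plans.
SET SESSION tidb_allow_mpp = OFF;

-- Option 2: engine isolation, i.e. do not read from TiFlash at all.
SET SESSION tidb_isolation_read_engines = 'tikv,tidb';

-- Either variable can also be set with SET GLOBAL to affect new
-- sessions cluster-wide.
```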

Of course, if you prefer to read from TiFlash, the MPP mode performs better with more machines. If you have the resources, you might as well try running more TiFlash instances.

| username: 芮芮是产品 | Original post link

Thank you for the lesson.

| username: Kongdom | Original post link

:thinking: If the table structure and indexes are consistent, v6 shouldn’t be using TiFlash, right? Try running ANALYZE TABLE to check.

| username: 扬仔_tidb | Original post link

Thank you all for your answers.

I have now analyzed the tables, and the results are as follows:

In the new cluster with tidb_cost_model_version=2, executing complex SQL with left join and subquery combinations takes about 4-5 seconds, while the old cluster takes around 2 seconds. The main difference in the execution plan is that the new cluster queries a much larger amount of data through TiFlash.
If the new cluster is forced to use MPP before executing the SQL, the time is about 2-3 seconds, still not as fast as the old cluster.

Next steps:

This cluster is used by the BI/statistics team, and the new cluster has added TiFlash; memory usage on the old cluster's TiKV nodes reaches 90% and frequently triggers alerts.
If nothing else works, we can only set tidb_cost_model_version back to 1 for the developers.
One thing is still unclear: is the tidb_cost_model_version=2 cost model mature in v6? As for the execution-plan changes between the new and old clusters, is there an official explanation, or has anyone compared the performance of the same SQL across old and new versions?
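For comparison, a minimal sketch of switching a session back to the v1 cost model and re-checking the plan, using the example query from this thread (tidb_cost_model_version is a documented TiDB system variable):

```sql
-- Fall back to the v1 cost model for this session only,
-- then compare the plan and runtime against the default (2).
SET SESSION tidb_cost_model_version = 1;

EXPLAIN ANALYZE
SELECT *
FROM cart.cart
WHERE gmt_create >= DATE(NOW()) - INTERVAL 7 DAY
  AND is_delete = 0
  AND sub_biz_type IN (301, 101, 341, 348);

-- Restore the v6.5 default afterwards.
SET SESSION tidb_cost_model_version = 2;
```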

| username: oceanzhang | Original post link

Have you considered rewriting the query???

| username: 扬仔_tidb | Original post link

The development team won't accept that: if all the statements have to be rewritten, they would rather not migrate.

| username: Kongdom | Original post link

:joy: In the end the DBA takes on everything~ Better to escalate the conflict upward.

| username: 扬仔_tidb | Original post link

SQL statement, execution plan for the new cluster, and execution plan for the old cluster (screenshots).

How can we optimize this SQL to make it run faster on version 6 compared to version 5?

| username: 像风一样的男子 | Original post link

Why is the data volume of the cart table different in your two versions, and why is there such a big difference between estRows and actRows in version 5?

| username: 扬仔_tidb | Original post link

Currently, I have found that replacing the time filter DATE(sysdate()) - INTERVAL 7 DAY with a specific timestamp, i.e. without any function calls, gives:
New cluster: within 2 seconds
Old cluster: around 2 seconds
(Referencing another expert's post.)

Conclusion

The time functions prevent the index from being used.
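A minimal sketch of that rewrite, assuming the application computes the cut-off time and passes it in as a literal; the timestamp below is only an illustrative value:

```sql
-- Same predicate as the example query, but with the boundary passed
-- as a constant instead of DATE(sysdate()) - INTERVAL 7 DAY, so the
-- optimizer can use a range scan on the gmt_create index.
SELECT *
FROM cart.cart
WHERE gmt_create >= '2023-11-20 00:00:00'  -- illustrative value only
  AND is_delete = 0
  AND sub_biz_type IN (301, 101, 341, 348);
```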

| username: 扬仔_tidb | Original post link

The data is actually the same and is kept in sync by TiCDC.
The difference between estRows and actRows in version 5 is because that table has not been analyzed for a long time in version 5, while in version 6 it was analyzed just last night.

| username: 像风一样的男子 | Original post link

After analyzing, are the execution plan and execution time the same?

| username: andone | Original post link

ANALYZE TABLE

| username: 有猫万事足 | Original post link

Add these tables into TiFlash, then enable forced MPP, and see how long it takes.

The main time cost is in TiFlash filtering 2 million matching rows out of 370 million. Running the computation directly with MPP on these four tables might actually be faster.
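A minimal sketch of that suggestion, shown only for cart.cart since the other joined tables are not named in the thread; the replica count of 2 is an assumption:

```sql
-- Create a TiFlash replica for the table (repeat for the other
-- tables involved in the join; replica count is an assumption).
ALTER TABLE cart.cart SET TIFLASH REPLICA 2;

-- Wait until the replica is available before benchmarking.
SELECT TABLE_SCHEMA, TABLE_NAME, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'cart';

-- Force the MPP engine for this session and re-run the query.
SET SESSION tidb_allow_mpp = ON;
SET SESSION tidb_enforce_mpp = ON;
```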

| username: swino | Original post link

Benefited a lot

| username: oceanzhang | Original post link

Has it been resolved? Please share with us.