Performance Issues in 6.5.3-TiKV

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 6.5.3-tikv性能问题

| username: magongyong

【TiDB Usage Environment】Production Environment
【TiDB Version】New version v6.5.3, old version v5.4.3
【Reproduction Path】Operations performed that led to the issue
The two clusters are t1 (the main cluster) and t2 (the secondary cluster); t1 runs 5.4.3 and t2 runs 6.5.3.
After the upgrade and switchover, the same query became slower on the new-version cluster. It is a batch data query. As shown below, the query takes more than 6 seconds on the new cluster, while it takes only a little over 1 second on the old cluster:
New cluster

Old cluster

The parameter differences between the two clusters are as follows:

Server performance differences:
The old cluster is somewhat better, with 12 servers and 728 CPU cores in total.
The new cluster is somewhat worse, but not by much, with 12 servers and 632 CPU cores in total.

【Encountered Issue: Problem Phenomenon and Impact】
What could be the reason? Why is there a roughly 5-fold difference in query performance? Has the read mechanism changed in the new version? We have run multiple tests.

【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】

| username: 像风一样的男子 | Original post link

Are the execution plans for these two versions the same?
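
For example, something along these lines can be run on both clusters to capture comparable plans; the table name and filter below are placeholders, not from this thread:

```sql
-- Run the same statement on both clusters and compare operator times,
-- cop task counts, and the rpc_num / rpc_time figures in the output.
EXPLAIN ANALYZE
SELECT *
FROM t_order                       -- placeholder table name
WHERE create_time >= '2023-07-01'  -- placeholder filter
ORDER BY id
LIMIT 100000, 1000;                -- placeholder paging clause
```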

| username: 路在何chu | Original post link

Take a look at the number of regions and CPU usage for the two clusters.
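
The region count can be read from the built-in system tables, for example (a minimal sketch; the table name in the second statement is a placeholder):

```sql
-- Total number of regions known to this cluster.
SELECT COUNT(*) AS region_count
FROM information_schema.TIKV_REGION_STATUS;

-- Regions belonging to one specific table.
SHOW TABLE t_order REGIONS;
```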

| username: magongyong | Original post link

The execution plans are the same.

New cluster

Old cluster

| username: magongyong | Original post link

The cluster load is very low, and queries are slow even during off-peak business hours.
The two clusters have a master-slave relationship, with consistent data and a similar number of regions.

| username: 像风一样的男子 | Original post link

Are the data volumes of this table consistent across the two clusters? The estimated row counts (estRows) of the two SQL operators are different. Try running ANALYZE on these two tables and check again.
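
For example (the table name is a placeholder):

```sql
-- Refresh the statistics on both clusters, then re-check the execution plan.
ANALYZE TABLE t_order;

-- The Healthy column should be close to 100 once statistics are up to date.
SHOW STATS_HEALTHY;
```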

| username: h5n1 | Original post link

Upload the two complete execution plans, both old and new.

| username: magongyong | Original post link

New Cluster

Old Cluster

| username: h5n1 | Original post link

Is the TiKV CPU busy? Check the TiKV-detail → thread CPU.

| username: magongyong | Original post link

The CPU usage is very low, almost idle. This was checked during an off-peak business period.

| username: magongyong | Original post link

I would like to know if others are seeing this as well. Do deep pagination queries slow down in version 6.5.3, or is it just our cluster that has the issue?
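
For reference, a minimal sketch of the deep-pagination pattern in question, together with a keyset-style rewrite that avoids scanning the skipped rows; table and column names are placeholders, not taken from this thread:

```sql
-- Deep pagination: the large offset forces the scan to read and discard many rows.
SELECT * FROM t_order
ORDER BY id
LIMIT 1000 OFFSET 500000;

-- Keyset-style alternative: remember the last id of the previous page and seek past it.
SELECT * FROM t_order
WHERE id > 500000          -- last id seen on the previous page (placeholder value)
ORDER BY id
LIMIT 1000;
```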

| username: Vincent_Wang | Original post link

I looked at the execution plan, and there’s a difference:
Old cluster 5.4.3: rpc_num: 7, rpc_time: 6.03s
New cluster 6.5.3: rpc_num: 303, rpc_time: 12.5s
The rpc_num has increased significantly. How can this be optimized?
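
One factor that may be worth ruling out here (an assumption on my part, not something confirmed in this thread) is coprocessor paging: according to the TiDB docs, tidb_enable_paging defaults to ON from v6.3 onward, and paging splits a large coprocessor request into smaller paged requests, which can show up as a higher rpc_num. A quick session-level A/B check:

```sql
-- Check whether coprocessor paging is enabled on the new cluster.
SHOW VARIABLES LIKE 'tidb_enable_paging';

-- Turn it off for the current session only, then re-run the slow statement
-- with EXPLAIN ANALYZE and compare rpc_num / rpc_time against the earlier run.
SET SESSION tidb_enable_paging = OFF;
```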