Performance is unsatisfactory after switching business from 4.0.10 cluster to 7.5.1 cluster

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 业务从 4.0.10 集群切换到 7.5.1 集群后性能不理想

| username: rebelsre

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version] 7.5.1
[Reproduction Path]

SELECT
  *
FROM
  xxx
WHERE
  next_run_time <= ?
  AND id > ?
  AND account_monitor_type_id = ?
  AND remark = ?
  AND state = ?
ORDER BY
  id
LIMIT
  ? [arguments: (1717055486, 0, "11", "3", 1, 1000)];

[Encountered Problem: Problem Phenomenon and Impact]
This SELECT statement keeps showing up as a slow query, driving up the overall P99 latency.
[Resource Configuration]


[Attachments: Screenshots/Logs/Monitoring]
Execution Plan

| id | estRows | estCost | actRows | task | access object | execution info | operator info | memory | disk |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Limit_13                     | 1000.00   | 393430501.48 | 403      | root      |                                                                                            | time:1m33.2s, loops:2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | offset:0, count:1000                                                                                                                                                                                                     | N/A     | N/A   |
| └─IndexLookUp_47             | 1000.00   | 393430501.48 | 403      | root      |                                                                                            | time:1m33.2s, loops:2, index_task: {total_time: 1m32.8s, fetch_handle: 1.74s, build: 3.23s, wait: 1m27.8s}, table_task: {total_time: 7m45.2s, num: 657, concurrency: 5}, next: {wait_index: 8.32ms, wait_table_lookup_build: 402.1µs, wait_table_lookup_resp: 1m33.2s}                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                                                                                                                                                                          | 24.4 MB | N/A   |
|   ├─IndexRangeScan_44(Build) | 200660.78 | 40834468.62  | 13366428 | cop[tikv] | table:t_rhino_monitor_wemedia_user, index:account_monitor_type_id(account_monitor_type_id) | time:68.9ms, loops:13098, cop_task: {num: 452, max: 153.5ms, min: 1.04ms, avg: 14.9ms, p95: 74.7ms, max_proc_keys: 50144, p95_proc_keys: 50144, tot_proc: 4.76s, tot_wait: 103ms, rpc_num: 452, rpc_time: 6.72s, copr_cache_hit_ratio: 0.67, build_task_duration: 205.6µs, max_distsql_concurrency: 2}, tikv_task:{proc max:134ms, min:0s, avg: 34.2ms, p80:58ms, p95:73ms, iters:14847, tasks:452}, scan_detail: {total_process_keys: 4180992, total_process_keys_size: 192325632, total_keys: 4918093, get_snapshot_time: 32.2ms, rocksdb: {delete_skipped_count: 68, key_skipped_count: 4917946, block: {cache_hit_count: 7466, read_count: 1, read_byte: 16.0 KB, read_time: 56.4µs}}}                 | range:(11 0,11 +inf], keep order:true                                                                                                                                                                                    | N/A     | N/A   |
|   └─Selection_46(Probe)      | 1000.00   | 91856163.66  | 403      | cop[tikv] |                                                                                            | time:7m42.6s, loops:727, cop_task: {num: 689, max: 10s, min: 4.59ms, avg: 675.4ms, p95: 4.89s, max_proc_keys: 20936, p95_proc_keys: 20480, tot_proc: 3m45.3s, tot_wait: 172.9ms, rpc_num: 689, rpc_time: 7m45.3s, copr_cache_hit_ratio: 0.00, build_task_duration: 218.5ms, max_distsql_concurrency: 2}, tikv_task:{proc max:9.95s, min:3ms, avg: 658.5ms, p80:619ms, p95:4.84s, iters:16451, tasks:689}, scan_detail: {total_process_keys: 13366428, total_process_keys_size: 1576436660, total_keys: 17280280, get_snapshot_time: 63.6ms, rocksdb: {delete_skipped_count: 460, key_skipped_count: 14413892, block: {cache_hit_count: 21214777, read_count: 22, read_byte: 574.4 KB, read_time: 4.68ms}}} | eq(dt_datastory_rhino_kol.t_rhino_monitor_wemedia_user.remark, "3"), eq(dt_datastory_rhino_kol.t_rhino_monitor_wemedia_user.state, 1), le(dt_datastory_rhino_kol.t_rhino_monitor_wemedia_user.next_run_time, 1717055486) | N/A     | N/A   |
|     └─TableRowIDScan_45      | 200660.78 | 61817244.69  | 13366428 | cop[tikv] | table:t_rhino_monitor_wemedia_user                                                         | tikv_task:{proc max:9.94s, min:3ms, avg: 654.9ms, p80:615ms, p95:4.84s, iters:16451, tasks:689}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | keep order:false                                                                                                                                                                                                         | N/A     | N/A   |

| username: kkpeter | Original post link

Your index is unreasonable if it has to scan 10 million rows just to filter down to 400.

| username: 我是人间不清醒 | Original post link

Does next_run_time have an index? After hitting the index on account_monitor_type_id, there are still 10 million records.

| username: TiDB_C罗 | Original post link

Take a look at the table structure.

| username: TiDBer_C33 | Original post link

The index selectivity is too poor; it's best to create a composite index.
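
For example, a minimal sketch (the index name is illustrative; verify the column order against the actual data distribution, and note that remark is a TEXT column, so it would need a prefix length if you want to include it):

CREATE INDEX idx_type_state_next_run
    ON t_rhino_monitor_wemedia_user (account_monitor_type_id, state, next_run_time);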

| username: rebelsre | Original post link

+-------------------------+------------------+------+------+-------------------+-----------------------------------------------+
| Field                   | Type             | Null | Key  | Default           | Extra                                         |
+-------------------------+------------------+------+------+-------------------+-----------------------------------------------+
| id                      | bigint(20)       | NO   | PRI  | NULL              | auto_increment                                |
| init_seed_value         | varchar(255)     | NO   | MUL  | NULL              |                                               |
| interval                | int(10) unsigned | YES  |      | NULL              |                                               |
| next_run_time           | int(10) unsigned | YES  | MUL  | NULL              |                                               |
| account_monitor_type_id | int(11)          | YES  | MUL  | NULL              |                                               |
| state                   | tinyint(4)       | YES  |      | 1                 |                                               |
| create_date             | datetime         | YES  |      | CURRENT_TIMESTAMP |                                               |
| update_date             | datetime         | YES  |      | CURRENT_TIMESTAMP | DEFAULT_GENERATED on update CURRENT_TIMESTAMP |
| lose_tolerance_level    | tinyint(4)       | YES  |      | NULL              |                                               |
| remark                  | text             | YES  |      | NULL              |                                               |
| is_dynamic              | tinyint(4)       | YES  |      | 1                 |                                               |
| last_run_time           | int(10) unsigned | YES  |      | NULL              |                                               |
| augment_date            | varchar(255)     | YES  |      | NULL              |                                               |
+-------------------------+------------------+------+------+-------------------+-----------------------------------------------+

| username: TiDB_C罗 | Original post link

“SHOW CREATE TABLE xxx” is better; it includes the index definitions.
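
For example, for the table shown in the execution plan above:

SHOW CREATE TABLE t_rhino_monitor_wemedia_user;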

| username: caiyfc | Original post link

Index issue: TiKV scanned 13,366,428 keys through the index account_monitor_type_id, which is why the Unified Read Pool CPU is so high. Plan the index properly so that far fewer keys are scanned and the problem will be solved.

| username: TiDB_C罗 | Original post link

You can try rebuilding the index. Make sure to create the new index before dropping the old one, and also check whether any queries in the application reference the old index by name (a sketch follows the list):

  1. Create a new index on the same column with a different name.
  2. Then drop the old index.
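
A minimal sketch, assuming the index to rebuild is the account_monitor_type_id index from the execution plan (the new index name is illustrative):

-- create the replacement first so the query never loses index coverage
CREATE INDEX idx_account_monitor_type_id_v2
    ON t_rhino_monitor_wemedia_user (account_monitor_type_id);
-- then drop the old one
ALTER TABLE t_rhino_monitor_wemedia_user DROP INDEX account_monitor_type_id;
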
| username: 有猫万事足 | Original post link

Only this condition hit the index:

AND account_monitor_type_id = ?

The remaining conditions are mostly filtered in the Selection operator.

| username: h5n1 | Original post link

  1. Index issue: check whether a better index is available.
  2. Limit not pushed down: it might be because some parameters were not enabled after the upgrade, or because of a statistics issue. Try manually executing the query with a SELECT /*+ LIMIT_TO_COP() */ hint (see the example after this list).
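
A sketch of that hint applied to the original statement, with the bound parameters filled in from the slow-log arguments above:

SELECT /*+ LIMIT_TO_COP() */ *
FROM t_rhino_monitor_wemedia_user
WHERE next_run_time <= 1717055486
  AND id > 0
  AND account_monitor_type_id = '11'
  AND remark = '3'
  AND state = 1
ORDER BY id
LIMIT 1000;
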
| username: Jellybean | Original post link

As the experts above mentioned, even after filtering through the index it still has to look up 13.36 million rows back in the table, which is obviously slow with that much data.
The original poster should check whether the index usage is reasonable. If the right index is not being used, try analyzing the table or forcing the index with FORCE INDEX (a sketch follows). If the index is being used correctly, then this issue should have existed before the upgrade as well; you can expand the monitoring panels to confirm, and the index still needs optimizing (add an index or re-specify it).
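
A minimal sketch of both suggestions (the index name is the one from the execution plan, shown only to illustrate the syntax):

ANALYZE TABLE t_rhino_monitor_wemedia_user;
-- to pin an index, add FORCE INDEX after the table name in the original query:
-- FROM t_rhino_monitor_wemedia_user FORCE INDEX (account_monitor_type_id)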

| username: rebelsre | Original post link

This time we are setting up a brand new 7.5.1 cluster, not upgrading from an old one. We switched to the new cluster through CDC incremental migration. Indeed, the issue existed in the old cluster as well, but the latency wasn’t as high back then. After the migration, the latency roughly doubled. We have performed the analyze operation, but the effect is not significant. We have added a composite index and are currently monitoring the situation.

| username: zhaokede | Original post link

Is there no problem with the same SQL in the older version?

| username: zhaokede | Original post link

Are the execution plans the same?

| username: TiDBer_C33 | Original post link

There are only about 400 matching rows in total, so LIMIT 1000 won’t be of much use, right?

| username: tidb菜鸟一只 | Original post link

delete_skipped_count and key_skipped_count reflect deleted data that has not yet been garbage collected (GC’d) and compacted. Until it is GC’d and compacted, TiDB’s MVCC mechanism means it still gets scanned; although those versions are not returned, scanning them still takes time. Once GC and compaction have cleaned up this garbage data and it is no longer scanned, the query will become faster.

The main reason your SQL is slow now is that the new cluster has too much data that hasn’t been compacted yet. Run the same SQL on the old cluster and check whether its key_skipped_count is much lower than on the new cluster. If possible, trigger a compaction on the new cluster, targeting just this table.
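
One way to check whether old versions are eligible for cleanup yet is to look at the GC safe point and settings recorded in the mysql.tidb table (a read-only sketch):

SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_last_run_time', 'tikv_gc_safe_point');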

| username: rebelsre | Original post link

It’s TiKV, not TiFlash. I checked, and this compact can only be used on TiFlash, right?

| username: tidb菜鸟一只 | Original post link

The ALTER TABLE ... COMPACT command is only for TiFlash; for TiKV you can only script it. Here is something similar you can adapt.

The shell below compacts just the oltp.sbtest8 table (replace the database and table name with your own), passing -d kv with -c write and -c default:

# collect the Region IDs of the target table; -N (--skip-column-names) keeps the header row out of the list
mysql -uroot -pXXX -hxxx -PXXX -N information_schema -e "select region_id from tikv_region_status where db_name='oltp' and table_name='sbtest8'" > region_list

# compact the write and default column families of each Region
# (use the tikv-ctl version matching your cluster, e.g. ctl:v7.5.1 instead of ctl:v5.1.0)
cat region_list | while read line
do
    tiup ctl:v5.1.0 tikv --host xxxx:20160 compact -r $line -d kv -c write --threads 1 --bottommost force
    tiup ctl:v5.1.0 tikv --host xxx:20160 compact -r $line -d kv -c default --threads 1 --bottommost force
done