Upgrading to version 7.1.0 results in slower queries, with copr_cache_hit_ratio dropping by two orders of magnitude

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 升级7.1.0 版本查询变慢,copr_cache_hit_ratio降低2个数量级

| username: TiDBer_yyy

Bug Report
After upgrading the cluster to 7.1.0, previously simple SQL queries run noticeably slower, often taking twice as long or more. Comparing the execution plans shows that total_process_keys increased by two orders of magnitude.
[TiDB Version] 7.1.0
[Impact of the Bug] The overall task scheduling of the business is affected.

[Possible Steps to Reproduce the Issue]

[Observed Unexpected Behavior] Execution plan in version 7

| id                         | estRows      | actRows   | task      | access object                                                               | execution info | operator info | memory | disk |
| -------------------------- | ------------ | --------- | --------- | --------------------------------------------------------------------------- | -------------- | ------------- | ------ | ---- |
| StreamAgg_20               | 1.00         | 1         | root      |                                                                             | time:10.1s, loops:2, RU:525106.464870 | funcs:count(Column#10)->Column#5 | 4.19 KB   | N/A  |
| └─IndexReader_21           | 1.00         | 405       | root      |                                                                             | time:10.1s, loops:2, cop_task: {num: 405, max: 1.03s, min: 301.6µs, avg: 364.4ms, p95: 617.5ms, max_proc_keys: 1177355, p95_proc_keys: 1139515, tot_proc: 2m17.8s, tot_wait: 168.5ms, rpc_num: 405, rpc_time: 2m27.6s, copr_cache_hit_ratio: 0.09, build_task_duration: 1.05ms, max_distsql_concurrency: 15}                                                                                            | index:StreamAgg_8                | 721 Bytes | N/A  |
|   └─StreamAgg_8            | 1.00         | 405       | cop[tikv] |                                                                             | tikv_task:{proc max:1.03s, min:170ms, avg: 397.6ms, p80:506ms, p95:617ms, iters:313964, tasks:405}, scan_detail: {total_process_keys: 293431895, total_process_keys_size: 31397210713, total_keys: 293432280, get_snapshot_time: 125.4ms, rocksdb: {delete_skipped_count: 228, key_skipped_count: 296859919, block: {cache_hit_count: 918691, read_count: 5, read_byte: 104.4 KB, read_time: 73.8µs}}}  | funcs:count(1)->Column#10        | N/A       | N/A  |
|     └─IndexFullScan_19     | 321297271.00 | 321297993 | cop[tikv] | table:dws_datav_new_register_user_real, index:uniq_card(CH_ID_CARD_ENCRYPT) | tikv_task:{proc max:1.03s, min:170ms, avg: 396.6ms, p80:503ms, p95:616ms, iters:313964, tasks:405}     

[Expected Behavior] Execution plan in version 5

id                    	task     	estRows  	operator info                                                                                	actRows  	execution info                                                                                                                                                                                                                                                                          	memory 	disk
	StreamAgg_20          	root     	1        	funcs:count(Column#10)->Column#5                                                             	1        	time:584.6ms, loops:2                                                                                                                                                                                                                                                                   	4.18 KB	N/A
	└─IndexReader_21      	root     	1        	index:StreamAgg_8                                                                            	405      	time:584.6ms, loops:2, cop_task: {num: 405, max: 559.3ms, min: 221.3µs, avg: 5.65ms, p95: 10.8ms, max_proc_keys: 1128106, p95_proc_keys: 0, tot_proc: 1.51s, tot_wait: 488ms, rpc_num: 405, rpc_time: 2.28s, copr_cache_hit_ratio: 0.99}                                               	1.17 KB	N/A
	  └─StreamAgg_8       	cop[tikv]	1        	funcs:count(1)->Column#10                                                                    	405      	tikv_task:{proc max:1.02s, min:170ms, p80:492ms, p95:673ms, iters:311496, tasks:405}, scan_detail: {total_process_keys: 2852371, total_keys: 2852374, rocksdb: {delete_skipped_count: 9, key_skipped_count: 2852380, block: {cache_hit_count: 4501, read_count: 0, read_byte: 0 Bytes}}}	N/A    	N/A
	    └─IndexFullScan_19	cop[tikv]	318769361	table:dws_datav_new_register_user_real, index:uniq_card(CH_ID_CARD_ENCRYPT), keep order:false	318769817	tikv_task:{proc max:1.02s, min:169ms, p80:490ms, p95:672ms, iters:311496, tasks:405}, scan_detail: {total_process_keys: 0, total_keys: 0, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 0, read_count: 0, read_byte: 0 Bytes}}}                     	N/A    	N/A
| username: 有猫万事足 | Original post link

The cache hit ratio also differs by an order of magnitude.

Check whether there have been any changes to the cache configuration.

| username: yiduoyunQ | Original post link

key_skipped_count: 296859919 indicates that the data update frequency is too high.

Reference: Slow Query Log | PingCAP Documentation Center
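
If you want to see which statements are skipping the most tombstones, here is a hedged sketch that assumes the Rocksdb_* fields exposed in INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY (they mirror the slow-query-log fields described in that document); adjust the connection parameters and the time window to your cluster:

# Rank recent slow queries by how many deleted (tombstone) keys RocksDB skipped.
mysql -uroot -pXXX -hxxx -PXXX information_schema -e "
  SELECT Time, Query_time, Process_keys, Total_keys,
         Rocksdb_delete_skipped_count, Rocksdb_key_skipped_count,
         LEFT(Query, 80) AS query_head
  FROM   CLUSTER_SLOW_QUERY
  WHERE  Time > NOW() - INTERVAL 1 DAY
  ORDER  BY Rocksdb_key_skipped_count DESC
  LIMIT  10;"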

| username: TiDBer_yyy | Original post link

We used an in-place upgrade; will upgrading to 7.1 change this configuration?
Boss, is there any way to check the configured cache size capacity-mb?
I found that there are a lot of cache misses after the upgrade.
The document mentioned in Coprocessor Cache: cache invalidated for small amounts of data? copr_cache_hit_ratio for large amounts of data appears under cop[tikv] - #4, from 这道题我不会 - TiDB Q&A community, talks about the cache usage scenarios. But in practice, capacity-mb has not been changed.
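
If you only need to read the current value of capacity-mb, a minimal sketch, assuming the config item keeps its default name tikv-client.copr-cache.capacity-mb and that SHOW CONFIG is available on your version:

# SHOW CONFIG returns the live value on every tidb-server instance, so you can
# compare nodes and confirm whether the coprocessor cache size survived the upgrade.
mysql -uroot -pXXX -hxxx -PXXX -e "SHOW CONFIG WHERE type = 'tidb' AND name LIKE '%copr-cache%';"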

| username: TiDBer_yyy | Original post link

The business scenario did not change before and after the upgrade; I'm not sure whether it's caused by too many updates.

| username: TiDBer_yyy | Original post link

I found a bug, but it doesn’t mention how to fix it. Do we just have to wait for an upgrade?

| username: tidb菜鸟一只 | Original post link

Just compact the regions of the table dws_datav_new_register_user_real, adding -c write -d kv. For example:

# Collect the region IDs of the table; replace the connection info and substitute
# your actual db_name/table_name (here 'oltp'/'sbtest8' are placeholders).
# -N drops the column header so region_list contains only region IDs.
mysql -uroot -pXXX -hxxx -PXXX -N information_schema -e "select region_id from tikv_region_status where db_name='oltp' and table_name='sbtest8'" > region_list
# Compact the write and default column families of each region, including the
# bottommost level, on the TiKV instance(s) that hold the region's peers.
cat region_list | while read line
do
  tiup ctl:v7.1.0 tikv --host xxxx:20160 compact -r $line -d kv -c write --threads 1 --bottommost force
  tiup ctl:v7.1.0 tikv --host xxx:20160 compact -r $line -d kv -c default --threads 1 --bottommost force
done

| username: TiDBer_yyy | Original post link

Master, what is the principle behind doing this?

Also, which TiKV instance should I use as --host xxx:20160?

| username: ajin0514 | Original post link

You can refer to Slow Query Log | PingCAP Documentation Center

| username: TiDBer_yyy | Original post link

I don’t quite understand how to handle it…

| username: tidb菜鸟一只 | Original post link

Manually compact the regions corresponding to this table. Although there isn’t much data in your table, there is a lot of tombstone data in the regions that need to be compacted. Use SELECT * FROM INFORMATION_SCHEMA.TIKV_REGION_PEERS; to determine which regions correspond to which TiKV nodes.
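
A hedged sketch of that lookup, assuming the INFORMATION_SCHEMA tables TIKV_REGION_STATUS, TIKV_REGION_PEERS and TIKV_STORE_STATUS, with 'oltp' standing in for your actual schema name: it lists the address of every TiKV store that holds a peer of each region of the table, since tombstones sit on every replica and tikv-ctl compaction runs per store.

# Map each region of the table to the TiKV addresses holding its peers,
# so you know which --host values to pass to tikv-ctl compact.
mysql -uroot -pXXX -hxxx -PXXX information_schema -e "
  SELECT s.REGION_ID, p.STORE_ID, t.ADDRESS, s.IS_INDEX, s.INDEX_NAME
  FROM   TIKV_REGION_STATUS s
  JOIN   TIKV_REGION_PEERS  p ON p.REGION_ID = s.REGION_ID
  JOIN   TIKV_STORE_STATUS  t ON t.STORE_ID  = p.STORE_ID
  WHERE  s.DB_NAME = 'oltp' AND s.TABLE_NAME = 'dws_datav_new_register_user_real'
  ORDER  BY s.REGION_ID, t.ADDRESS;"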

| username: TiDBer_yyy | Original post link

Okay, I’ll give it a try.

| username: h5n1 | Original post link

Did it slow down immediately after upgrading to version 7 or after some time? Is the execution plan posted here for the same SQL?

| username: 有猫万事足 | Original post link

This bug is probably unrelated, as the feature involved is generally not enabled, and it will only be GA by the end of the year.

Version 7: key_skipped_count: 296859919
Version 5: key_skipped_count: 2852380

* Rocksdb_key_skipped_count: The number of deleted (tombstone) keys encountered by RocksDB when scanning data.

Let’s start with the number of tombstone keys.

Check the GC time, whether the safepoint has advanced, or perform manual compaction.
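
For the GC part, a minimal sketch (the safepoint and last-run time are kept in the mysql.tidb system table); run it twice a few minutes apart and check that tikv_gc_safe_point keeps moving forward:

# Read the GC safepoint, last run time, and configured GC life time.
mysql -uroot -pXXX -hxxx -PXXX -e "
  SELECT VARIABLE_NAME, VARIABLE_VALUE
  FROM   mysql.tidb
  WHERE  VARIABLE_NAME IN ('tikv_gc_safe_point', 'tikv_gc_last_run_time', 'tikv_gc_life_time');"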

| username: TiDBer_yyy | Original post link

It slowed down immediately after the upgrade.

| username: TiDBer_yyy | Original post link

[image]

It is advancing.

There are currently many empty regions, and they are being cleaned up continuously.
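
If you want to watch that cleanup from the command line, a hedged sketch, assuming pd-ctl's region check empty-region subcommand and the usual merge settings, with <pd-host> standing in for your PD address:

# List empty regions still waiting to be merged away (head to avoid dumping everything).
tiup ctl:v7.1.0 pd -u http://<pd-host>:2379 region check empty-region | head -n 5
# Merge-related scheduling settings that control how fast empty regions disappear.
tiup ctl:v7.1.0 pd -u http://<pd-host>:2379 config show | grep -i merge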

| username: TiDBer_yyy | Original post link

It's the same SQL, just a simple count() query.

| username: h5n1 | Original post link

It’s quite strange that total_process_keys increased after the upgrade.

  • total_keys: The number of keys scanned by the Coprocessor.
  • processed_keys: The number of keys processed by the Coprocessor. Compared to total_keys, processed_keys does not include old versions of MVCC. If there is a large difference between processed_keys and total_keys, it indicates that there are many old versions.

The two execution plans are the same, and the actRows of the index scans are also very close. In theory, the total_process_keys in version 7 should be the closest to the actual value.

| username: 人如其名 | Original post link

If there is a test environment, it is recommended to run only this statement there and check the execution plan information after multiple runs. It looks as though the behavior of other statements has changed and they are occupying the region result-set cache, which drives the cache hit rate of this statement down. If running only this statement in the test environment still does not meet expectations, it might be a product issue. If it runs faster after multiple attempts, you need to investigate which other statements are using the cache. That can be looked at from two angles:

• whether the execution plans of some simple SQL statements have changed and are now occupying the cache;
• whether version 7.1 has added cache usage scenarios compared with previous versions, which would also take up cache space.
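
For the first angle, a hedged sketch, assuming INFORMATION_SCHEMA.CLUSTER_STATEMENTS_SUMMARY_HISTORY is enabled with its default columns: it looks for SQL digests that have produced more than one plan digest, i.e. statements whose execution plans changed.

# Find statements with more than one distinct execution plan in the summary history.
mysql -uroot -pXXX -hxxx -PXXX information_schema -e "
  SELECT DIGEST,
         COUNT(DISTINCT PLAN_DIGEST) AS plan_cnt,
         SUM(EXEC_COUNT)             AS exec_cnt,
         MAX(LEFT(DIGEST_TEXT, 80))  AS sql_head
  FROM   CLUSTER_STATEMENTS_SUMMARY_HISTORY
  GROUP  BY DIGEST
  HAVING plan_cnt > 1
  ORDER  BY exec_cnt DESC
  LIMIT  20;"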

| username: h5n1 | Original post link

Has your entire cluster’s SQL performance slowed down?