OOM After Upgrading TiDB (4.0 -> 5.3.1)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 升级 tidb 后OOM(4.0 → 5.3.1)

| username: jianzhiyao

【TiDB Usage Environment】Production
【TiDB Version】
4.0 → 5.3.1
【Encountered Problem】
【Reproduction Path】After the upgrade, the cluster ran stably for two to three days, then memory usage grew from about 3 GB to over 60 GB.
【Problem Phenomenon and Impact】
A large number of scan_table commands appeared in TiKV.

【Core SQL】
The number of uids is around 100.

SELECT xxx FROM `t_user` WHERE `uid` IN (uid1, uid2, ..... uid100) LIMIT 1000;

【Attachments】
TiDB memory growth chart

TiDB QPS

TiKV commands


| username: xfworld | Original post link

Check the slow SQL (through the Dashboard). If that doesn't turn up anything, enable expensive-query detection (with an appropriate threshold) and see which SQLs are doing full table scans.

Refer to this document:

| username: songxuecheng | Original post link

  1. Check the value of the tidb_analyze_version parameter, and confirm whether an analyze was running at that time (a quick check of the variable is sketched below).
  2. Are there any slow SQL queries?
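
A minimal way to read the current tidb_analyze_version setting (the OP's result appears a few replies below):

SELECT @@tidb_analyze_version;
-- equivalently:
SHOW VARIABLES LIKE 'tidb_analyze_version';
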
| username: cs58_dba | Original post link

  • SQL optimization to reduce unnecessary returned data
  • Reduce large transactions by splitting them into smaller transactions
  • Adjust TiDB Server parameters to limit the memory usage of a single SQL statement (see the sketch after this list)
  • Other methods to mitigate TiDB Server OOM
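
For the parameter-tuning point, a minimal sketch for 5.3: the per-statement quota can be set per session, while whether an over-quota statement is cancelled or only logged is controlled by the oom-action item in tidb.toml. The 1 GiB value below is illustrative, not a recommendation.

-- Cap the memory a single SQL statement may use in this session (in bytes).
SET SESSION tidb_mem_quota_query = 1073741824;
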
| username: ealam_小羽 | Original post link

Were there any upgrades to the business applications at the same time?
You can check the Dashboard.
Access the Dashboard: Access TiDB Dashboard | PingCAP Documentation
Check the related slow queries and have the developers make modifications.

If there are tables with indexes but the slow queries are not using the indexes, check the health of the tables. You may need to execute ANALYZE.

ANALYZE | PingCAP Documentation
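
If stale statistics are the suspicion, a quick check and refresh might look like this (schema and table names are taken from the execution plans posted later in the thread):

-- A Healthy value well below 100 suggests the statistics are out of date.
SHOW STATS_HEALTHY WHERE Db_name = 'd_database' AND Table_name = 't_user';

-- Rebuild the statistics for the table the problem query touches.
ANALYZE TABLE t_user;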

| username: jianzhiyao | Original post link

+------------------------+
| @@tidb_analyze_version |
+------------------------+
| 2                      |
+------------------------+

| username: songxuecheng | Original post link

You can refer to this.

| username: TiDBer_jYQINSnf | Original post link

Check the execution plan to see if there are any changes. The most common issue encountered during an upgrade is a series of problems caused by changes in the execution plan. Also, check the table health.
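
One way to see whether the plan for this statement actually changed, and how much memory it used, is the statements summary tables; the LIKE pattern below is only illustrative:

SELECT summary_begin_time, plan_digest, exec_count, avg_mem, max_mem
FROM information_schema.cluster_statements_summary_history
WHERE digest_text LIKE '%t_user%uid%in%'
ORDER BY summary_begin_time DESC
LIMIT 20;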

| username: jianzhiyao | Original post link

The execution plan has indeed changed:

SQL:

explain SELECT xxx FROM `t_user` WHERE `uid` = 123 LIMIT 1000\G;

Execution Plan During Failure

*************************** 1. row ***************************
            id: Projection_7
          task: root
       estRows: 1000
       actRows: 2
 operator info: d_database.t_user.xxx
execution info: time:601.5ms, loops:2, Concurrency:OFF
        memory: 3.66 KB
          disk: N/A
*************************** 2. row ***************************
            id: └─IndexLookUp_17
          task: root
       estRows: 1000
       actRows: 2
 operator info: limit embedded(offset:0, count:1000)
execution info: time:601.5ms, loops:2, index_task: {total_time: 584.4ms, fetch_handle: 584.4ms, build: 1.21µs, wait: 2.79µs}, table_task: {total_time: 16.9ms, num: 1, concurrency: 5}
        memory: 11.7 KB
          disk: N/A
*************************** 3. row ***************************
            id:   ├─Limit_16
          task: cop[tikv]
       estRows: 1000
       actRows: 2
 operator info: offset:0, count:1000
execution info: time:584.4ms, loops:3, cop_task: {num: 1, max: 584.2ms, proc_keys: 2, rpc_num: 1, rpc_time: 584.2ms, copr_cache_hit_ratio: 0.00}, tikv_task:{time:0s, loops:1}, scan_detail: {total_process_keys: 2, total_process_keys_size: 92, total_keys: 4, rocksdb: {delete_skipped_count: 0, key_skipped_count: 3, block: {cache_hit_count: 10, read_count: 0, read_byte: 0 Bytes}}}
        memory: N/A
          disk: N/A
*************************** 4. row ***************************
            id:   │ └─IndexRangeScan_14
          task: cop[tikv]
       estRows: 1000
       actRows: 2
 operator info: table:t_user, index:uid(uid), range:[1595527641,1595527641], keep order:false, stats:pseudo
execution info: tikv_task:{time:0s, loops:1}
        memory: N/A
          disk: N/A
*************************** 5. row ***************************
            id:   └─TableRowIDScan_15
          task: cop[tikv]
       estRows: 1000
       actRows: 2
 operator info: table:t_user, keep order:false, stats:pseudo
execution info: time:16.8ms, loops:2, cop_task: {num: 2, max: 16.7ms, min: 13ms, avg: 14.9ms, p95: 16.7ms, max_proc_keys: 1, p95_proc_keys: 1, rpc_num: 2, rpc_time: 29.7ms, copr_cache_hit_ratio: 0.00}, tikv_task:{proc max:0s, min:0s, p80:0s, p95:0s, iters:2, tasks:2}, scan_detail: {total_process_keys: 2, total_process_keys_size: 344, total_keys: 2, rocksdb: {delete_skipped_count: 0, key_skipped_count: 0, block: {cache_hit_count: 19, read_count: 0, read_byte: 0 Bytes}}}
        memory: N/A
          disk: N/A

Current Execution Plan


*************************** 1. row ***************************
           id: Projection_7
      estRows: 1.38
         task: root
access object:
operator info: d_database.xxx
*************************** 2. row ***************************
           id: └─IndexLookUp_17
      estRows: 1.38
         task: root
access object:
operator info: limit embedded(offset:0, count:1000)
*************************** 3. row ***************************
           id:   ├─Limit_16(Build)
      estRows: 1.38
         task: cop[tikv]
access object:
operator info: offset:0, count:1000
*************************** 4. row ***************************
           id:   │ └─IndexRangeScan_14
      estRows: 1.38
         task: cop[tikv]
access object: table:t_user, index:uid(uid)
operator info: range:[1595527641,1595527641], keep order:false
*************************** 5. row ***************************
           id:   └─TableRowIDScan_15(Probe)
      estRows: 1.38
         task: cop[tikv]
access object: table:t_user
operator info: keep order:false
| username: TiDBer_jYQINSnf | Original post link

I don’t see much difference in the execution plans. How about using explain analyze to see the actual execution situation?
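
For reference, the two plans are easier to compare when both are captured with EXPLAIN ANALYZE, which actually executes the statement and reports actRows plus per-operator time and memory (so run it with the same care as the query itself); using the simplified statement from above:

EXPLAIN ANALYZE SELECT xxx FROM `t_user` WHERE `uid` = 123 LIMIT 1000\G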

| username: jianzhiyao | Original post link

estRows (the estimated number of rows) is significantly different between the two plans.

| username: TiDBer_jYQINSnf | Original post link

Oh~ The actual indexes and scans used are the same, so there shouldn’t be any issues.

| username: jianzhiyao | Original post link

What direction do you suggest for troubleshooting?

| username: buddyyuan | Original post link

Check the TiDB logs for the time period in question to see if there are any expensive statements.
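
Statements that cross the expensive-query thresholds should show up in tidb.log tagged "expensive_query"; the slow-query table can also be sorted by peak memory. The time window below is a placeholder:

SELECT time, query_time, mem_max, query
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2022-06-01 00:00:00' AND '2022-06-02 00:00:00'
ORDER BY mem_max DESC
LIMIT 10;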

| username: TiDBer_jYQINSnf | Original post link

Adjust these two parameters and check if oom-action is set to cancel. If not, change it to cancel, then check the logs to see which SQL statement is using a large amount of memory.
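
The screenshot with the two parameters did not carry over into this translation; on 5.3 they are presumably mem-quota-query and oom-action in tidb.toml. The live values can be checked with:

-- "cancel" kills a statement that exceeds the quota; "log" only records it.
SHOW CONFIG WHERE type = 'tidb' AND name IN ('oom-action', 'mem-quota-query');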

| username: jianzhiyao | Original post link

This is useless. After changing it and running it again, it still results in OOM.

| username: h5n1 | Original post link

The OOM issue caused by statistics requires not only parameter adjustments but also deleting the original v2 statistics. There are already screenshots in the previous replies.

| username: jianzhiyao | Original post link

How do I delete it? Also, our upgrade was not an in-place upgrade; we synchronized to another cluster and then switched over. I wonder whether that has any impact.

| username: h5n1 | Original post link

Deleting statistical information affects the execution plan.
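
For reference, the usual sequence for dropping the version-2 statistics and falling back, sketched against the table from this thread (the plan will rely on pseudo or fresh stats until the ANALYZE finishes, so prefer a low-traffic window):

-- Make future ANALYZE runs collect version 1 statistics.
SET GLOBAL tidb_analyze_version = 1;

-- Drop the existing v2 statistics for the affected table, then rebuild them.
DROP STATS t_user;
ANALYZE TABLE t_user;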

| username: jianzhiyao | Original post link

It seems to be working now. I’ll keep an eye on it. Thanks a lot.