Heap memory usage surged after upgrading from 4.0.6 to 5.3.0

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 4.0.6升级到5.3.0后heap memory usage飙升

| username: Hacker_QHxLEOeu

【TiDB Usage Environment】Production environment
【TiDB Version】5.3.0
【Encountered Problem】heap memory usage surge
【Reproduction Path】Upgraded the cluster from 4.0.6 to 5.3.0 with tiup cluster upgrade
【Problem Phenomenon and Impact】

【Attachment】

Please provide the version information of each component, such as cdc/tikv, which can be obtained by executing cdc version/tikv-server --version.

| username: tidb狂热爱好者 | Original post link

TiDB is showing an OOM phenomenon. Check whether there are any slow SQL queries.
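
As a minimal sketch, assuming default ports and placeholder connection parameters, both points can be checked from the command line (paths and hosts need adjusting to the deployment):

dmesg -T | grep -i "out of memory"            # did the OS OOM killer terminate tidb-server?
grep "Welcome to TiDB" tidb.log               # each hit marks a tidb-server start, i.e. a restart
mysql -h <tidb-ip> -P 4000 -u root -e "SELECT query_time, mem_max, SUBSTRING(query, 1, 80) AS query_text FROM information_schema.slow_query ORDER BY query_time DESC LIMIT 10;"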

| username: Hacker_QHxLEOeu | Original post link

There are slow SQL queries, mostly the slow-query analysis issued from the dashboard itself plus one earlier query. How can I check whether there is an OOM (out of memory) issue?

| username: Hacker_QHxLEOeu | Original post link

[The original post contained only an image, which is not available in this translation.]

| username: Hacker_QHxLEOeu | Original post link

Is anyone there? Currently, there’s no OOM, but the heap memory is still increasing. I have no direction at all!

| username: Hacker_QHxLEOeu | Original post link

GC behavior also changed significantly after the upgrade.

| username: h5n1 | Original post link

Use the method above to capture the pprof heap profile and upload the result file.
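
A minimal sketch of capturing the heap profile, assuming the default tidb-server status port 10080 and a placeholder address:

curl -s http://<tidb-ip>:10080/debug/pprof/heap -o heap.profile   # dump the Go heap profile
go tool pprof -top heap.profile                                   # optional: inspect it locally (needs the Go toolchain)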

| username: Hacker_QHxLEOeu | Original post link

heap.profile (324.1 KB)

| username: h5n1 | Original post link

Is this looking at the highest blue one?

| username: Hacker_QHxLEOeu | Original post link

Here is the latest update: after restarting tidb-server, the values are no longer as unbalanced, and heap growth on the three tidb instances is roughly in sync.

Below is the change after the restart:

| username: h5n1 | Original post link

From the heap profile, the largest contributor is related to slow logs; the way this is handled probably changed in 5.x. Check whether the tidb_analyze_version variable is set to 2; if so, change it to 1, and also clear the historical version-2 statistics, as shown below.
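
A minimal sketch of these steps, assuming the mysql client and placeholder connection parameters; mysql.stats_histograms records which analyze version produced each table's statistics:

# check the current setting and switch back to version 1 if it is 2
mysql -h <tidb-ip> -P 4000 -u root -e "SHOW GLOBAL VARIABLES LIKE 'tidb_analyze_version';"
mysql -h <tidb-ip> -P 4000 -u root -e "SET GLOBAL tidb_analyze_version = 1;"
# list tables that still carry version-2 statistics
mysql -h <tidb-ip> -P 4000 -u root -e "SELECT DISTINCT CONCAT(t.table_schema, '.', t.table_name) AS tbl FROM information_schema.tables t JOIN mysql.stats_histograms h ON t.tidb_table_id = h.table_id WHERE h.stats_ver = 2;"
# drop the old statistics for each table reported above; auto analyze will rebuild them
mysql -h <tidb-ip> -P 4000 -u root -e "DROP STATS <db_name>.<table_name>;"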

| username: 特雷西-迈克-格雷迪 | Original post link

Check the top SQL statements in the dashboard to see which SQL is consuming CPU. It’s possible that execution plans changed between 4.0 and 5.x.
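
A sketch of checking this outside the dashboard via the statement summary tables, assuming placeholder connection parameters; a changing PLAN_DIGEST for the same statement digest hints at a plan change:

# top statements by average latency; PLAN_DIGEST identifies the execution plan in use
mysql -h <tidb-ip> -P 4000 -u root -e "SELECT DIGEST_TEXT, EXEC_COUNT, AVG_LATENCY, PLAN_DIGEST FROM information_schema.cluster_statements_summary ORDER BY AVG_LATENCY DESC LIMIT 10;"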

| username: Hacker_QHxLEOeu | Original post link

Adjusting analyze did not help; memory usage still increased. Within half an hour the profile grew by 100 MB, consistent with the increase in heap memory usage.

No abnormal queries were found in slow_query, and no analyze-related logs were found in tidb.log. Only the following two lines were found, corresponding to the time points when tidb-server was restarted:

[tidb@tikv01 log]$ grep Analyze tidb.log
[2022/09/27 09:33:28.956 +08:00] [INFO] [domain.go:1422] ["autoAnalyzeWorker exited."]
[2022/09/28 09:22:07.400 +08:00] [INFO] [domain.go:1422] ["autoAnalyzeWorker exited."]

| username: h5n1 | Original post link

Were the table statistics for analyze version 2 also deleted? Can we observe for a longer period to see how much the memory increases before it stops?

| username: Hacker_QHxLEOeu | Original post link

At that time there was only a small test table, which has already been dropped. The core business tables shouldn’t need any drop operation. Do I still need to manually analyze the core business tables?

| username: h5n1 | Original post link

If the SQL plan is fine, you don’t need to do anything; the system will analyze it automatically.
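
Auto analyze is governed by a few system variables; a quick sketch for inspecting them, assuming placeholder connection parameters:

# threshold and time window that control when auto analyze may run
mysql -h <tidb-ip> -P 4000 -u root -e "SHOW GLOBAL VARIABLES LIKE 'tidb_auto_analyze%';"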

| username: 人如其名 | Original post link

Patch to 5.3.3; it’s a known defect: executor: fix goroutine leak in querying slow log (#32757) by ti-srebot · Pull Request #32781 · pingcap/tidb · GitHub
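
Before upgrading, the goroutine profile on the status port can help confirm whether this leak is the culprit; a sketch assuming the default status port 10080, with the grep pattern only a guess at the slow-log code path:

curl -s "http://<tidb-ip>:10080/debug/pprof/goroutine?debug=1" | head -1               # total goroutine count; steady growth suggests a leak
curl -s "http://<tidb-ip>:10080/debug/pprof/goroutine?debug=1" | grep -c "slow_query"  # goroutines parked in the slow-log reading path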

| username: Hacker_QHxLEOeu | Original post link

It has now stabilized at 2.4 GB, an increase of 2.2 GB compared to before. Apart from the 1 GB Coprocessor cache, what other configurations could account for this increase in memory? Also, how can I tell whether auto analyze is running? Should I look for relevant entries in tidb.log?
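
A sketch for both questions, assuming placeholder connection parameters: the coprocessor cache size should appear under tikv-client.copr-cache in the tidb-server configuration, and SHOW ANALYZE STATUS lists recent analyze jobs, including automatic ones:

mysql -h <tidb-ip> -P 4000 -u root -e "SHOW CONFIG WHERE type='tidb' AND name LIKE 'tikv-client.copr-cache%';"   # coprocessor cache capacity per tidb-server
mysql -h <tidb-ip> -P 4000 -u root -e "SHOW ANALYZE STATUS;"                                                    # recent analyze jobs, automatic ones included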

| username: h5n1 | Original post link

It’s unclear what is causing the increase. Check tidb.log on the stats owner node; you can run select tidb_is_ddl_owner() on each server to find which one it is.
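
A sketch of that check, assuming placeholder hostnames and connection parameters; the grep pattern assumes the auto-analyze log wording:

# find which tidb-server currently holds the owner role
for host in tidb01 tidb02 tidb03; do
  echo -n "$host: "
  mysql -h "$host" -P 4000 -u root -N -e "SELECT tidb_is_ddl_owner();"
done
# then, on that node, look for auto analyze activity in its log
grep -i "auto analyze" tidb.log | tail -20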

| username: Raymond | Original post link

If you stop querying slow SQL through the dashboard, does the memory still keep growing?
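
To watch the trend without touching the dashboard, the heap gauge exposed on the status port’s metrics endpoint can be sampled periodically; a sketch assuming the default status port 10080:

# sample the in-use Go heap every 5 minutes (Ctrl-C to stop)
while true; do
  date
  curl -s http://<tidb-ip>:10080/metrics | grep '^go_memstats_heap_inuse_bytes'
  sleep 300
done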