TiDB memory usage is steadily increasing

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB内存一直在缓慢增涨

| username: dengqee

【TiDB Usage Environment】Production Environment
【TiDB Version】v4.0.2
【Reproduction Path】Operations performed that led to the issue
【Encountered Issue: Problem Phenomenon and Impact】
TiDB memory usage is slowly but continuously increasing, and only one TiDB instance has high memory usage while the others remain very low. All TiDB instances are load-balanced through LVS, and the connection counts are evenly balanced across them.
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
TiDB memory monitoring


Memory distribution of the high memory node

| username: xfworld | Original post link

That’s quite an old version; has it not been upgraded yet…

There are many reasons why TiDB holds on to memory for a long time without releasing it, the biggest one being slow queries. You can identify these queries by locating where the resources are being consumed, and then resolve them through optimization.

OOM handling wasn’t properly optimized until the 6.x versions…
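
For example, here is a rough sketch for finding the slow queries that reserve the most memory, assuming the cluster exposes INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY (available since v4.0) and that a mysql client can reach any TiDB instance; the host, port, and time range below are placeholders:

```bash
#!/bin/bash
# Placeholder connection parameters -- replace with your own.
TIDB_HOST=tidb_ip
TIDB_PORT=4000

# Slow queries from the last 24 hours that reserved the most memory,
# grouped by the TiDB instance that executed them.
mysql -h "$TIDB_HOST" -P "$TIDB_PORT" -u root -p -e "
  SELECT INSTANCE, DIGEST, MAX(Mem_max) AS max_mem_bytes, COUNT(*) AS cnt
  FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
  WHERE Time > NOW() - INTERVAL 24 HOUR
  GROUP BY INSTANCE, DIGEST
  ORDER BY max_mem_bytes DESC
  LIMIT 20;"
```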

| username: h5n1 | Original post link

Does the goroutine count in the monitoring show the same kind of increase?

| username: dengqee | Original post link

The goroutine count did not increase, and all nodes are about the same.

| username: dengqee | Original post link

Can it be determined from the memory distribution that this is caused by slow queries?

| username: h5n1 | Original post link

Check the Query Summary → CPS By Instance and KV Request → KV Request OPS panels in the monitoring to see whether that node itself is receiving a high number of requests.
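
If it is easier to pull those numbers from the command line than from Grafana, a rough alternative is to query Prometheus directly for per-instance QPS; the Prometheus address below is a placeholder, and the tidb_server_query_total metric is assumed to be what the CPS By Instance panel is based on:

```bash
#!/bin/bash
# Placeholder Prometheus address -- replace with the one in your deployment.
PROM=http://prometheus_ip:9090

# Per-TiDB-instance QPS over the last 5 minutes; this should roughly match
# the CPS By Instance panel in Grafana.
curl -s -G "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(rate(tidb_server_query_total[5m])) by (instance)' \
  | python -m json.tool
```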

| username: caiyfc | Original post link

I ran into this issue on version 5 before. I always suspected it was caused by the Dashboard’s slow-SQL query feature: the larger the query time range, the more memory a particular TiDB node uses, and it occasionally leads to OOM. You can verify this.
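
One rough way to verify this, sketched below under the assumption that each TiDB status port (default 10080) exposes Prometheus-format metrics at /metrics: poll the resident memory of every instance while the Dashboard slow-query page is opened with a wide time range, and watch whether a single instance jumps.

```bash
#!/bin/bash
# Placeholder list of TiDB status addresses (default status port 10080).
INSTANCES="tidb1_ip:10080 tidb2_ip:10080 tidb3_ip:10080"

# Poll the resident memory of every TiDB instance every 10 seconds while the
# Dashboard slow-query page is being browsed with a wide time range.
while true; do
  date
  for inst in $INSTANCES; do
    rss=$(curl -s "http://$inst/metrics" | awk '/^process_resident_memory_bytes/ {print $2}')
    echo "$inst resident memory: $rss bytes"
  done
  sleep 10
done
```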

| username: dengqee | Original post link

I took a look at this, and the requests for this node are similar to those for other nodes.

| username: h5n1 | Original post link

curl -G http://tidb_ip:status_port/debug/pprof/heap > heap.profile
Then run go tool pprof heap.profile and use the top command inside it to see what is taking up the most memory. (The status port defaults to 10080.)
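
Since the growth is slow, it can also help to take two heap profiles a few hours apart and diff them; a sketch assuming the default status port 10080 and a local Go toolchain:

```bash
#!/bin/bash
# Placeholder TiDB status address -- 10080 is the default status port.
TIDB_STATUS=tidb_ip:10080

# Take a baseline heap profile now, then run the same command again a few
# hours later to get a second one.
curl -s "http://$TIDB_STATUS/debug/pprof/heap" > "heap_$(date +%Y%m%d_%H%M).profile"

# With two profiles collected (the file names below are examples), diff them;
# inside the interactive prompt, `top` shows the allocation sites that grew.
go tool pprof -base heap_20230801_0900.profile heap_20230801_1500.profile
```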

| username: dengqee | Original post link

This result has already been posted.

| username: h5n1 | Original post link

It still looks SQL-related; let’s take a look at the slow SQL.

| username: dengqee | Original post link

There are slow logs on every node. The other nodes have no fewer slow SQL statements than this one, yet only this node’s memory stays consistently high.

| username: redgame | Original post link

Memory leak

| username: dengqee | Original post link

After investigation, we found that the node with high memory usage is the stats owner node. To avoid affecting the business, we restarted that node. However, the new stats owner also experienced continuous memory growth, which seems like a memory leak.
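
For reference, one way to check which instance currently holds the stats owner role: the sketch below assumes the owner election key lives under /tidb/stats/owner in PD’s embedded etcd (an assumption based on the TiDB source), and that the owner ID can be matched against the ddl_id values returned by the TiDB status API’s /info/all endpoint.

```bash
#!/bin/bash
# Placeholder addresses -- replace with your PD endpoint and any TiDB status address.
PD_ENDPOINT=http://pd_ip:2379
TIDB_STATUS=tidb_ip:10080

# 1. Read the current stats owner election key from PD's embedded etcd
#    (the key path is an assumption based on the TiDB source code).
ETCDCTL_API=3 etcdctl --endpoints="$PD_ENDPOINT" get /tidb/stats/owner --prefix

# 2. List all TiDB servers with their IDs (ddl_id) and addresses, then match
#    the owner ID from step 1 against this list to find the owner node.
curl -s "http://$TIDB_STATUS/info/all" | python -m json.tool
```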

| username: DBRE | Original post link

The same problem.

| username: zhanggame1 | Original post link

The version is a bit old; consider upgrading it.

| username: liuis | Original post link

Enable pprof and then analyze the profile to see where the leak is.
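
For example, with a Go toolchain on hand, pprof can be pointed directly at the live instance and the result browsed in a web UI; a minimal sketch assuming the default status port 10080:

```bash
# Fetch the heap profile straight from the running TiDB instance and open an
# interactive web UI (top, flame graph, source view) on local port 8080.
go tool pprof -http=:8080 http://tidb_ip:10080/debug/pprof/heap
```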