Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiDB内存一直在缓慢增涨 (TiDB memory keeps growing slowly)
【TiDB Usage Environment】Production Environment
【TiDB Version】v4.0.2
【Reproduction Path】Operations performed that led to the issue
【Encountered Issue: Problem Phenomenon and Impact】
TiDB memory is continuously increasing slowly, and only one TiDB instance has high memory usage while the others have very low memory usage. All TiDB instances are load balanced through LVS, and the connection numbers are balanced.
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
TiDB memory monitoring
Memory distribution of the high memory node
That’s quite an old version; has it not been upgraded yet…
There are many reasons why TiDB may hold on to memory for a long time; the most common one is slow queries. You can locate those queries through monitoring and the slow query log, and then resolve them through optimization.
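For example, the most memory-hungry statements on the affected instance can be pulled from the slow query table (a sketch; it assumes the default MySQL port 4000 and that the slow query log is enabled, which it is by default):

mysql -h tidb_ip -P 4000 -u root -p -e "SELECT time, query_time, mem_max, query FROM information_schema.slow_query ORDER BY mem_max DESC LIMIT 10;"

Here mem_max is the peak memory, in bytes, that each statement used on that TiDB instance.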
OOM handling wasn’t properly optimized until the 6.x versions…
Is the goroutine count monitoring also increasing in the same way?
The goroutine count did not increase, and all nodes are about the same.
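For reference, the totals can also be compared directly from each TiDB status port instead of Grafana (a sketch; it assumes the default status port 10080):

curl -s "http://tidb_ip:10080/debug/pprof/goroutine?debug=1" | head -n 1

The first line reads like "goroutine profile: total N", so running the same check on every instance makes the comparison straightforward.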
Can it be determined from memory distribution that it is caused by slow queries?
Check the Query Summary → CPS By Instance and KV Request → KV Request OPS panels to see whether that particular node is receiving a noticeably higher number of requests.
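If Grafana isn’t handy, the same kind of comparison can be made from each instance’s metrics endpoint (a sketch; it assumes the default status port 10080, and that tidb_server_query_total is the counter behind the CPS panel):

curl -s http://tidb_ip:10080/metrics | grep '^tidb_server_query_total'

Sampling the counter twice on each node and comparing the growth gives a rough per-instance QPS.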
I encountered this issue on version 5 before. I always suspected it was caused by the Dashboard’s slow query search feature: the larger the query time range, the more memory a particular TiDB node uses, and it occasionally results in OOM. You could verify that.
I took a look at this, and the requests for this node are similar to those for other nodes.
curl -G http://tidb_ip:status_port/debug/pprof/heap > heap.profile
Then run go tool pprof heap.profile and use the top command to see what is taking up the most memory.
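If the memory keeps creeping up, it can also help to capture two heap profiles some time apart and diff them so that only the growth shows up (a sketch; the file names and waiting period are arbitrary, and the default status port 10080 is assumed):

curl -G http://tidb_ip:10080/debug/pprof/heap > heap_t0.profile
# wait an hour or more while memory keeps growing
curl -G http://tidb_ip:10080/debug/pprof/heap > heap_t1.profile
go tool pprof -top -base=heap_t0.profile heap_t1.profile

The top entries of the diff are the allocations that accumulated between the two snapshots, which is usually a much clearer signal for a slow leak than a single profile.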
This result has already been posted.
It still seems to be related to SQL; let’s take a look at the slow SQL.
There are slow queries in the logs on all nodes. The other nodes have no fewer slow SQL statements than this one, yet only this node’s memory stays consistently high.
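For the record, that comparison can be made quantitatively with the cluster-level slow query table available since 4.0 (a sketch; the one-day time filter is just an example):

mysql -h tidb_ip -P 4000 -u root -p -e "SELECT instance, COUNT(*) AS slow_count, MAX(mem_max) AS peak_mem FROM information_schema.cluster_slow_query WHERE time > NOW() - INTERVAL 1 DAY GROUP BY instance;"

If the high-memory node doesn’t stand out here, its own slow queries are probably not what is holding the memory.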
After investigation, we found that the node with high memory usage is the stats owner node. To avoid affecting the business, we restarted that node. However, the new stats owner also experienced continuous memory growth, which seems like a memory leak.
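Since the growth follows the stats owner role, it may be worth checking whether the heap profile is dominated by the statistics module (a sketch; the grep pattern simply matches the TiDB statistics package name, so treat it as a heuristic):

go tool pprof -top heap.profile | grep -i statistics

If most of the inuse space sits in functions under github.com/pingcap/tidb/statistics, the growth is more likely tied to stats maintenance on the owner (for example auto-analyze or the stats cache) than to user queries.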
The version is a bit old, consider upgrading it.
Enable pprof and then analyze the profile to see where the leak is.