Memory of TiDB component nodes grows slowly until it is exhausted. Is it a bug?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb组件节点内存缓慢增长,直至消耗完,BUG了?

| username: jaybing926

[TiDB Usage Environment] Production Environment
[TiDB Version]
Cluster version: v4.0.9
[Encountered Problem: Symptoms and Impact]
Today, the business reported that the system queries are very slow. The speed is inconsistent, sometimes fast and sometimes slow. There haven’t been any changes to the database, but monitoring shows that the TiDB component nodes are gradually consuming resources. The server memory is fully utilized. Could this be related? Can any experts help analyze this? Thank you~

The cluster had its tikv_gc_life_time modified to one month (this was changed for incremental backups, and our business scenario rarely involves modifications). Could this be related? I changed tikv_gc_life_time back to 1 hour, but the memory usage hasn’t decreased.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
Cluster Information:


TIDB Monitoring Panel Screenshots:

TIDB Component Log Information: (Nothing abnormal, but logs are being generated very quickly)

| username: h5n1 | Original post link

Check the slow SQL during that time period. Does the table have any text or LOB type fields?

| username: jaybing926 | Original post link

We have a lot of slow queries. Does this text LOB type have any impact?

| username: h5n1 | Original post link

I previously encountered a situation where text caused the memory to be exhausted, but that was growing very quickly. Looking at yours, your TiDB memory is in a state of slow growth, and there is a sudden increase in goroutines, which feels like a leak.

| username: jaybing926 | Original post link

Just now, a TiDB node’s heap memory suddenly dropped. What is the purpose of this heap memory? Will it affect the GC data?

The memory dropped suddenly. What happened? I modified the gc_life_time at that time. The GC data is stored in TiKV, right? Will it affect TiDB’s memory? I’m a bit confused and don’t quite understand the principle.

-----Got it, the sudden drop happened because the TiDB node process crashed and restarted, so the memory was released.

| username: buddyyuan | Original post link

An OOM occurred. Check the memory-usage-alarm-ratio setting. Then look at the TiDB logs to see if the heap, running_sql, and goroutine information were dumped before the OOM.

Next, analyze the heap and running_sql to determine what caused the OOM issue.

| username: alfred | Original post link

Post the dump information before the OOM to take a look, it could also be a bug related to memory leakage.

| username: jaybing926 | Original post link

The log file is too large, I filtered out the OOM information as follows, please take a look:

[root@b26 tidb-4000]# 
[root@b26 tidb-4000]# 
[root@b26 tidb-4000]# grep -i oom tidb-2022-11-09T00-34-18.088.log  |more
[2022/11/08 20:17:39.510 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31310999552] ["tidb-server memory usage"=19779988200] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:17:51.116 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31315435520] ["tidb-server memory usage"=20917584008] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:18:02.945 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31339331584] ["tidb-server memory usage"=22594641656] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:18:15.052 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31360589824] ["tidb-server memory usage"=23886401472] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:18:36.624 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31433367552] ["tidb-server memory usage"=25614050064] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:18:48.469 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31401734144] ["tidb-server memory usage"=14070173040] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:19:00.565 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31400734720] ["tidb-server memory usage"=15317806152] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:19:13.193 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31428382720] ["tidb-server memory usage"=16562777040] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:19:24.745 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31451799552] ["tidb-server memory usage"=17912782440] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:19:36.860 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31475445760] ["tidb-server memory usage"=20011172064] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:19:48.782 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31427584000] ["tidb-server memory usage"=21546932872] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:20:00.006 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31404859392] ["tidb-server memory usage"=22523424224] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:20:11.863 +08:00] [WARN] [memory_usage_alarm.go:141] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=33533276160] ["system memory usage"=31412420608] ["tidb-server memory usage"=23954927736] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]
[2022/11/08 20:20:56.659 +08:00] [INFO] [printer.go:47] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"192.168.241.26\",\"port\":4000,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"192.168.241.49:12379,192.168.241.26:12379,192.168.241.21:12379\",\"socket\":\"\",\"lease\":\"45s\",\"run-ddl\":true,\"split-table\":true,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage\",\"oom-action\":\"log\",\"mem-quota-query\":8589934592,\"tmp-storage-quota\":-1,\"enable-streaming\":false,\"enable-batch-dml\":false,\"lower-case-table-names\":2,\"server-version\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/data/deploy/install/log/tidb-4000/tidb.log\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"log/tidb_slow_query.log\",\"slow-threshold\":300,\"expensive-threshold\":10000,\"query-log-max-len\":4096,\"record-plan-in-slow-log\":1},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"require-secure-transport\":false,\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10080,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"server-memory-quota\":0,\"memory-usage-alarm-ratio\":0.8,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"feedback-probability\":0,\"query-feedback-limit\":512,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-total-size-limit\":10737418240,\"tcp-keep-alive\":true,\"cross-join\":true,\"run-auto-analyze\":true,\"agg-push-down-join\":false,\"committer-concurrency\":16,\"max-txn-ttl\":600000,\"gogc\":100},\"prepared-plan-cache\":{\"enabled\":false,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"commit-timeout\":\"41s\",\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"5s\",\"copr-cache\":{\"enable\":false,\"capacity-mb\":1000,\"admission-max-ranges\":500,\"admission-max-result-mb\":10,\"admission-min-process-ms\":5}},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"\",\"load\":\"\"},\"pessimistic-txn\":{\"enable\":true,\"max-retry-count\":256},\"check-mb4-value-in-utf8\":true,\"max-index-length\":3072,\"graceful-wait-before-shutdown\":0,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"stmt-summary\":{\"enable\":true,\"enable-internal-query\":false,\"max-stmt-count\":200,\"max-sql-length\":4096,\"refresh-interval\":1800,\"history-size\":24},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"max-server-connections\":0,\"new_collations_enabled_on_first-bootstrap\":false,\"experimental\":{\"allow-expression-index\":false},\"enable-collect-execution-info\":true,\"skip-register-to-dashboard\":false,\"enable-telemetry\":true}"]
[2022/11/08 22:06:07.441 +08:00] [INFO] [2pc.go:822] ["prewrite encounters lock"] [conn=29] [lock="key: {tableID=2230, handle=68540032299}, primary: {tableID=2230, indexID=1, indexValues={123, http://www.t-boom.com.cn/wap/indeax.php?id=795213, }}, txnStartTS: 437234267549335567, lockForUpdateTS:0, ttl: 4321, type: Put"]
[2022/11/08 22:06:07.441 +08:00] [INFO] [2pc.go:822] ["prewrite encounters lock"] [conn=29] [lock="key: {tableID=2230, handle=68540043835}, primary: {tableID=2230, indexID=1, indexValues={123, http://www.t-boom.com.cn/wap/indeax.php?id=795213, }}, txnStartTS: 437234267549335567, lockForUpdateTS:0, ttl: 4321, type: Put"]
[root@b26 tidb-4000]#  
[root@b26 tidb-4000]#
| username: jaybing926 | Original post link

It is 0.8

| username: Kongdom | Original post link

You can change the log level to warn, and there will be much fewer logs.

| username: buddyyuan | Original post link

Check this path:
/tmp/1025_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record

First, see if there are any running SQLs and check if there are any abnormal SQLs consuming memory. If not, upload the heap.

| username: jaybing926 | Original post link

There are quite a few. Are the ones in running_sql abnormal SQLs? Under what circumstances will this be recorded?


This is the running_sql log file:
running_sql2022-11-08T20_19_13+08_00.txt (118.9 KB)

| username: buddyyuan | Original post link

Please upload the latest heap file again.

| username: jaybing926 | Original post link

How do I view this file? I can’t directly use more.
heap2022-11-08T20_20_11+08_00.txt (9.6 MB)

| username: buddyyuan | Original post link

Directly execute go tool pprof -http=127.0.0.1:8080 heap_file in the terminal, and then view it through the browser.

I feel that this issue is still at the SQL layer. There are too many values in the in condition of your SQL where clause.

| username: jaybing926 | Original post link

Okay, thank you for your patient explanation :pray:

| username: 裤衩儿飞上天 | Original post link

It still feels related to the GC time setting being too long. Has there been any improvement after adjusting it to 1 hour?

| username: jaybing926 | Original post link

Actually, I’m not quite sure what’s going on either. Previously, our GC time was always set to 1 month, and I asked the developers, there haven’t been any large data modifications recently. After I changed the GC time to 1 hour, the memory didn’t go down after the next GC, until TiDB OOM released the memory. My conclusion is that it’s an issue with SQL queries. There are indeed a lot of queries running in the background.

| username: 裤衩儿飞上天 | Original post link

Is this SQL from a recently launched business? How large is the table’s data volume? How is the table’s health? Do we need to analyze the table? There are indeed many values in the IN clause in your SQL. You can create a composite index to avoid going back to the table to check another value. After all, it’s just a count; let’s see if it has any effect.

| username: Jiawei | Original post link

So that’s how you look at it.