Troubleshooting TiDB OOM Issues

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb oom 故障排查

| username: zhimadi

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.2
[Reproduction Path] None
[Encountered Problem: Problem Phenomenon and Impact]
Help needed: Two TiDB nodes OOMed one after the other, with physical memory usage on the host reaching 100%. TiDB and PD are deployed on the same machine.
The OOMs were triggered suddenly during off-peak business hours, while the cluster was otherwise in a stable state.
Please advise on how to approach and troubleshoot this. Thank you!!
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachment: Screenshot/Log/Monitoring]

Log:

[Fri Jun 14 08:24:59 2024] [ 429999] 1001 429999 4316303 3598108 30363648 0 0 tidb-server
[Fri Jun 14 08:24:59 2024] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/tidb-4000.service,task=tidb-server,pid=429999,uid=1001
[Fri Jun 14 08:24:59 2024] Out of memory: Killed process 429999 (tidb-server) total-vm:17265212kB, anon-rss:14392432kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:29652kB oom_score_adj:0
[Fri Jun 14 08:24:59 2024] oom_reaper: reaped process 429999 (tidb-server), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x28148d7]

tidb_stderr.log:
goroutine 1 [running]:
github.com/pingcap/tidb/ddl.(*ddl).close(0xc00082f180)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/ddl/ddl.go:399 +0x77
github.com/pingcap/tidb/ddl.(*ddl).Stop(0xc00082f180, 0x0, 0x0)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/ddl/ddl.go:327 +0x8a
github.com/pingcap/tidb/domain.(*Domain).Close(0xc000828140)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/domain/domain.go:695 +0x377
github.com/pingcap/tidb/session.(*domainMap).Get.func1(0x1000001685fc5, 0x7f10440521c8, 0x98)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/tidb.go:86 +0x69e
github.com/pingcap/tidb/util.RunWithRetry(0x1e, 0x1f4, 0xc001f07a60, 0x18, 0x6468280)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/util/misc.go:65 +0x7f
github.com/pingcap/tidb/session.(*domainMap).Get(0x642b450, 0x4538850, 0xc0001dbef0, 0xc000828140, 0x0, 0x0)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/tidb.go:71 +0x1f0
github.com/pingcap/tidb/session.createSessionWithOpt(0x4538850, 0xc0001dbef0, 0x0, 0x3e04200, 0xc000d205a0, 0xc000051980)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2767 +0x59
github.com/pingcap/tidb/session.createSession(...)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2763
github.com/pingcap/tidb/session.BootstrapSession(0x4538850, 0xc0001dbef0, 0x0, 0x0, 0x0)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2598 +0xfe
main.createStoreAndDomain(0x64312a0, 0x3ff6a97, 0x2c)
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:296 +0x189
main.main()
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:202 +0x29e

[2024/06/14 08:29:32.728 +08:00] [WARN] [memory_usage_alarm.go:140] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=16244236288] ["system memory usage"=13190623232] ["tidb-server memory usage"=9654420392] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1001_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]

| username: zhaokede | Original post link

Judging from the error log above, did the OOM happen after a network transmission failure?

| username: h5n1 | Original post link

Where are the TiDB logs?

| username: zhimadi | Original post link

Added the error log, please take a look.

| username: zhimadi | Original post link

Do I need to check any other logs?

| username: tidb菜鸟一只 | Original post link

You can grep "expensive_query" in tidb.log
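
For example (a minimal sketch; the log path assumes a default tiup deployment directory, so adjust it to your environment):

# Queries that TiDB flagged as expensive:
grep "expensive_query" /tidb-deploy/tidb-4000/log/tidb.log
# The memory usage alarm quoted earlier in this thread is written to the same log, so it is worth searching for as well:
grep "memory_usage_alarm" /tidb-deploy/tidb-4000/log/tidb.log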

| username: zhimadi | Original post link

No relevant records found.

| username: FutureDB | Original post link

Check whether your version has an "oom" folder at the same level as tidb.log. As I recall, there is an OOM record file inside that logs the SQL statements that caused the OOM. You can take a look.
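
Also, based on the WARN line quoted earlier in this thread, the memory usage alarm dumps the running SQL list and a heap profile into the "record path" shown in that message. A minimal sketch for inspecting it (the path is copied from the log above; substitute your own):

# List the alarm records, newest last; each record normally includes the SQL
# statements that were running when the alarm fired.
ls -lrt "/tmp/1001_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"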

| username: 小于同学 | Original post link

Not entirely.

| username: TiDBer_vFs1A6CZ | Original post link

It is possible that a large SQL query caused the TiDB server to run out of memory (OOM), which in severe cases can bring the server down or hang the system. Common mitigations:

  • Limit the maximum memory usage of a single SQL query (a minimal example follows below).
  • Protect the TiDB server nodes.
  • Optimize the large SQL queries.
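
A minimal sketch of the first point for v5.4 (the connection details and the 2 GiB value are illustrative assumptions, not recommendations; in this version tidb_mem_quota_query is session-scoped, the instance-wide defaults live in the config file as mem-quota-query and oom-action, and the WARN log above shows server-memory-quota is not set):

# Cap per-query memory for the current session at roughly 2 GiB.
mysql -h 127.0.0.1 -P 4000 -u root -e "SET SESSION tidb_mem_quota_query = 2147483648;"
# Check what TiDB does when a query exceeds the quota (the default action is cancel).
mysql -h 127.0.0.1 -P 4000 -u root -e "SHOW CONFIG WHERE type = 'tidb' AND name = 'oom-action';"

Because the variable is session-scoped in v5.4, it only affects the connection that sets it; to apply a limit cluster-wide you would set mem-quota-query in the TiDB configuration and reload.
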
| username: lemonade010 | Original post link

Check the Prometheus monitoring records to see if the various metrics are normal and if there are any sudden spikes in values.
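
For instance, querying the tidb-server resident memory around the time of the kill directly from Prometheus shows whether memory climbed gradually or spiked (the Prometheus address and the job label "tidb" are assumptions based on a typical tiup deployment):

# Query tidb-server RSS for the two hours around the OOM, at one-minute resolution.
curl -s 'http://<prometheus-host>:9090/api/v1/query_range' \
  --data-urlencode 'query=process_resident_memory_bytes{job="tidb"}' \
  --data-urlencode 'start=2024-06-14T07:00:00+08:00' \
  --data-urlencode 'end=2024-06-14T09:00:00+08:00' \
  --data-urlencode 'step=60'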

| username: zhimadi | Original post link

From the monitoring, the memory is relatively stable. No suspicious expensive queries were found in the logs either.

| username: lemonade010 | Original post link

Take a look at the views under this part of the monitoring, in the tidb-performance-overview dashboard.

| username: zhimadi | Original post link

Which panel is this in? I can’t seem to find it in version 5.4.2.

| username: lemonade010 | Original post link

Check the Grafana monitoring, not the TiDB Dashboard. Use tiup cluster display to see which machine Grafana is deployed on.
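
A minimal sketch (the cluster name is a placeholder; tiup cluster list shows the actual name):

# Print every component in the cluster with the host and port it runs on;
# the grafana and prometheus rows show where the monitoring is deployed.
tiup cluster display <cluster-name>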

| username: zhimadi | Original post link

In Grafana, but I don’t know which panel you’re referring to?

| username: lemonade010 | Original post link

[This reply contained only a screenshot; there was no text to translate.]

| username: zhimadi | Original post link

It's probably not the same version; it looks a bit different from yours.

| username: Kamner | Original post link

Reference:

| username: ziptoam | Original post link

You can try isolating the deployment of TiDB and PD first.