How to Troubleshoot the Specific Cause When a TiDB Machine Runs Out of Memory and Requires a Manual Restart

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb有个机器内存耗尽人工重启了,怎么排查具体原因

| username: zhanggame1

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 7.5.1

[Encountered Issue: Problem Phenomenon and Impact]
The original cluster was reinstalled as version 7.5.1 using the original configuration file, and TiDB Lightning was then used to import the backup data taken before the reinstallation. After the import completed, the cluster was restarted once more, and a failure was then discovered.

Failure symptoms:
The 19.207 server ran out of memory and went offline.
HeapInuse accounted for a large amount of memory: 95.9 GiB

| username: madcoder | Original post link

Check the system logs?
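
As a rough illustration of what checking the system logs could look like, the sketch below scans kernel log lines for OOM-killer activity. The message patterns are assumptions based on commonly seen Linux kernel output and vary between kernel versions; `find_oom_events` is a hypothetical helper name, not part of any TiDB tooling.

```python
import re

# Assumed patterns: wording differs between kernel versions.
OOM_LINE = re.compile(r"invoked oom-killer|Out of memory: Kill(?:ed)? process",
                      re.IGNORECASE)
KILLED = re.compile(r"Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_events(lines):
    """Return (process_name_or_None, line) for each OOM-related log line."""
    events = []
    for line in lines:
        if OOM_LINE.search(line):
            match = KILLED.search(line)
            # The process name is only present on the "Killed process" line.
            name = match.group(2) if match else None
            events.append((name, line.rstrip("\n")))
    return events
```

It could be fed the lines of a kernel log file (for example `/var/log/messages`, depending on the distribution) or the output of `dmesg`. An empty result does not rule out memory exhaustion: if the machine is rebooted before the OOM killer acts, nothing is logged.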

| username: 路在何chu | Original post link

A large SQL query probably brought it down. There should be logs for it.

| username: xiaoqiao | Original post link

Check the expensive logs in the dashboard to see if there are any SQL queries or other operations consuming a large amount of memory.

| username: zhanggame1 | Original post link

I restarted the machine without waiting for the OOM killer to act. I didn’t find anything useful in the system logs.

| username: TiDBer_QYr0vohO | Original post link

Take a look at the logs.

| username: 舞动梦灵 | Original post link

Only two TiKV servers? TiDB, PD, and TiKV are all installed on the same server? Do you have only one replica? If it’s one replica, there shouldn’t be an issue with TiKV. If it’s three replicas, there will be a problem with TiKV. According to the official documentation, for three replicas, you need at least three TiKV servers.

| username: zhanggame1 | Original post link

There are 6 TiKV nodes with 3 replicas.

| username: caiyfc | Original post link

Was everything normal when the Lightning import completed? Then, after restarting the cluster, you found that the TiDB node was using a lot of memory, so you manually restarted the machine, right? After restarting the machine, did you run any SQL, for example by connecting the application? Did anyone run SQL tests?

| username: zhanggame1 | Original post link

First, the import with Lightning was completed normally. Then, after restarting the cluster, one machine lost connection after 10 minutes. Upon checking, it was found that the memory usage was too high, causing it to freeze.

| username: caiyfc | Original post link

Did anyone execute SQL during those ten minutes? It would be best to search for the keyword “expensive” in the logs of the problematic TiDB node. If the memory increase was caused by SQL, this keyword will usually be present. If you can’t find it, the cause is probably something else.
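
A minimal sketch of that keyword search, assuming the TiDB log is a plain-text file read line by line. The `mem_max` pattern is an assumption about how memory usage appears in such entries; when it does not match, the value is simply reported as `None`. `find_expensive` is a hypothetical helper name.

```python
import re

# Case-insensitive match on the keyword suggested in the thread.
EXPENSIVE = re.compile(r"expensive", re.IGNORECASE)
# Assumed field format, e.g. [mem_max="1024 Bytes (1 KB)"].
MEM_MAX = re.compile(r'\[mem_max="?(\d+) Bytes')


def find_expensive(lines):
    """Return (line_number, mem_max_bytes_or_None, line) for matching lines."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if EXPENSIVE.search(line):
            match = MEM_MAX.search(line)
            mem = int(match.group(1)) if match else None
            hits.append((number, mem, line.rstrip("\n")))
    return hits
```

Passing an open file object works directly, for example `find_expensive(open("tidb.log", encoding="utf-8", errors="replace"))`, where `tidb.log` stands for whatever path the deployment actually uses. Sorting the hits by the second element puts the most memory-hungry statements first.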

| username: DBRE | Original post link

Refer to

| username: DBAER | Original post link

First, determine which component is experiencing OOM: TiDB or TiKV.

| username: Hacker007 | Original post link

A mixed deployment like yours is indeed sometimes difficult to analyze. Check Grafana to see which component is using the most memory.
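
If Grafana is not reachable, a similar per-component comparison can be sketched directly on the affected Linux host by reading `/proc`. This is only an illustration: the process names in `COMPONENTS` are assumptions about how the binaries are named on the host, and `parse_status` and `component_rss` are hypothetical helpers.

```python
import os

# Assumed process names; adjust to match the actual deployment.
COMPONENTS = ("tidb-server", "tikv-server", "pd-server")


def parse_status(text):
    """Extract (process_name, rss_kib) from the text of /proc/<pid>/status."""
    name, rss = None, 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss = int(line.split()[1])  # the kernel reports this value in kB
    return name, rss


def component_rss(proc_root="/proc"):
    """Sum resident memory (KiB) per component across all running processes."""
    totals = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, rss = parse_status(f.read())
        except OSError:
            continue  # the process exited or is not readable
        if name in COMPONENTS:
            totals[name] = totals.get(name, 0) + rss
    return totals
```

Calling `component_rss()` on the host returns a dictionary such as `{"tidb-server": ..., "tikv-server": ...}` with values in KiB, which shows at a glance which component holds the most resident memory at that moment. It only reflects the current state, so it has to be run before the machine is restarted.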

| username: zhanggame1 | Original post link

The screenshot above shows that HeapInuse is occupying over 90 GB.

| username: TIDB-Learner | Original post link

With a mixed deployment, you need to tune memory usage.

| username: Jack-li | Original post link

You need to check the logs.

| username: zhang_2023 | Original post link

Take a look at the large SQL statements.