tidb-server OOM Keeps Restarting

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb-server OOM 不停的重启

| username: porpoiselxj

[TiDB Usage Environment] Testing
[TiDB Version] V7.1.1
[Reproduction Path]
Not sure if it can be reproduced
[Encountered Issue: Problem Phenomenon and Impact]
Around 22:00 on November 14th, nearly 30 tables were imported using tidb-lightning, with a total data volume of less than 100 million. After the import was completed, the tidb-server started to OOM and kept restarting in a loop. No errors were found in the logs.
The heatmap shows a significant increase in data reads for system tables starting with stats_ after the data import.
No slow queries were found. The current cluster is not yet officially open, and there are only simple insert, delete, and update operations with very few queries.

[Resource Configuration]
Three hosts, 64 cores and 128 GB RAM each, mixed deployment: one PD, one TiKV, and one tidb-server per machine.


| username: wenyi | Original post link

My deployment method is similar to yours, and my virtual machine configuration is worse than yours. I imported hundreds of tables with a total data volume of around 1TB, and the largest table has more than 300 million rows. I didn’t encounter the problem you’re describing.

| username: wenyi | Original post link

Are you running some SQL statements? Without running SQL statements, the TiDB server can’t possibly run out of memory (OOM).

| username: 芮芮是产品 | Original post link

Post the architecture diagram.

| username: 芮芮是产品 | Original post link

The reason is mixed deployment. TiDB and TiKV are competing for memory, causing continuous OOM (Out of Memory) issues.

| username: 芮芮是产品 | Original post link

You should deploy them separately. By default, TiDB assumes it can use all of the machine's memory, and so does TiKV; whichever one grows larger ends up triggering the OOM.
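If splitting the components onto separate machines isn't an option, a common mitigation is to cap each component explicitly. A minimal sketch, assuming a 128 GB host; the sizes are illustrative, not recommendations (tidb_server_memory_limit exists from v6.4, and a SET CONFIG change is not persisted across restarts, so also write the TiKV value into the topology with tiup edit-config):

-- Cap tidb-server's total memory usage (illustrative size).
SET GLOBAL tidb_server_memory_limit = '48GB';

-- Cap TiKV's block cache online (illustrative size).
SET CONFIG tikv `storage.block-cache.capacity` = '32GiB';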

| username: zhanggame1 | Original post link

For hybrid deployment, configure the memory limit properly, and then check if there are any slow queries that particularly consume a lot of memory.

| username: WalterWj | Original post link

Try upgrading to 7.1.2 and see if that works.

If it still doesn’t work, you’ll need to capture a flame graph and consult the R&D team in the feedback section.

| username: wzf0072 | Original post link

Run SHOW VARIABLES LIKE '%oom%'; and check how the parameter tidb_mem_oom_action is set.
We previously had it set to LOG and the tidb-server restarted frequently. After changing it to CANCEL it was fine (SQL execution is cancelled when it exceeds tidb_mem_quota_query).
SET GLOBAL tidb_mem_oom_action = 'CANCEL';
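For reference, a quick way to check both settings, plus an illustrative (not prescriptive) per-query quota:

SHOW VARIABLES LIKE 'tidb_mem_oom_action';
SHOW VARIABLES LIKE 'tidb_mem_quota_query';

-- Illustrative: lower the per-query quota (in bytes) so runaway statements
-- are cancelled before the whole tidb-server process gets OOM-killed.
SET GLOBAL tidb_mem_quota_query = 2147483648;  -- 2 GiB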

| username: 芮芮是产品 | Original post link

tidb_mem_oom_action

| username: 芮芮是产品 | Original post link

tidb_mem_oom_action = 'CANCEL'

| username: xingzhenxiang | Original post link

Let’s take a look at the slow queries.

| username: wzf0072 | Original post link

Did TiKV’s memory usage exceed the limit during the failure period?
Additionally, check the slow queries from the TiDB Dashboard and view the SQL memory usage during the failure period in descending order of memory usage.
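For example, something along these lines against the slow query table, sorted by memory usage (column names as in recent TiDB versions; the time window below assumes the failure period from the post, including the year, so adjust it to yours):

SELECT Time, DB, Query_time, Mem_max, Query
FROM information_schema.cluster_slow_query
WHERE Time BETWEEN '2023-11-14 21:30:00' AND '2023-11-14 23:30:00'
ORDER BY Mem_max DESC
LIMIT 20;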

| username: xingzhenxiang | Original post link

I've run into slow SQL with high memory usage before. Can you check whether the high-memory SQL is executed frequently? If it is, try optimizing it.

| username: tidb菜鸟一只 | Original post link

After a Lightning import completes, statistics need to be collected for the imported tables. If the tables are fairly large, that analyze can indeed cause an OOM.
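If the post-import analyze is the trigger, a hedged sketch of one mitigation (the quota value is illustrative, the table name is hypothetical, and tidb_mem_quota_analyze exists from v6.1):

-- Cap the memory that ANALYZE (manual or automatic) may use, in bytes.
SET GLOBAL tidb_mem_quota_analyze = 10737418240;  -- 10 GiB, illustrative

-- Collect statistics for each imported table at a controlled time.
ANALYZE TABLE my_db.my_imported_table;

-- Optionally confine auto analyze to off-peak hours.
SET GLOBAL tidb_auto_analyze_start_time = '01:00 +0800';
SET GLOBAL tidb_auto_analyze_end_time = '05:00 +0800';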

| username: porpoiselxj | Original post link

Already tried collecting, but it had no effect.

| username: porpoiselxj | Original post link

Before importing 32 new tables yesterday, the cluster already had 8TB of data and no anomalies were found. The problem appeared after the new data was imported yesterday.

| username: porpoiselxj | Original post link

After reviewing the monitoring, I found that the slow queries are all querying system tables, which is consistent with the flame graph results above.
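To confirm whether background statistics collection is what is reading those stats_ tables during the restarts, something like this can be checked (a hedged sketch; both are available on v7.1):

SHOW ANALYZE STATUS;

-- Recent analyze job history (v6.1+).
SELECT table_schema, table_name, state, start_time, end_time, fail_reason
FROM mysql.analyze_jobs
ORDER BY start_time DESC
LIMIT 20;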

| username: porpoiselxj | Original post link

I have already tried setting memory usage limits for both tidb-server and TiKV, but it didn't help.

| username: porpoiselxj | Original post link

This parameter was already set to CANCEL a long time ago.