TiDB 6.1 Automatic Restart Due to OOM After 128GB Memory Exhaustion

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB6.1 自动重启,查看128G内存使用完,导致OOM

| username: Hacker_g3b9VBO9

Production environment TiDB cluster version: v6.1.1

View monitoring

| username: Billmay表妹 | Original post link

Based on the information you provided, it might be a memory leak in the TiDB Server causing OOM, or high memory usage in TiKV causing OOM. It is recommended to troubleshoot through the following steps:

  1. Check the memory usage of the TiDB Server. You can use the top or htop command to view the memory usage of the TiDB Server process. If the memory usage of the TiDB Server process is too high, you can collect profiling data for a flame graph using the command curl -G "http://{TiDB_IP}:10080/debug/zip?seconds=30" > profile.zip to see where the memory is being consumed.

  2. Check the memory usage of TiKV. You can use the top or htop command to view the memory usage of the TiKV process. If you find that the memory usage of the TiKV process is too high, you can collect heap memory information using the command curl -G http://{TiKV_IP}:20180/debug/pprof/heap > heap to see the specific location of memory consumption.

  3. Check the logs of TiDB Server and TiKV to see if there are any abnormal error messages, such as OOM-related error messages.

  4. Check the configuration files of TiDB Server and TiKV to confirm whether improper configuration is causing high memory usage.

  5. Check the system logs of the machines where TiDB Server and TiKV are located to see if there are any system-level OOM error messages.
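
Steps 1, 2, and 5 above can be sketched as a few shell helpers. This is a minimal sketch, not an official procedure: the function names (profile_url, heap_url, oom_lines) are hypothetical, the ports 10080 and 20180 are the ones used in the commands above, and the grep pattern is an assumption about what typical Linux kernel OOM-killer messages look like.

```shell
# Hypothetical helpers for the troubleshooting steps above (POSIX sh).

# Build the TiDB profiling URL from step 1.
# $1 = TiDB server IP, $2 = sampling duration in seconds (defaults to 30).
profile_url() {
  printf 'http://%s:10080/debug/zip?seconds=%s\n' "$1" "${2:-30}"
}

# Build the TiKV heap profile URL from step 2.
# $1 = TiKV server IP.
heap_url() {
  printf 'http://%s:20180/debug/pprof/heap\n' "$1"
}

# Read kernel log text on stdin and keep only lines that look like
# OOM-killer activity (step 5). Prints nothing when there are no matches.
oom_lines() {
  grep -iE 'out of memory|oom-kill' || true
}

# Example usage on a real host (placeholder IPs, not from the thread):
#   ps -eo pid,rss,comm --sort=-rss | head      # largest processes by RSS
#   curl -G "$(profile_url 10.0.0.1)" > profile.zip
#   curl -G "$(heap_url 10.0.0.2)" > heap
#   dmesg -T | oom_lines
```

If oom_lines shows the kernel killing tidb-server or tikv-server, that confirms a system-level OOM and tells you which process to profile first.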

I hope the above information can help you solve the problem. If you have any other questions, please feel free to ask.

| username: Hacker_g3b9VBO9 | Original post link

| username: Billmay表妹 | Original post link

It looks like the resources are a bit insufficient~

Check the documentation and make some adjustments.

| username: Hacker_g3b9VBO9 | Original post link

Take a closer look: the problem is not insufficient CPU. The memory is exhausted, and the data distribution is uneven.

| username: Billmay表妹 | Original post link

Refer to this article:

You can choose to optimize or add configurations~

| username: Billmay表妹 | Original post link

You can also take a look at this article.

| username: wakaka | Original post link

When multiple instances are deployed on a single machine, memory limits need to be set, right? What are the current memory parameters?
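
For context: TiKV's storage.block-cache.capacity defaults to 45% of total system memory, so several TiKV instances on one host can together ask for more memory than the host has. The hybrid-deployment guidance in the TiDB documentation suggests roughly MEM_TOTAL * 0.5 / (number of TiKV instances) per instance. Below is a minimal sketch of that arithmetic; the function name block_cache_gb is hypothetical, and the instance count of 2 in the example is an assumption, since the thread does not say how many instances share the 128 GB host.

```shell
# Hypothetical helper: suggested storage.block-cache.capacity per TiKV
# instance, in whole GB, when several instances share one host.
# $1 = total host memory in GB, $2 = number of TiKV instances on the host.
block_cache_gb() {
  if [ "$2" -le 0 ]; then
    echo "instance count must be positive" >&2
    return 1
  fi
  # MEM_TOTAL * 0.5 / instances, using integer arithmetic.
  echo $(( $1 * 50 / 100 / $2 ))
}

# Example: a 128 GB host with 2 TiKV instances (instance count assumed).
#   block_cache_gb 128 2    # prints 32
```

The result would go into each TiKV instance's configuration as storage.block-cache.capacity (for example "32GB"). TiDB Server memory is limited separately and is not covered by this calculation.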

| username: Hacker_g3b9VBO9 | Original post link

Adjusted

| username: jaybing926 | Original post link

I once ran into a case where TiKV frequently hit OOM and restarted. At that time, TiKV was bound to specific CPU cores, and the memory available to those cores was less than the maximum memory TiKV configures by default, which caused the OOM.

Is it the same for you? Did you bind the cores?
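
One way to check for this situation, assuming the binding was done per NUMA node (the post does not say how the cores were bound), is to compare the memory on each node with what TiKV is configured to use. A minimal sketch follows; the function name node_mem_gb is hypothetical, and it expects input in the format Linux exposes under /sys/devices/system/node/node*/meminfo.

```shell
# Hypothetical helper: print total memory per NUMA node in whole GiB.
# Reads lines such as "Node 0 MemTotal:       67108864 kB" on stdin
# and ignores every other field (MemFree, MemUsed, ...).
node_mem_gb() {
  awk '$3 == "MemTotal:" { printf "node%s %d\n", $2, $4 / 1024 / 1024 }'
}

# Example usage on a Linux host:
#   cat /sys/devices/system/node/node*/meminfo | node_mem_gb
```

If a TiKV process is bound to one node and that node's total is smaller than the instance's configured memory, the process can be OOM-killed even though the host as a whole still has free memory.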

| username: Hacker_g3b9VBO9 | Original post link

No core binding was done; it's the default configuration.

| username: buddyyuan | Original post link

Take a look at this panel.