TiDB server compute node memory is exhausted in a very short time, causing repeated automatic restarts

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb server计算节点内存在很短时间内被耗尽,反复自动重启。

| username: TiDBer_9lEfntcI

【TiDB Usage Environment】Production Environment
【TiDB Version】V5.4.3
【Reproduction Path】Two days after changing gc_life_time from 10m to 48h, this phenomenon appeared: the memory of the TiDB server compute nodes was exhausted in a very short time, causing repeated automatic restarts.
【Encountered Problem: Problem Phenomenon and Impact】
Problem Phenomenon: After TiDB restarts, the system runs slowly whenever user access volume is high.
Impact: The system cannot be used normally; queries are extremely slow, with query times hundreds of times longer than before.
【Resource Configuration】

【Attachments: Screenshots/Logs/Monitoring】
(two monitoring screenshots attached in the original post)

| username: Jasper | Original post link

Increasing the GC life time can cause a large accumulation of MVCC versions if the cluster deletes a lot of data. You can check the larger SQL statements and compare process_keys against total_keys to confirm whether this is the issue.
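
For example, something along these lines against the slow query table should show whether TiKV is scanning far more MVCC versions (total_keys) than it actually returns (process_keys); the time window and threshold below are only placeholders:

```sql
-- Slow queries where TiKV scanned far more MVCC versions than it returned,
-- which usually points to version build-up caused by a long GC life time.
SELECT
    Time,
    Query_time,
    Process_keys,
    Total_keys,
    Total_keys / GREATEST(Process_keys, 1) AS keys_ratio,
    LEFT(Query, 100) AS query_prefix
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE Time > NOW() - INTERVAL 1 DAY
  AND Total_keys > 1000000
ORDER BY keys_ratio DESC
LIMIT 20;
```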

| username: 像风一样的男子 | Original post link

There is an obvious accumulation of slow queries. Optimize the SQL.

| username: 小龙虾爱大龙虾 | Original post link

First, check the slow queries.

| username: 连连看db | Original post link

Isn’t the GC setting too large? If GC does not run for a long time, queries will slow down.

| username: zhanggame1 | Original post link

If there are many updates, setting the GC time too long is not a good idea, as it will significantly slow down query execution.

| username: TiDBer_9lEfntcI | Original post link

So, may I ask if TiDB provides any mechanisms to solve the issue of queries becoming slower?

| username: TiDBer_9lEfntcI | Original post link

So, are there any solutions?

| username: TiDBer_9lEfntcI | Original post link

Hello, so how can this problem be solved? Do you know the solution to this problem?

| username: TiDBer_9lEfntcI | Original post link

Are there any immediate solutions? Optimizing SQL statements is a long-term solution.

| username: TiDBer_9lEfntcI | Original post link

Is there a solution for this slow query?

| username: zhanggame1 | Original post link

Since it is related to GC time, why don’t you try adjusting it back?
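
Assuming the change was made through the tidb_gc_life_time system variable, reverting it is a single statement; note that the versions already accumulated will only be cleaned up gradually after GC runs again:

```sql
-- Put the GC life time back to the default 10 minutes
SET GLOBAL tidb_gc_life_time = '10m';

-- Confirm the value the GC worker is actually using
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME = 'tikv_gc_life_time';
```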

| username: TiDBer_9lEfntcI | Original post link

We have adjusted it back, but the problem still exists. We also tried throttling GC, but there was no improvement.

| username: 像风一样的男子 | Original post link

A GC life time of 48 hours is not the main cause; my cluster is set to 72 hours without any issues. You should focus on optimizing your SQL instead.

| username: 连连看db | Original post link

  1. Setting GC to 48 hours caused memory to rise quickly and the nodes to restart frequently. After adjusting it back, did the cluster return to normal and stop restarting?
  2. After the cluster restarts, queries are exceptionally slow when traffic is high. Have you compared the request volume, business version, and monitoring metrics before and after the restart?

To pinpoint the issue, you still need to look at the monitoring, especially the TiKV-related panels such as Region and Coprocessor. To rule out GC, check the GC-related metrics, such as GC tasks duration.
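
Besides Grafana, a quick check from SQL is to look at the GC bookkeeping rows in mysql.tidb and confirm that the safe point and last run time are still advancing (a read-only sketch):

```sql
-- GC leader, last run time, current safe point and effective settings
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN (
    'tikv_gc_leader_desc',
    'tikv_gc_last_run_time',
    'tikv_gc_safe_point',
    'tikv_gc_life_time',
    'tikv_gc_run_interval'
);
```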

| username: 小龙虾爱大龙虾 | Original post link

Bring in your business development team and the DBA, and optimize it together.

| username: Jellybean | Original post link

For your issue, see if you can reduce the GC time, for example, restore it to the default 10 minutes. Observe if the cluster’s performance improves.

Then analyze the slow queries, starting with troubleshooting from the TiDB Dashboard.
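
If the Dashboard is not convenient, roughly the same top-N view can be pulled over SQL; this is only a sketch, and CLUSTER_SLOW_QUERY covers all TiDB instances, so keep the time window small:

```sql
-- Top 20 statements by query time across TiDB instances in the last hour,
-- including max memory per statement to spot the ones driving OOM restarts.
SELECT
    INSTANCE,
    Time,
    Query_time,
    Mem_max,
    Digest,
    LEFT(Query, 100) AS query_prefix
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Time > NOW() - INTERVAL 1 HOUR
ORDER BY Query_time DESC
LIMIT 20;
```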

| username: tidb菜鸟一只 | Original post link

Go straight to the slowest SQL statements and check whether delete_skipped_count is high. If it is, a lot of data has been deleted, and because GC has not cleaned it up yet, queries end up scanning a large amount of deleted data.
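
In the slow query table this counter shows up as Rocksdb_delete_skipped_count; a minimal sketch of the check (the threshold is arbitrary):

```sql
-- Statements whose scans skipped many deleted-but-not-yet-GCed keys
SELECT
    Time,
    Query_time,
    Rocksdb_delete_skipped_count,
    Rocksdb_key_skipped_count,
    LEFT(Query, 100) AS query_prefix
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE Rocksdb_delete_skipped_count > 100000
ORDER BY Rocksdb_delete_skipped_count DESC
LIMIT 20;
```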

| username: oceanzhang | Original post link

Increase the frequency of GC.
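
For reference, the GC trigger interval on this version is the tidb_gc_run_interval system variable (default 10m0s, which as far as I know is also its lower bound), so in practice this mostly means making sure it has not been set larger than the default:

```sql
-- Check the current GC interval and, if it was raised, put it back to the default
SHOW VARIABLES LIKE 'tidb_gc_run_interval';
SET GLOBAL tidb_gc_run_interval = '10m0s';
```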

| username: andone | Original post link

Try restoring the GC configuration.