TiDB process crashes after consuming all system memory when the cluster is idle

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群无负载时 tidb 进程吃满系统内存后挂掉

| username: yiding-he

[TiDB Usage Environment] Production Environment
[TiDB Version] 7.3.0
[Reproduction Path]
After inserting about 400,000 records into a table, all SQL statements had finished executing, and the cluster was idle with no incoming requests.
[Encountered Problem: Phenomenon and Impact]
It was observed that on one server in the cluster, the CPU and memory usage of the tidb process were abnormally high. This phenomenon continued until the tidb process consumed all the system memory, eventually crashing and restarting itself.
[Resource Configuration]
The cluster consists of 5 machines. Since it is a production environment, IP addresses and other irrelevant information have been omitted in the content below.

Cluster type:       tidb
Cluster name:       0000
Cluster version:    v7.3.0
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://00.00.00.001:2379/dashboard
Grafana URL:        http://00.00.00.001:3000
ID                  Role          Host          Ports                            OS/Arch       Status   Data Dir                      Deploy Dir
--                  ----          ----          -----                            -------       ------   --------                      ----------
00.00.00.001:9093   alertmanager  00.00.00.001  9093/9094                        linux/x86_64  Up       /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
00.00.00.001:3000   grafana       00.00.00.001  3000                             linux/x86_64  Up       -                             /tidb-deploy/grafana-3000
00.00.00.001:2379   pd            00.00.00.001  2379/2380                        linux/x86_64  Up|L|UI  /tidb-data/pd-2379            /tidb-deploy/pd-2379
00.00.00.001:9090   prometheus    00.00.00.001  9090/12020                       linux/x86_64  Up       /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
00.00.00.001:4000   tidb          00.00.00.001  4000/10080                       linux/x86_64  Up       -                             /tidb-deploy/tidb-4000
00.00.00.002:4000   tidb          00.00.00.002  4000/10080                       linux/x86_64  Up       -                             /tidb-deploy/tidb-4000
00.00.00.003:4000   tidb          00.00.00.003  4000/10080                       linux/x86_64  Up       -                             /tidb-deploy/tidb-4000
00.00.00.004:4000   tidb          00.00.00.004  4000/10080                       linux/x86_64  Up       -                             /tidb-deploy/tidb-4000
00.00.00.005:4000   tidb          00.00.00.005  4000/10080                       linux/x86_64  Up       -                             /tidb-deploy/tidb-4000
00.00.00.002:9000   tiflash       00.00.00.002  9000/8123/3930/20170/20292/8234  linux/x86_64  Up       /tidb-data/tiflash-9000       /tidb-deploy/tiflash-9000
00.00.00.003:9000   tiflash       00.00.00.003  9000/8123/3930/20170/20292/8234  linux/x86_64  Up       /tidb-data/tiflash-9000       /tidb-deploy/tiflash-9000
00.00.00.004:9000   tiflash       00.00.00.004  9000/8123/3930/20170/20292/8234  linux/x86_64  Up       /tidb-data/tiflash-9000       /tidb-deploy/tiflash-9000
00.00.00.005:9000   tiflash       00.00.00.005  9000/8123/3930/20170/20292/8234  linux/x86_64  Up       /tidb-data/tiflash-9000       /tidb-deploy/tiflash-9000
00.00.00.001:20160  tikv          00.00.00.001  20160/20180                      linux/x86_64  Up       /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
00.00.00.002:20160  tikv          00.00.00.002  20160/20180                      linux/x86_64  Up       /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
00.00.00.003:20160  tikv          00.00.00.003  20160/20180                      linux/x86_64  Up       /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
00.00.00.004:20160  tikv          00.00.00.004  20160/20180                      linux/x86_64  Up       /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
00.00.00.005:20160  tikv          00.00.00.005  20160/20180                      linux/x86_64  Up       /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
Total nodes: 18

[Attachments: Screenshots/Logs/Monitoring]


| username: h5n1 | Original post link

Have you checked the slow queries to confirm that nothing is actually executing? Check the statistics collection time with `SHOW ANALYZE STATUS` to see whether it matches the time of the memory spike.
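For reference, the analyze history can be inspected like this (a minimal sketch; the table name `t1` is a placeholder):

```sql
-- List recent ANALYZE jobs: table, start time, state, and progress.
SHOW ANALYZE STATUS;

-- Narrow the output to one table if the list is long (t1 is a placeholder).
SHOW ANALYZE STATUS WHERE Table_name = 't1';
```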

| username: tidb菜鸟一只 | Original post link

It might really be an issue with statistics collection.

| username: 像风一样的男子 | Original post link

When the load is high, run `top -Hp $(pidof tidb-server)` on the server to see which thread is busy.

| username: yiding-he | Original post link

It is indeed executing the analyze operation:

It seems that this process is uncontrollable. The node runs both TiDB and TiKV, which might lead to memory contention. I am not sure how to handle this.

Additionally, this table has 1007 fields. I am not sure if the resource consumption of analyze is related to the table structure.

| username: zhanggame1 | Original post link

1007 fields are quite excessive; most people won’t encounter that.

| username: zhanggame1 | Original post link

If TiDB and TiKV are deployed simultaneously, component resource limitations should be applied according to the hybrid deployment method.
Hybrid Deployment Topology | PingCAP Documentation Center

Three-Node Hybrid Deployment Best Practices | PingCAP Documentation Center

| username: 大飞哥online | Original post link

`SHOW VARIABLES LIKE '%analyze%';`
Check the statistics-related variables; you can change the time window for automatic statistics collection.
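For example, the auto-analyze window can be restricted to off-peak hours via system variables (a sketch; the specific times and the `+0800` offset are only an illustration — use your own time zone):

```sql
-- Inspect the current analyze-related settings.
SHOW VARIABLES LIKE '%analyze%';

-- Only allow automatic ANALYZE to start between 01:00 and 05:00
-- (times and the +0800 offset here are example values).
SET GLOBAL tidb_auto_analyze_start_time = '01:00 +0800';
SET GLOBAL tidb_auto_analyze_end_time   = '05:00 +0800';
```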

| username: 大飞哥online | Original post link

The error message indicates that the TiDB server is out of memory. You can try the following solutions:

  1. Increase the memory of the TiDB server.
  2. Optimize the SQL query to reduce memory usage.
  3. Adjust the TiDB configuration to limit memory usage, such as setting `oom-use-tmp-storage` to `true` and configuring `tmp-storage-path` and `tmp-storage-quota`.
  4. Split large transactions into smaller ones to reduce memory consumption.

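A per-statement memory quota can also cap how much a single query may use (a sketch, assuming TiDB ≥ v6.1 where `tidb_mem_quota_query` is a system variable; the 1 GiB figure is only an example):

```sql
-- Cap each SQL statement at 1 GiB of memory. What happens on breach
-- (cancel the query, or spill to disk) depends on the OOM action and
-- tmp-storage settings.
SET GLOBAL tidb_mem_quota_query = 1073741824;  -- 1 GiB, in bytes
```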
| username: Fly-bird | Original post link

Did you find the reason?

| username: 大飞哥online | Original post link

Automatically ran statistics collection.

| username: yiding-he | Original post link

Thank you for the reminder, but after checking, I found that my current configuration complies with the hybrid deployment topology. Moreover, the configuration for hybrid deployment is targeted at the TiKV component, and there are no configuration items for the memory usage of the TiDB component.

| username: 大飞哥online | Original post link

Did you deploy TiDB and TiKV together?

| username: zhanggame1 | Original post link

TiDB memory limit configuration
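For reference, since v6.4 the tidb-server process itself can be capped with the `tidb_server_memory_limit` system variable (a sketch; the 70% figure is only an example — the default is 80%):

```sql
-- Limit the tidb-server instance to 70% of total system memory.
-- As usage approaches the limit, TiDB cancels the most memory-hungry
-- SQL operation rather than letting the process be OOM-killed.
SET GLOBAL tidb_server_memory_limit = '70%';
```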

| username: yiding-he | Original post link

Thank you for the reminder. For now, I’ll use this configuration to avoid memory overload. Although I still don’t understand why analyzing the table takes so long and consumes so much CPU, I’ll observe it for a while. If it doesn’t work out, I’ll have to disable its automatic analysis.
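If it comes to that, automatic statistics collection can be switched off entirely (a sketch; note that stale statistics can degrade plan quality, so this is a last resort and statistics must then be refreshed manually):

```sql
-- Disable the background auto-analyze worker.
SET GLOBAL tidb_enable_auto_analyze = OFF;

-- Statistics then need manual refreshes, e.g. (t1 is a placeholder):
-- ANALYZE TABLE t1;
```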

| username: 像风一样的男子 | Original post link

Are you using 7.3? That is a DMR (Development Milestone Release), not a long-term support release, and it is not recommended for production.

| username: 大飞哥online | Original post link

Going straight to 7.3 in production is impressive :grinning: