TiDB: TiKV Node Memory Keeps Growing Until OOM, Gets Killed at the Memory Limit, and Grows Again After Restart

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb tikv 节点内存不停增长到oom 限制大小被kill 重启后继续增长

| username: TiDBer_ZsnVPQB4

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.0
[Reproduction Path] After upgrading from v5.2.1 to v6.5.0, the memory of the TiKV node keeps growing until the process is killed by the system OOM killer; after restarting, the memory grows again and the cycle repeats.
[Encountered Problem: Phenomenon and Impact]
Same as above: since the upgrade, TiKV memory grows continuously until the process is OOM-killed, then the cycle repeats after restart.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
tidb-dashboard (screenshot)


Zabbix monitoring (screenshot)


  • THP is disabled.
  • Adjusting gc_life_time from 24h down to 2h did not help.
  • Everything else uses default parameters; only some memory-control parameters are set on the TiDB nodes.
  • TiKV memory parameters have also been checked:
    storage.block-cache.shared = true (default)
    RocksDB storage.block-cache.capacity is at its default; with 32 GB of total memory it is calculated as 14.4 GB, i.e. 32 GB × 45%
    write_buffer_size is at its 128 MB default
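
For reference, the values above can be double-checked from the SQL client; a minimal sketch (host, port, and credentials are placeholders):

# Show the TiKV block cache and write buffer settings currently in effect
mysql -h <tidb-host> -P 4000 -u root -p -e "SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'storage.block-cache%'; SHOW CONFIG WHERE type = 'tikv' AND name LIKE '%write-buffer-size%';"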

| username: db_user | Original post link

Check the analyze-related configurations and the execution status of analyze:

show variables like '%analyze%';
show analyze status;

Check the dashboard to see if there are a large number of slow queries during this period.
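
To find the slow queries with the highest memory usage in a given window, something like this could also help (a sketch; host, credentials, and the time filter are placeholders):

# Top 10 slow queries by peak memory in the last hour
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT time, query_time, mem_max, LEFT(query, 80) AS query_text FROM information_schema.cluster_slow_query WHERE time > NOW() - INTERVAL 1 HOUR ORDER BY mem_max DESC LIMIT 10;"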

| username: TiDBer_ZsnVPQB4 | Original post link

(The original reply contained only screenshots.)

| username: db_user | Original post link

It seems the analyze itself is fine. Check the slow logs: some execution plans may have gone wrong after the upgrade, since the optimizer logic can differ between versions. Look at which statements caused the memory surge and test those slow SQLs. Some SQLs may take more than ten seconds to execute, and a hash join chosen incorrectly on large tables can also consume a lot of memory. Check the slow SQLs during the memory surge period and investigate them one by one.
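
For instance, the plan of a suspect statement can be inspected directly; the tables and columns below are hypothetical, and if a very large table unexpectedly picks a hash join, hints such as INL_JOIN or MERGE_JOIN can be tested:

# Check whether the optimizer chooses HashJoin for a suspect query (hypothetical tables)
mysql -h <tidb-host> -P 4000 -u root -p -e "EXPLAIN SELECT a.id, b.val FROM big_table a JOIN other_table b ON a.id = b.id WHERE a.create_time > NOW() - INTERVAL 1 DAY;"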

| username: TiDBer_ZsnVPQB4 | Original post link

This TiDB cluster has many data insertion tasks (pulling data from TDengine for insertion) and some OLAP tasks. Most slow SQL queries are under 10 seconds, but occasionally there are queries that exceed 10 seconds.

| username: tidb菜鸟一只 | Original post link

Try adjusting grpc-memory-pool-quota, but this can only be an emergency measure against OOM. Reference: TiKV Configuration File | PingCAP Docs.
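
A minimal sketch of setting that quota with tiup (the cluster name and the 16G value are placeholders; pick a value that fits your hosts):

# Edit the topology and add the quota under the TiKV server settings
tiup cluster edit-config <cluster-name>
# then add, under server_configs -> tikv:
#   server.grpc-memory-pool-quota: "16G"
# apply the change to the TiKV nodes
tiup cluster reload <cluster-name> -R tikv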

| username: tidb狂热爱好者 | Original post link

  • First, check the process status and logs of tidb-server for any errors or panic messages.
  • Then, check the process status and logs of pd-server to confirm it started successfully and its port is listening: lsof -i:port.
  • If pd-server is fine, check connectivity between the tidb-server machine and the pd-server port: make sure the network segment is reachable and the service port has been added to the firewall whitelist. Tools such as nc or curl can be used (see the command sketch after this list).
  • Finally, connect to tidb-server with the MySQL client and check whether SQL statements execute normally.
    In fact, most OOM alerts simply mean the workload your application sends has exceeded what the TiDB database can handle.
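
A rough sketch of those checks from the shell (default ports 2379/4000; hosts are placeholders):

# Confirm pd-server is listening on its client port
lsof -i:2379
# Check reachability from the tidb-server machine to PD
nc -zv <pd-host> 2379
curl http://<pd-host>:2379/pd/api/v1/members
# Verify SQL can be executed through tidb-server
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT tidb_version();"
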
| username: TiDBer_ZsnVPQB4 | Original post link

The tiup cluster display command shows everything is normal. After connecting with mysql -uroot -p, I checked show processlist: queries, inserts, and updates are all executing, and the cluster is functioning normally. The key point is that there were no problems before the upgrade; the issues only appeared after it.

| username: weixiaobing | Original post link

It might be caused by the execution plan cache. You can try disabling tidb_enable_prepared_plan_cache.
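
If it helps, a one-line sketch for turning it off globally (host and credentials are placeholders; it takes effect for new sessions):

# Disable the prepared plan cache cluster-wide
mysql -h <tidb-host> -P 4000 -u root -p -e "SET GLOBAL tidb_enable_prepared_plan_cache = OFF;"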

| username: tidb狂热爱好者 | Original post link

The gap between the 5.4 and 6.0 series is quite large. The code has been completely rewritten.

| username: Jellybean | Original post link

There are generally two situations where TiKV node memory OOM occurs:

  • The TiKV block cache is set too large.
  • The coprocessor reads data quickly and buffers it in TiKV’s memory, while gRPC sends the results to the TiDB server more slowly, so data accumulates in TiKV and leads to OOM.

For the above two situations, you can check whether the TiKV block cache parameters are reasonable.
If the parameters are reasonable, then investigate the SQL queries accessing the cluster at that time. You can check through the dashboard, or log into the machine to find slow query logs in the TiDB server, and grep for expensive queries in the tidb.log. In most cases, you will find the corresponding SQL.

Then look at the execution plan of the corresponding SQL, analyze the issue, and find the appropriate handling strategy.
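
For the expensive-query check, a sketch of the grep (the log path is a placeholder for your deployment directory):

# TiDB tags such statements with expensive_query in its log
grep 'expensive_query' /path/to/deploy/log/tidb.log | tail -n 20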

| username: TiDBer_ZsnVPQB4 | Original post link

I disabled it, but there was no effect, so that should not be the cause.

| username: Min_Chen | Original post link

Hello,

Can you capture the debug information before it runs out of memory?

| username: TiDBer_ZsnVPQB4 | Original post link

How exactly should this be done? I also want to see what specifically is consuming so much. Besides the block cache, write buffer, execution plan cache, and so on, this database mostly runs frequent replace and update operations.

| username: TiDBer_ZsnVPQB4 | Original post link

The TiDB node does not OOM; it is the TiKV node that keeps OOMing. The command curl http://127.0.0.1:10080/debug/zip --output tidb_debug.zip analyzes the TiDB node, right?
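
In the meantime, a simple sketch that at least records TiKV memory growth over time (assuming the default TiKV status port 20180; adjust paths as needed):

# Sample tikv-server RSS and a metrics snapshot once a minute
while true; do
  ts=$(date +%Y%m%d-%H%M%S)
  echo "$ts $(ps -o rss= -C tikv-server)" >> tikv_rss.log   # RSS in KB
  curl -s http://127.0.0.1:20180/metrics > metrics-$ts.txt   # Prometheus metrics snapshot
  sleep 60
done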

| username: db_user | Original post link

I think I know what the situation is. Run the following on each of your TiDB nodes and check the logs:

cat tidb.log | grep 'error occurred when read table stats' | awk -F 'table=' '{print $2}' | awk -F ']' '{print $1}' | sort -u

If there are no surprises, this will list some tables whose statistics cannot be collected because of the “data too long” bug. If that is the problem, fix it following this post:
analyze bug - TiDB Q&A community.

| username: TiDBer_ZsnVPQB4 | Original post link

The first node’s tidb.log does not have it.


The second node’s tidb.log does not have it either.

There are not many ERROR logs in tidb.log.

There are quite a few WARN logs like the one below: connections with a bunch of machines behind the cloud SLB keep being reset. The architecture is that the TiDB nodes are attached behind the SLB; those machines behind the SLB are transparent to users, who cannot see or connect to them directly.

| username: db_user | Original post link

Then that's probably not it. But according to your diagnostic data, the stats part also takes up quite a lot, which looks like it is affected by analyze. It's quite strange.

| username: wuxiangdong | Original post link

Is there a problem with the GC settings?

| username: TiDBer_ZsnVPQB4 | Original post link

There are several large tables: one is 500 GB, one is 300 GB, and one is 100 GB. The heavy stats usage you mentioned might be related to them. Later we will see whether turning off statistics collection helps reclaim the memory.
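
A sketch of what disabling automatic statistics collection could look like (host and credentials are placeholders; note that statistics will become stale over time):

# Turn off automatic analyze cluster-wide
mysql -h <tidb-host> -P 4000 -u root -p -e "SET GLOBAL tidb_enable_auto_analyze = OFF;"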