TiDB Triggers Frequent TiKV OOM After Deleting Large Amounts of Data

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB 删除大量数据后触发TIKV频繁OOM

| username: residentevil

【TiDB Usage Environment】Production Environment
【TiDB Version】V6.1.7
【Reproduction Path】The original cluster held 70 TB; after all of the data was deleted via DROP DATABASE, the TiKV instances frequently hit OOM
【Encountered Problem: Symptoms and Impact】
【Attachments: Screenshots/Logs/Monitoring】

| username: 像风一样的男子 | Original post link

Are there any errors in the KV logs?

| username: 大飞哥online | Original post link

Check the monitoring information. After deleting the data, it should run GC.
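
As a quick check, GC progress is also visible from SQL via the mysql.tidb table (a read-only query, no extra setup needed):

SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';

If tikv_gc_safe_point keeps advancing after the DROP DATABASE, GC itself is running; whether the space actually gets reclaimed then depends on the delete-range and compaction work on the TiKV side.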

| username: Kongdom | Original post link

Check the TiKV node logs for any useful information.

| username: residentevil | Original post link

The OOM itself does not leave an error log, and TiKV writes all log levels into the same .log file rather than splitting them into separate files by level. :joy:

| username: residentevil | Original post link

Native RocksDB on its own definitely wouldn't hit this kind of problem, so it doesn't look like a RocksDB issue; it's more likely related to GC. Is GC single-threaded?

| username: Jellybean | Original post link

First, let’s not make any other assumptions. Please provide the logs and monitoring graphs of the cluster anomalies. We can confirm the specific issue and solution after diagnosis and evaluation.

| username: residentevil | Original post link

What monitoring is needed?

| username: Jellybean | Original post link

The main things to check are the logs of the nodes where the OOM issue occurs and the Grafana monitoring.
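
If digging through the raw .log files is a pain, error-level TiKV lines can also be pulled via SQL from information_schema.cluster_log (a rough sketch; the time window below is only a placeholder, and this table generally requires a time-range condition):

SELECT time, instance, message
FROM information_schema.cluster_log
WHERE type = 'tikv'
  AND level = 'error'
  AND time > '2023-11-01 00:00:00'
  AND time < '2023-11-02 00:00:00'
LIMIT 100;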

| username: residentevil | Original post link

[ERROR] [peer.rs:4976] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 13503490 store_id: 18"] [peer_id=13503491] [region_id=13503488] [type=MsgHibernateRequest]

| username: residentevil | Original post link

If it's a GC performance issue, are there any GC parameters that can be tuned, such as increasing the number of GC threads?

| username: Hacker007 | Original post link

When querying, you can use set tidb_mem_quota_query=3607374182; to set it. This limits the memory a single query can use (it is a session-scoped variable, not a transaction-level setting).
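
For example (the quota value is just a placeholder; pick a limit that fits your machines):

-- check the current per-query memory quota
SHOW VARIABLES LIKE 'tidb_mem_quota_query';
-- set it for the current session, e.g. to 4 GiB
SET tidb_mem_quota_query = 4 << 30;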

| username: residentevil | Original post link

There are no online read requests; they have all been stopped.

| username: TiDBer_小阿飞 | Original post link

GC tuning is not just about increasing threads; it also involves things like its own checks, the execution interval, and so on.
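
Roughly, the knobs involved look like this (the SET line is only an example value, not a recommendation):

-- TiDB-side GC variables: life time, run interval, Resolve Locks concurrency, etc.
SHOW GLOBAL VARIABLES LIKE 'tidb_gc%';
-- example: raise the Resolve Locks concurrency from the default auto setting (-1)
SET GLOBAL tidb_gc_concurrency = 8;
-- TiKV-side GC settings, e.g. gc.max-write-bytes-per-sec and gc.enable-compaction-filter
SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'gc.%';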

| username: 有猫万事足 | Original post link

My experience is that as long as the block cache in TiKV is set correctly, it is not easy to run into OOM there; TiDB, by contrast, is a bit harder to keep under control.

So I find it hard to attribute this to GC. It is more likely that the OOM comes from mixed deployment plus a block-cache setting that is too large.

show config where name like 'storage.block-cache.capacity'

Check what this value is.
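
If it turns out to be oversized for the machine, it can also be lowered online without restarting TiKV (the 8GiB below is only an example value, and note that SET CONFIG changes are not persisted to the config file):

SET CONFIG tikv `storage.block-cache.capacity` = '8GiB';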

| username: residentevil | Original post link

This setting exceeds 20G.

| username: 有猫万事足 | Original post link

This is really strange. You already have 100 GB, as mentioned above. The block cache plus other usage might overshoot a bit (4-5 GB?), but overshooting by several times is really outrageous. :sweat_smile:

| username: residentevil | Original post link

So, it is suspected to be a GC issue because the native RocksDB engine does not encounter such problems.

| username: Jellybean | Original post link

May I ask:

  • Are there other processes on the same machine using excessive memory?
  • Is it only one TiKV frequently experiencing OOM, or are multiple TiKVs experiencing OOM?
  • Is the OOM occurring periodically, or is there no time pattern?
  • Is there more than one TiKV instance deployed on a single machine? If so, what is the block cache parameter configured for each node? (See the SQL sketch after this list.)
  • Could you provide screenshots of the Grafana → TiKV Details → cluster → memory, Region, and CPU panels? We hope you can provide more information to help troubleshoot the issue.
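
For the per-node block cache question, SHOW CONFIG returns one row per TiKV instance, so a single statement shows the value configured on every node:

SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';
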
| username: Fly-bird | Original post link

Organize the data, balance the regions, and then observe.
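
To watch the balancing from SQL, something like the following works (a plain information_schema query, nothing extra needed); after dropping 70 TB there will also be many empty regions waiting to be merged, and the per-store REGION_COUNT should fall as they are merged:

SELECT STORE_ID, ADDRESS, LEADER_COUNT, REGION_COUNT, CAPACITY, AVAILABLE
FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS;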