TiKV Memory Usage Keeps Increasing

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.1
[Reproduction Path]
1> Disable transparent huge pages
2> Set storage.block-cache.capacity = 48G
3> Set memory-usage-limit = 82G
[Encountered Problem: Phenomenon and Impact]
TiKV memory is limited to 82G, but observed cluster memory has already grown beyond 90G.

Why is the memory limit not taking effect?
Did you reload?

Is tidb_server_memory_limit set at the GLOBAL level?
Additionally, after setting this variable, when the memory usage of the tidb-server instance reaches 32 GB, TiDB will sequentially terminate the SQL operations with the highest memory usage until the memory usage of the tidb-server instance drops below 32 GB. The forcibly terminated SQL operations will return an error message Out Of Memory Quota! to the client.

Normally, you don’t need to set memory-usage-limit; you only need to set storage.block-cache.capacity. In that case the maximum memory usage of TiKV is limited to 5/3 * storage.block-cache.capacity. To check whether the storage.block-cache.capacity setting has taken effect, execute SHOW config WHERE NAME LIKE '%storage.block-cache.capacity%';
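
As a quick check of the 5/3 ratio mentioned above, here is a minimal Python sketch. It assumes the "G" in the settings means GiB, and the function name is made up for illustration; it is not part of TiKV.

```python
from fractions import Fraction

GIB = 1024 ** 3


def derived_memory_usage_limit(block_cache_capacity_bytes: int) -> int:
    """Return 5/3 of the block cache capacity, the ratio quoted in this thread."""
    # Fraction keeps the arithmetic exact before converting back to bytes.
    return int(Fraction(5, 3) * block_cache_capacity_bytes)


block_cache = 48 * GIB  # storage.block-cache.capacity from the original post
limit = derived_memory_usage_limit(block_cache)
print(f"derived limit: {limit / GIB:.0f} GiB")
```

With a 48G block cache this works out to 80G, so the 82G configured explicitly in the original post is slightly above what the ratio alone would give.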

All have taken effect, 49G.

The unlimited memory growth is related to TiKV, not TiDB.

Configure the limit parameters.

Are you looking at the total memory of the TiKV node? How much memory is the TiKV process occupying on the server?

Check the Prometheus monitoring for TiKV process memory usage. If the memory usage keeps increasing, there might be a memory leak issue. Additionally, you can check the TiKV log files to see if there are any warning or error logs indicating the presence of a memory leak.
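
One rough way to judge whether memory "keeps increasing" is to fit a straight line to the process memory samples and look at the slope. The sketch below assumes you have already exported (timestamp, bytes) pairs from Prometheus by whatever means you prefer; the function name and the sample data are hypothetical.

```python
from typing import Sequence, Tuple


def growth_rate_bytes_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of (unix_seconds, memory_bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    # Slope is in bytes per second; scale it to bytes per hour.
    return cov / var_t * 3600


# Made-up samples: memory grows by about 1 GiB per hour.
GIB = 1024 ** 3
example = [(0, 80 * GIB), (3600, 81 * GIB), (7200, 82 * GIB)]
print(f"{growth_rate_bytes_per_hour(example) / GIB:.2f} GiB/hour")
```

A slope that stays clearly positive over many hours, after the block cache has filled up, is a hint that something other than the cache is growing. It is not proof of a leak on its own.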

The memory of TiKV is mainly controlled by the parameter storage.block-cache.capacity, which determines the size of the entire block-cache. This parameter affects RocksDB reads.

Then there is write-buffer-size, and there might be several write-buffers. This parameter affects RocksDB writes.

You can check the number and size settings through the config:
show config where name like '%write%buffer%' and type='tikv';

These two sets of parameters are actually RocksDB settings. Besides RocksDB, TiKV also has a Rust layer wrapped around it, which occupies a certain amount of memory as well.
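
To make the RocksDB side of this concrete, here is a rough Python estimate that adds the block cache to an upper bound for the write buffers. This is only a sketch: it assumes each column family can hold up to write-buffer-size times its configured number of write buffers, the class and function names are made up, and the numbers are placeholders to be replaced with the output of the show config query above. The Rust part is deliberately left out because it is not covered by these settings.

```python
from dataclasses import dataclass
from typing import Iterable

GIB = 1024 ** 3
MIB = 1024 ** 2


@dataclass
class ColumnFamily:
    name: str
    write_buffer_size: int        # bytes per write buffer
    max_write_buffer_number: int  # how many write buffers may exist


def memtable_upper_bound(cfs: Iterable[ColumnFamily]) -> int:
    """Upper bound on write buffer memory across all column families."""
    return sum(cf.write_buffer_size * cf.max_write_buffer_number for cf in cfs)


def rough_rocksdb_budget(block_cache_capacity: int, cfs: Iterable[ColumnFamily]) -> int:
    """Block cache plus write buffers; excludes the Rust part entirely."""
    return block_cache_capacity + memtable_upper_bound(cfs)


# Placeholder values for illustration only; read the real ones from show config.
cfs = [
    ColumnFamily("default", 128 * MIB, 5),
    ColumnFamily("write", 128 * MIB, 5),
    ColumnFamily("lock", 128 * MIB, 5),
]
budget = rough_rocksdb_budget(48 * GIB, cfs)
print(f"rough RocksDB budget: {budget / GIB:.2f} GiB")
```

Whatever the process uses beyond this figure is memory that neither the block cache nor the write buffer settings account for.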

To put it simply, the Rust part’s memory cannot be controlled. On the RocksDB side, reducing the block-cache will definitely bring memory usage down, at the cost of read caching. For example, on my 4-core, 8 GB setup, the original block-cache was 3.5g. Reducing it to 2g stopped the TiKV memory alarms. Otherwise, memory usage above 80% would trigger constant alarms.

So, if you want to ensure memory usage is below 80%, you might as well set the block-cache to below 20g, and then gradually adjust it to your ideal level as the memory usage decreases.

Could you please advise what indications of memory leaks might appear in the TiKV logs?

You can check the ERROR-level entries in the log file to see if there are any OOM-related errors, or use the log search in the Dashboard to search all ERROR-level entries on the TiKV nodes.
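
If you prefer to scan the log file directly, a small Python filter along these lines can narrow things down. It assumes the usual bracketed log layout of `[time] [LEVEL] [source] [message]`; the keyword list is only a guess at what might be relevant, and the function name is hypothetical.

```python
import re
from typing import Iterable, Iterator, Sequence

# Matches "[<timestamp>] [<LEVEL>]" at the start of a line.
LEVEL_RE = re.compile(r"^\[[^\]]*\] \[(?P<level>[A-Z]+)\]")


def suspicious_lines(
    lines: Iterable[str],
    levels: Sequence[str] = ("ERROR", "WARN"),
    keywords: Sequence[str] = ("oom", "memory"),
) -> Iterator[str]:
    """Yield log lines at the given levels that mention any keyword."""
    for line in lines:
        match = LEVEL_RE.match(line)
        if match is None or match.group("level") not in levels:
            continue
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            yield line.rstrip("\n")


if __name__ == "__main__":
    import sys

    # Usage: python scan_log.py /path/to/tikv.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for hit in suspicious_lines(handle):
            print(hit)
```

An empty result does not rule out a leak; it only means nothing at those log levels mentioned memory.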

Under memory constraints

With TiKV’s storage.block-cache.capacity already limited, what could still cause a memory leak?

The resolved-ts module has a known issue causing continuous memory growth; see tikv/tikv issue #15458 on GitHub, "Resolver memory is not reclaimed and may cause OOM".

If you don’t use stale read, you can disable the resolved-ts module in the TiKV configuration:

[resolved-ts]
enable = false