Memory Limitation Issue in TiKV

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv内存限制问题

| username: zhanggame1

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version] 7.5
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact]
3 physical machines with 128GB memory each, running 3 PD, 3 TiDB, and 6 TiKV in a mixed deployment

Running stress test with tpcc:
tiup bench tpcc -H -P 4000 --db test --warehouses 20 --threads 500 --time 10m

Found that TiKV memory usage is very high, eventually causing the machine to freeze

I looked into it and found two parameters, storage.block-cache.capacity and memory-usage-limit, that can control TiKV memory. How are these two parameters generally set, and how does TiKV handle and control memory when usage exceeds the limit?

| username: zhanggame1 | Original post link

Finally, I found another issue. After the stress test ended, the memory usage of TiDB quickly decreased, but the memory usage of TiKV did not release much.

| username: Miracle | Original post link

TiDB most likely releases memory because it does not cache data; it frees the memory once the computation completes. TiKV, on the other hand, does cache data, so after the test the data stays in memory and is not cleared or released. If TiKV does not have enough memory to cache new data, it evicts some old data to make room for the new data. After running for a while, TiKV's memory usage should stabilize.

| username: Jellybean | Original post link

The usage of memory-usage-limit is related to the deployment architecture of the cluster and usually does not require additional settings. The parameter that needs to be focused on is block-cache.capacity.

This parameter value can typically be set using the following formula:
storage.block-cache.capacity = (MEM_TOTAL * 0.5 / number of TiKV instances)
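As a rough illustration of the formula above, the following Python sketch computes a per-instance value. The function name and the use of GB as the unit are assumptions made for this example, and "number of TiKV instances" is read here as the number of instances on one machine; the 0.5 factor comes from the formula itself.

```python
def block_cache_capacity_gb(mem_total_gb, tikv_instances, fraction=0.5):
    """Per-instance storage.block-cache.capacity from the rule of thumb above."""
    if tikv_instances < 1:
        raise ValueError("need at least one TiKV instance")
    return mem_total_gb * fraction / tikv_instances

# The setup from the original post: 128GB per machine, 2 TiKV per machine.
print(block_cache_capacity_gb(128, 2))  # 32.0
```

For the cluster in this thread that gives 32GB per TiKV instance, which is only a starting point when PD and TiDB share the same machines.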

Once this value is set, TiKV will fill the block cache up to this size and keep that memory occupied, using LRU to evict old entries as new data is cached. Because of this additional layer of caching, the efficiency of query tasks that access TiKV can be greatly improved.

If you still find this value too large, you can reduce block-cache.capacity as needed. This value can be adjusted dynamically online without restarting the TiKV cluster. After you modify the TiKV configuration item online, the system automatically updates the TiKV configuration file. However, you still need to make the same change with the tiup cluster edit-config command; otherwise, operations such as upgrade and reload will overwrite the online configuration change.

| username: zhanggame1 | Original post link

If a physical machine has 2 TiKVs, then going by the memory-usage-limit chapter, if each TiKV's memory is to be kept at 40G, I estimate block-cache.capacity should be 24G. I'm not sure if this is correct.
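One way to arrive at 24G is to assume that block-cache.capacity and memory-usage-limit keep the same proportion as their documented defaults (45% and 75% of system memory, respectively). That proportionality and the function name below are assumptions of this sketch; the thread itself does not spell out how the estimate was made.

```python
# Documented default shares of system memory. Treating them as a fixed
# ratio between the two settings is an assumption of this sketch.
BLOCK_CACHE_PERCENT = 45
MEMORY_USAGE_LIMIT_PERCENT = 75

def block_cache_for_limit(limit_gb):
    """Scale a target memory-usage-limit down to a block-cache.capacity."""
    return limit_gb * BLOCK_CACHE_PERCENT / MEMORY_USAGE_LIMIT_PERCENT

print(block_cache_for_limit(40))  # 24.0
```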

| username: oceanzhang | Original post link

TiKV uses a first-in, first-out (FIFO) approach and has a concept of a pool.

| username: zhanggame1 | Original post link

With block-cache.capacity set to 24G during stress testing, TiKV memory usage grew to very nearly 30G and then stopped growing.

30/24 = 1.25, so the maximum memory of TiKV can be taken as 1.25 times (5/4 of) block-cache.capacity.

| username: dba远航 | Original post link

block-cache.capacity is just a part of the memory, there are other components.

| username: zhanggame1 | Original post link

Once block-cache.capacity is set, the upper limit of TiKV memory usage is strictly controlled.

| username: wangccsy | Original post link

Do a good job with memory management, don’t try to limit memory usage.

| username: 有猫万事足 | Original post link

memory-usage-limit = block-cache.capacity + write-buffer-size * max-write-buffer-number

I roughly went through the code before. The memory-usage-limit is roughly checked this way.

Generally speaking, lowering the block-cache will definitely help control it.

| username: zhanggame1 | Original post link

Recently, I tested on version 7.5, with three TiKV instances deployed on one machine. The maximum memory usage of TiKV is 1.25 times the block-cache.capacity setting.

| username: tidb菜鸟一只 | Original post link

For three physical machines with 128GB memory each, deploying 3 PD, 3 TiDB, and 6 TiKV in a mixed setup, if each physical machine has 1 PD, 1 TiDB, and 2 TiKV, my suggestion is to set block-cache.capacity to around 15GB, calculated as 128/4*0.45. Setting it too high may affect other components.

| username: zhanggame1 | Original post link

In the end I set it to 24G, so each TiKV should peak at no more than 30G. Subtracting the 60G used by the 2 TiKVs from 128G still leaves about half of the memory available, so there should be no problem.
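The per-machine budget described here can be checked with a small Python sketch. The 1.25 factor is the ratio observed in the stress test earlier in this thread, not a guaranteed bound, and the function and parameter names are made up for this example.

```python
def machine_budget_gb(mem_total_gb, tikv_instances, block_cache_gb, overhead_ratio=1.25):
    """Return (peak TiKV memory, memory left for everything else) in GB."""
    tikv_peak = tikv_instances * block_cache_gb * overhead_ratio
    return tikv_peak, mem_total_gb - tikv_peak

# 128GB machine, 2 TiKV instances, block-cache.capacity = 24G each.
peak, remaining = machine_budget_gb(128, 2, 24)
print(peak, remaining)  # 60.0 68.0
```

The remaining 68GB is what PD, TiDB, and the operating system on the same machine would have to share under these assumptions.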