Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv采集完数据后内存为什么一直不释放 (Why is TiKV memory not released after data collection finishes?)
[Test Environment for TiDB] Single machine simulation cluster
[TiDB Version]
[Reproduction Path]
[Encountered Issue: Phenomenon and Impact]
When using DataX to collect a single table's data into TiDB, a test with 50 million records ran into OOM because the capacity was set too large, so I limited it to around 30 million records, which can be collected normally. After the collection finishes, each TiKV occupies about 8GB of memory, and it is still not released the next day. The TiKV logs show no memory-reclamation errors. If I need to go on collecting data from other tables, I have to restart TiKV first. Why doesn't the memory drop after collection is done? What is this memory used for during collection, and when will it be released?
[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
As long as storage.block-cache.capacity is set to a reasonable value, the data cached in TiKV will simply be replaced by newer data over time. That said, your resources are too tight; for this workload a standalone MySQL is recommended instead.
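For reference, a rough sketch of what that adjustment can look like. Assuming TiDB v4.0 or later, the block cache capacity can also be changed online from the MySQL client; the 4GiB value below is only a placeholder, size it to the memory actually free on your host:

-- shrink the shared block cache on every TiKV instance at once
SET CONFIG tikv `storage.block-cache.capacity` = '4GiB';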
He is testing a single-machine simulated cluster. However, if you want to test performance, you should still use a cluster, as a single machine cannot reflect the true performance.
An OOM means that the memory your TiKV instances are allowed to use exceeds the memory the machine actually has. The data you write will inevitably end up in the cache.
In a single-machine simulated cluster, if storage.block-cache.capacity is left at its default, each of the 3 TiKV nodes will use 45% of the host memory for its block cache, which on a single machine is bound to cause OOM (Out of Memory).
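As a rough illustration, capping each instance explicitly in the tiup topology file would look something like the fragment below; the host address, ports, data directories, and the 2GB value are placeholders, not a recommendation:

tikv_servers:
  - host: 192.168.0.10        # all three instances share one host
    port: 20160
    status_port: 20180
    data_dir: /data/tikv-20160
    config:
      storage.block-cache.capacity: "2GB"
  - host: 192.168.0.10
    port: 20161
    status_port: 20181
    data_dir: /data/tikv-20161
    config:
      storage.block-cache.capacity: "2GB"
  - host: 192.168.0.10
    port: 20162
    status_port: 20182
    data_dir: /data/tikv-20162
    config:
      storage.block-cache.capacity: "2GB"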
TiKV will not release it. Once you set the capacity, it acts as an upper bound: TiKV generally keeps the cache from exceeding that bound, but it does not proactively release the cached data afterwards.
The part that doesn’t release memory is mainly the block cache of RocksDB.
The LSM-tree data structure is write-friendly, turning random writes into sequential writes. However, it is not read-friendly, especially for range scans, which may have to look through many levels, causing read amplification.
The existence of the block cache is to balance the performance between reads and writes.
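If you want to confirm what limit each TiKV instance is actually running with, one way (assuming TiDB v4.0 or later, run from the MySQL client) is:

-- show the effective block cache capacity reported by each TiKV instance
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';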
For this deployment environment, the memory behaviour is as expected. If the memory is not released, you can use other means, for example:
sync; sync; sync                      # flush dirty pages to disk first, to prevent data loss
sleep 20
echo 1 > /proc/sys/vm/drop_caches     # drop the page cache
echo 2 > /proc/sys/vm/drop_caches     # drop dentries and inodes
echo 3 > /proc/sys/vm/drop_caches     # drop page cache, dentries and inodes (run as root)
Even with a real cluster deployment rather than a single-machine simulation, won't TiKV memory also fill up when the amount of collected data grows? And once it is full, won't that also affect the service?
There will be a limit, which is evaluated based on the host memory at installation time. In a mixed deployment, if each component claims memory as if TiKV were deployed on its own host, memory will naturally be insufficient, and the parameters have to be adjusted manually.
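A rough sketch of that manual adjustment with a tiup-managed cluster; the cluster name tidb-test is a placeholder:

# open the topology and lower storage.block-cache.capacity under server_configs -> tikv
tiup cluster edit-config tidb-test
# push the change out and restart only the TiKV nodes
tiup cluster reload tidb-test -R tikv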
The page cache has been turned off. After setting capacity=6g for each TiKV, I collected the remaining 30 million records, and each TiKV indeed stayed under 6g. But today, both adding an index to a large table and syncing incremental data pushed TiKV memory back up to 9g, and the server crashed. In the end the index was never added, and restarting the cluster did not release the memory. I don't know what is consuming this memory.
With capacity=6, total memory usage can still reach roughly 9; this parameter is only a cache configuration, not a limit on TiKV's overall memory.
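If you want to see where the extra memory sits, a simple first check is to compare each TiKV process's resident memory on the host with the configured cache size (assuming the processes are named tikv-server):

# resident memory (RSS, in kB) of every TiKV process on this host
ps -C tikv-server -o pid,rss,cmd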