How to Identify the Cause of High Memory Usage in TiKV?

translator_bot · June 21, 2024, 10:43pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV 内存占用过高，如何定位原因？

| username: residentevil

[TiDB Usage Environment] Production environment
[TiDB Version] v7.1.0
[Encountered Problem: Problem Phenomenon and Impact] During a full data migration with DTS, with no read requests on the cluster, TiKV memory usage kept increasing (observed by logging in to the TiKV server and running `top`). Lowering `storage.block-cache.capacity` reduced memory usage somewhat but did not resolve the problem. How can we find out what is occupying this memory?
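
To put a number on how fast the process grows instead of watching `top`, one option is to sample the resident set size (RSS) of the TiKV process at intervals. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` is readable; the function names are illustrative and not part of any TiDB tooling.

```python
import time


def parse_vm_rss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB".
            return int(line.split()[1])
    raise ValueError("VmRSS not found in status text")


def growth_kb_per_min(samples):
    """Average RSS growth between the first and last (timestamp_s, rss_kb) sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need two samples taken at different times")
    return (r1 - r0) / ((t1 - t0) / 60.0)


def sample_rss(pid, interval_s=10.0, count=6):
    """Read the RSS of `pid` `count` times, sleeping `interval_s` between reads."""
    samples = []
    for i in range(count):
        with open(f"/proc/{pid}/status") as f:
            samples.append((time.time(), parse_vm_rss_kb(f.read())))
        if i < count - 1:
            time.sleep(interval_s)
    return samples


# Example (the PID is a placeholder for the TiKV process ID on your host):
#   print(growth_kb_per_min(sample_rss(12345)))
```

A steady positive growth rate during the load, with the block cache already capped, suggests the memory is going somewhere other than the block cache.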

| username: 有猫万事足 | Original post link

On the Instance Performance Analysis - Manual Analysis page of TiDB Dashboard, select heap.

* Heap: the memory consumed by each internal function on TiDB and PD instances.

The output is a graph of memory usage per function (the example image from the original post is not included here).

For the memory usage of individual slow SQL statements, check the memory-related field in the slow query log:

* `Mem_max`: the maximum memory used by TiDB while executing the statement, in bytes.
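
As a rough illustration of using `Mem_max`, the sketch below pulls the largest values out of slow log text. It assumes each entry carries a line of the form `# Mem_max: <bytes>`; the function name `top_mem_max` is made up for this example, and the log layout should be checked against your TiDB version.

```python
import re

# Assumed slow log line format: "# Mem_max: 8256"
MEM_MAX_RE = re.compile(r"^# Mem_max: (\d+)$", re.MULTILINE)


def top_mem_max(slow_log_text, n=5):
    """Return the n largest Mem_max values (in bytes) found in the slow log text."""
    values = [int(v) for v in MEM_MAX_RE.findall(slow_log_text)]
    return sorted(values, reverse=True)[:n]
```

Note that `Mem_max` describes memory used on the TiDB side, so it helps rule statements in or out rather than explain TiKV's own memory.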

| username: MrSylar | Original post link

In Grafana, the TiKV-Details → Server → Memory trace panel can show which components are consuming memory.

| username: residentevil | Original post link

I have checked it, but the information there is incomplete.

| username: MrSylar | Original post link

It seems there is no good way to do this at the moment. TiKV's diagnostics mainly revolve around Top SQL, which focuses on CPU.

| username: Billmay表妹 | Original post link

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of that page. Let’s check your resource configuration.

| username: WalterWj | Original post link

TiKV keeps its memory resident. In theory, with no extra configuration, a single instance on a dedicated machine will grow to about 80% of the server’s memory and then stop increasing.

| username: residentevil | Original post link

During the full data load, TiKV ran out of memory (OOM) several times. Memory usage kept increasing even though there were no read requests.

| username: WalterWj | Original post link

TiKV generally won’t OOM. Either you have multiple instances deployed without memory configuration, or the memory configuration is unreasonable, or it’s a TiKV bug causing a memory overflow.
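
The "multiple instances without memory configuration" case can be illustrated with simple arithmetic. By default each TiKV instance sizes `storage.block-cache.capacity` as a fraction of the host's total memory (45% according to the TiKV configuration documentation), so several unconfigured instances on one host can together claim more than the host has. The sketch below only does that arithmetic; the helper names are hypothetical, and the fraction should be verified for your version.

```python
def combined_default_block_cache_gib(host_mem_gib, instances, default_fraction=0.45):
    """Total block cache if every instance keeps a default sized from the whole host."""
    return host_mem_gib * default_fraction * instances


def per_instance_capacity_gib(host_mem_gib, instances, budget_fraction=0.45):
    """Split a single host-wide block cache budget evenly across instances."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return host_mem_gib * budget_fraction / instances
```

For example, three unconfigured instances on a 64 GiB host would budget about 86 GiB of block cache in total, while splitting one 45% budget gives roughly 9.6 GiB per instance. The block cache is only one part of TiKV's memory, so the real footprint is higher than these figures.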

| username: 有猫万事足 | Original post link

My cluster is very low-spec (4 cores, 8 GB of memory), and I have run into the same situation: several TiKV instances took turns crashing. After I set up resource control in v7.1, it has been running smoothly.

| username: residentevil | Original post link

What is resource control?

| username: residentevil | Original post link

For security reasons, we have disabled access to the dashboard.

| username: 有猫万事足 | Original post link

New feature in version 7.1.

| username: residentevil | Original post link

Some new features are a bit intimidating to use, haha.

| username: 有猫万事足 | Original post link

It’s not complicated to use. In your case, I’m not sure whether the cause is the configuration or a mixed deployment.

My issue is definitely due to insufficient hardware. The recommended configuration in the documentation doesn’t even cover a setup as low-end as mine, and the cluster is unusable without this feature. This is just my own record; you can take a look if you’re interested.

| username: residentevil | Original post link

Okay, thanks for your help. I’ll keep investigating.

| username: residentevil | Original post link

Is there a detailed explanation of the `storage.block-cache.strict-capacity-limit` parameter? I couldn’t find it in the official documentation. Can we strictly limit TiKV memory usage by using it together with `storage.block-cache.capacity`?

| username: 有猫万事足 | Original post link

I’m not very optimistic, but you can try it in a test environment. The parameter is just a true/false switch, and it appears to be passed straight through to the RocksDB settings.

Then, in RocksDB, I found this issue:

github.com/facebook/rocksdb: Avoid exception if fail to insert block to block-cache
(opened 08:56AM - 17 Aug 21 UTC by Myasuka, labeled "design discussion")

### Expected behavior

If fail to insert block to block-cache, just abort the insert but return the result as normal.

### Actual behavior

If we use default `ReadOptions` with `strict_capacity_limit` block cache, RocksDB would throw exception if `Insert failed due to LRU cache being full`.

### Steps to reproduce the behavior

This is because current `RetrieveBlock` would [check the status of `MaybeReadBlockAndLoadToCache`](https://github.com/facebook/rocksdb/blob/add68bd28a512da751e2bdc612685fdeb7e6dde4/table/block_based/block_based_table_reader.cc#L1909-L1916):

~~~c
s = MaybeReadBlockAndLoadToCache(
    prefetch_buffer, ro, handle, uncompression_dict, wait_for_cache,
    block_entry, block_type, get_context, lookup_context,
    /*contents=*/nullptr);

if (!s.ok()) {
  return s;
}
~~~

And if fail to insert, the block cache would report [incomplete status](https://github.com/facebook/rocksdb/blob/add68bd28a512da751e2bdc612685fdeb7e6dde4/cache/lru_cache.cc#L326-L339):

~~~c
if ((usage_ + total_charge) > capacity_ &&
    (strict_capacity_limit_ || handle == nullptr)) {
  e->SetInCache(false);
  if (handle == nullptr) {
    // Don't insert the entry but still return ok, as if the entry inserted
    // into cache and get evicted immediately.
    last_reference_list.push_back(e);
  } else {
    if (free_handle_on_fail) {
      delete[] reinterpret_cast<char*>(e);
      *handle = nullptr;
    }
    s = Status::Incomplete("Insert failed due to LRU cache being full.");
  }
~~~

As more and more applications move on cloud, and fine-granularity memory control is required in many use cases. I think enabling the strict capacity limit of block cache in production environment should be something valuable.

How about this solution:

1. Introduce a new status as `FailInsertCache`.
1. Introduce a new field named `fill_cache_if_possible` in ReadOptions, which means if cannot fill block to cache, we would not treat `FailInsertCache` status as a problem, and continue the read process as normal.

What do you think of this problem and the proposed solution?

The gist is that when this parameter is set and the cache is full, RocksDB cannot insert the block and may return an error immediately, instead of carrying on and returning the read result as normal. The issue is still open.
Since RocksDB clearly reports an error in this case, what remains unclear is whether TiKV does anything to handle it.
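
To make the behavior described in the issue concrete, here is a toy Python model of that capacity check. It is not RocksDB's implementation: `ToyLRUCache`, its `pinned` flag, and the string statuses are invented for illustration. It only mirrors the rule that, once unpinned entries have been evicted, an insert that still does not fit fails under a strict limit and is admitted over capacity otherwise.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Toy model of a block cache with an optional strict capacity limit."""

    def __init__(self, capacity, strict_capacity_limit):
        self.capacity = capacity
        self.strict = strict_capacity_limit
        self.usage = 0
        self.entries = OrderedDict()  # key -> (charge, pinned), oldest first

    def insert(self, key, charge, pinned=True):
        """Insert a new key (keys are assumed unique) and return a status string."""
        # Evict unpinned entries, oldest first, until the new entry fits.
        for old_key in list(self.entries):
            if self.usage + charge <= self.capacity:
                break
            old_charge, old_pinned = self.entries[old_key]
            if not old_pinned:
                del self.entries[old_key]
                self.usage -= old_charge
        if self.usage + charge > self.capacity and self.strict:
            # Stands in for Status::Incomplete in the quoted snippet.
            return "Incomplete"
        # Without the strict limit the entry is admitted even over capacity.
        self.entries[key] = (charge, pinned)
        self.usage += charge
        return "OK"

    def release(self, key):
        """Unpin an entry so that later inserts may evict it."""
        charge, _ = self.entries[key]
        self.entries[key] = (charge, False)
```

In this model the strict cache never exceeds its capacity, but a read that needs a new block while every cached block is still in use gets `"Incomplete"`; the non-strict cache accepts the block and overshoots. That trade-off is what the issue is about.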

| username: residentevil | Original post link

When I run SHOW CONFIG for `storage.block-cache.shared`, it returns null. Does that mean true or false?

| username: 有猫万事足 | Original post link

This is a bit unfair.