How to Identify the Cause of High Memory Usage in TiKV?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV 内存占用过高,如何定位原因?

| username: residentevil

[TiDB Usage Environment] Production Environment
[TiDB Version] V7.1.0
[Encountered Problem: Problem Phenomenon and Impact] During a full data migration using DTS, with no read requests, TiKV memory usage kept increasing (checked by logging into the TiKV server and running `top`). Modifying the `storage.block-cache.capacity` configuration reduced memory usage somewhat, but the issue is still not resolved. How can we identify what is occupying this memory?

| username: 有猫万事足 | Original post link

On the TiDB Dashboard, open the Instance Profiling - Manual Profiling page and select heap.

* Heap: Memory usage overhead of various internal functions on TiDB and PD instances.

There are similar graphs like the one below:

For the memory usage of each slow SQL statement, check the slow query log.

Fields related to memory usage:

* `Mem_max`: Indicates the maximum memory space used by TiDB during execution, in bytes.

| username: MrSylar | Original post link

In Grafana, the TiKV-Details → Server → Memory trace panel can identify which components are consuming memory.

| username: residentevil | Original post link

I have checked it, but the information there is incomplete.

| username: MrSylar | Original post link

It seems there is no good solution at the moment. For TiKV, Top SQL mainly focuses on CPU rather than memory.

| username: Billmay表妹 | Original post link

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of that page. Let’s check your resource configuration.

| username: WalterWj | Original post link

TiKV memory is resident. In theory, with the default configuration, a single TiKV instance on a dedicated machine will grow to about 80% of the server’s memory and then stop increasing.

| username: residentevil | Original post link

During the full data load, TiKV hit OOM multiple times. Memory usage kept increasing even though there were no read requests.

| username: WalterWj | Original post link

TiKV generally won’t OOM. Either you have multiple instances deployed without memory configuration, or the memory configuration is unreasonable, or it’s a TiKV bug causing a memory overflow.
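
To illustrate the multi-instance case, here is a minimal Python sketch of the sizing arithmetic. It assumes the documented default of `storage.block-cache.capacity` being about 45% of total system memory per instance. The function names and the 64 GiB host are hypothetical, and the block cache is only one of several memory consumers in TiKV, so treat the numbers as a rough estimate.

```python
# Assumption: each TiKV instance sizes its block cache to ~45% of the
# host's total memory unless storage.block-cache.capacity is set.
DEFAULT_BLOCK_CACHE_RATIO = 0.45


def default_total_block_cache_gib(total_mem_gib: float, instances: int) -> float:
    """Block cache the instances would claim together if left at defaults."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return total_mem_gib * DEFAULT_BLOCK_CACHE_RATIO * instances


def block_cache_per_instance_gib(total_mem_gib: float, instances: int,
                                 ratio: float = DEFAULT_BLOCK_CACHE_RATIO) -> float:
    """Split one block cache budget (ratio * host memory) across instances."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return total_mem_gib * ratio / instances


if __name__ == "__main__":
    host_gib = 64.0  # hypothetical host size
    for n in (1, 2, 3):
        claimed = default_total_block_cache_gib(host_gib, n)
        suggested = block_cache_per_instance_gib(host_gib, n)
        print(f"{n} instance(s): defaults claim {claimed:.1f} GiB in total, "
              f"explicit per-instance capacity would be {suggested:.1f} GiB")
```

With two instances on the hypothetical 64 GiB host, the defaults would let the block cache alone claim more memory than a single instance is expected to use in total, which is why an explicit per-instance capacity matters in that layout.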

| username: 有猫万事足 | Original post link

My cluster has very low specs (4 cores, 8 GB), and I have run into this situation. Several TiKV instances took turns crashing, but after I set up resource control in version 7.1, it has been running smoothly.

| username: residentevil | Original post link

What is resource control?

| username: residentevil | Original post link

For security reasons, we have disabled access to the dashboard :sweat_smile:

| username: 有猫万事足 | Original post link

New feature in version 7.1.

| username: residentevil | Original post link

Some new features are a bit intimidating to use, haha.

| username: 有猫万事足 | Original post link

It’s not complicated to use. I’m not sure whether your issue comes from the configuration or from a mixed deployment.

My issue is definitely due to insufficient hardware. The recommended configuration in the documentation doesn’t even cover a setup as low-end as mine, and it would be unusable without this feature. This is just a record; you can take a look if you’re interested.

| username: residentevil | Original post link

Okay, thanks for your hard work. I’ll try to track it down again.

| username: residentevil | Original post link

Is there a detailed explanation of the parameter `storage.block-cache.strict-capacity-limit`? I couldn’t find it in the official documentation. Can we strictly limit TiKV memory usage by using this parameter together with `storage.block-cache.capacity`?

| username: 有猫万事足 | Original post link

Not very optimistic, but you can try it in a test environment. This parameter is just a true/false setting. It seems to be directly passed into the RocksDB settings.

Then, in RocksDB, I found this issue:

The gist is that if this parameter is set and the cache is full, it won’t be able to insert anymore and might immediately throw an error instead of pretending nothing happened and returning normally. This issue is still open. :joy:
Since RocksDB will explicitly return an error in that case, it’s unclear whether TiKV does anything to handle this situation.
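
As a rough illustration of the behavior described above, here is a toy Python model of a capacity-limited cache. It is not RocksDB or TiKV code: the class, the exception, and the sizes are all hypothetical, and it only mimics the idea that pinned (in-use) blocks cannot be evicted, so a non-strict cache can exceed its capacity while a strict one rejects the insert.

```python
from collections import OrderedDict


class CacheFull(Exception):
    """Raised by the toy cache when a strict limit would be exceeded."""


class ToyBlockCache:
    """Toy model only; keys are assumed to be unique."""

    def __init__(self, capacity: int, strict: bool):
        self.capacity = capacity
        self.strict = strict
        self.entries = OrderedDict()  # key -> (size, pinned), oldest first
        self.usage = 0

    def insert(self, key: str, size: int, pinned: bool = False) -> None:
        # Evict unpinned entries, oldest first, until the new entry fits.
        for old_key in list(self.entries):
            if self.usage + size <= self.capacity:
                break
            old_size, old_pinned = self.entries[old_key]
            if not old_pinned:
                del self.entries[old_key]
                self.usage -= old_size
        # Still over capacity: a strict cache refuses, a loose one overshoots.
        if self.usage + size > self.capacity and self.strict:
            raise CacheFull(f"cannot insert {key!r}: cache is full")
        self.entries[key] = (size, pinned)
        self.usage += size


if __name__ == "__main__":
    loose = ToyBlockCache(capacity=10, strict=False)
    loose.insert("a", 6, pinned=True)
    loose.insert("b", 6, pinned=True)
    print("loose usage:", loose.usage, "of", loose.capacity)

    strict = ToyBlockCache(capacity=10, strict=True)
    strict.insert("a", 6, pinned=True)
    try:
        strict.insert("b", 6, pinned=True)
    except CacheFull as err:
        print("strict insert rejected:", err)
```

Even in the strict case, the limit in this model covers only the cache itself. Memory used outside the block cache is not bounded by it, so this setting alone would not cap total process memory.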

| username: residentevil | Original post link

When I run show config for `storage.block-cache.shared`, it returns null. Does this value represent true or false? :sweat_smile:

| username: 有猫万事足 | Original post link

:rofl: This is a bit unfair.