TiKV Memory Usage Keeps Increasing Slowly

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV内存一直缓慢上涨

| username: 搬砖er

【TiDB Usage Environment】Production Environment
【TiDB Version】4.0.2
【Reproduction Path】N/A
【Encountered Problem: Phenomenon and Impact】
Phenomenon: TiKV is used alone to store data (raw format) online, without using TiDB. Cluster size: 3 PD, 15 TiKV nodes.
After running for a period of time, the memory of each TiKV node slowly increases and exceeds the block-cache configuration value by a lot. The block-cache is configured for 20GB, but the actual TiKV memory usage is over 40GB. After switching the PD leader, the TiKV memory immediately recovers, but it continues to rise after a while. There are a large number of “leader changed” and “operator timeout” logs in the PD logs, as seen in the attachments.
How should this memory increase issue be resolved?

【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】
TiKV memory (screenshot)

RocksDB block cache (screenshot)

PD monitoring (screenshot)

PD logs (screenshot)

TiKV memory flame graph (screenshot)

| username: Billmay表妹 | Original post link

Very few people in the community use raw KV, so it is generally recommended to use TiDB.

Because there are so few practical raw KV scenarios in the community, there are basically no community members who can help you troubleshoot this.

When you run into issues, you will largely have to rely on yourself.

| username: Billmay表妹 | Original post link

Based on your description, the memory usage of TiKV nodes gradually increases over time and exceeds the block-cache configuration value. When switching the PD leader, the memory of TiKV immediately recovers, but it continues to rise again after a period. Meanwhile, there are many “leader changed” and “operator timeout” logs in the PD logs.

This situation may be due to the data write speed of the TiKV node exceeding its processing capacity, leading to continuous memory usage increase. To address this issue, you can consider the following optimizations:

  1. Adjust TiKV configuration parameters: You can try tuning some TiKV configuration parameters to improve its processing capacity and memory efficiency. For example, raftstore.apply-pool-size controls the concurrency of the Raft apply thread pool, and raftstore.store-pool-size controls the concurrency of the raftstore thread pool. You can also adjust parameters such as raftstore.raft-entry-max-size and raftstore.raft-log-gc-threshold to limit the size of Raft entries and how quickly Raft logs are garbage-collected (a configuration sketch follows this list).

  2. Check for hotspot data and queries: Through monitoring and log analysis, determine if there are hotspot data and frequent queries. If there are hotspot data, you can consider using TiKV’s Region Split feature to distribute the hotspot data across multiple Regions to reduce the load on a single TiKV node. If there are frequent queries, you can optimize the query statements or increase the number of TiKV nodes to share the load.

  3. Check hardware resources: Ensure that the servers where TiKV nodes are located have sufficient hardware resources, including CPU, memory, and disk. Insufficient hardware resources may lead to performance degradation and increased memory usage of TiKV nodes. You can use monitoring tools to check the resource usage of the servers and upgrade or optimize the hardware as needed.

  4. Upgrade TiKV version: If you are using an older version of TiKV, consider upgrading to the latest stable version. Each version of TiKV includes performance and stability improvements that may help with the issues you are encountering.
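
For reference, here is a minimal tikv.toml sketch of the knobs mentioned in item 1, together with the block-cache limit from the original post. The values below are placeholders (mostly the 4.0 defaults), not tuning recommendations; verify them against the TiKV configuration documentation for your exact version.

```toml
# Illustrative tikv.toml fragment -- placeholder values, not recommendations.
[storage.block-cache]
# Shared RocksDB block cache (20GB in the original post).
capacity = "20GB"

[raftstore]
# Threads that apply committed Raft entries (the "apply" pool).
apply-pool-size = 2
# Threads that process Raft messages (the "store" pool).
store-pool-size = 2
# Maximum size of a single Raft entry.
raft-entry-max-size = "8MB"
# Number of residual Raft log entries that triggers log GC.
raft-log-gc-threshold = 50
```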

Additionally, regarding the “leader changed” and “operator timeout” logs in PD, this may be due to heavy load on the PD cluster or network issues. You can check the resource usage and network connection status of PD nodes to ensure the PD cluster is running normally.
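
If you want to check this from the command line rather than Grafana, a few pd-ctl queries can show the PD members, their health, and the state of every TiKV store. A minimal sketch, assuming a pd-ctl binary from the matching release and a PD endpoint at 127.0.0.1:2379 (replace with your own address; older pd-ctl builds may need the -d flag for single-command mode):

```shell
pd-ctl -u http://127.0.0.1:2379 member   # PD members and the current PD leader
pd-ctl -u http://127.0.0.1:2379 health   # health status of each PD member
pd-ctl -u http://127.0.0.1:2379 store    # per-store state, leader/region counts and scores
```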

Finally, it is recommended to perform performance analysis and monitoring before making any adjustments to better understand the system bottlenecks and optimization directions.

I hope this information is helpful to you. If you have any further questions, please feel free to ask.

| username: 搬砖er | Original post link

Is it normal for the PD monitoring to show a large number of remove-extra-replica events when the TiKV cluster has not undergone any scaling operations?

| username: Jellybean | Original post link

The cluster itself generates such operators when it performs hotspot scheduling and data balancing. However, if they appear in large numbers, you should check whether there is a serious hotspot problem.
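
To see whether these operators come from normal balancing or from a hotspot, you can inspect the operators PD is currently generating and the hot-region statistics. A sketch using pd-ctl, with the PD address as a placeholder:

```shell
pd-ctl -u http://127.0.0.1:2379 operator show    # operators currently pending or running
pd-ctl -u http://127.0.0.1:2379 hot write        # current write hotspots per store
pd-ctl -u http://127.0.0.1:2379 hot read         # current read hotspots per store
pd-ctl -u http://127.0.0.1:2379 scheduler show   # schedulers that are currently enabled
```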

| username: hey-hoho | Original post link

It looks like regions are frequently undergoing elections, and there are also move peer timeouts. Please check if there are any communication issues between the TiKV nodes.

| username: 像风一样的男子 | Original post link

Have you installed Grafana monitoring? You can check the region scheduling status.

| username: Kongdom | Original post link

It looks like the leader is constantly switching.

| username: xfworld | Original post link

  1. The version is fairly old; if conditions permit, upgrading should be the priority.
  2. It could be a known bug, but the version is too old to verify this easily.
  3. Since TiDB is not used, you need to trigger compaction manually to free up space; check whether this has been done (see the tikv-ctl sketch below).
  4. Frequent leader elections: could an environmental issue be causing this? It is recommended to check the network.
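
If manual compaction does turn out to be necessary, it can be triggered per node with tikv-ctl. A hedged sketch, assuming the default TiKV service port and a tikv-ctl binary from the same 4.0 release; run it during off-peak hours:

```shell
# Compact the default and write column families of the kv RocksDB on one node
# (host/port are placeholders; raw KV data lives mainly in the default CF).
tikv-ctl --host 127.0.0.1:20160 compact -d kv -c default
tikv-ctl --host 127.0.0.1:20160 compact -d kv -c write

# Or compact every node in the cluster via PD:
tikv-ctl --pd 127.0.0.1:2379 compact-cluster -d kv -c default
```
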
| username: 我是咖啡哥 | Original post link

Impressive, you don't even need TiDB.
The official documentation doesn't really describe this usage, right? Did you work it out by studying the source code? What are the advantages of using it this way, or what is the goal behind it? Hoping the expert can enlighten us!

| username: zhanggame1 | Original post link

Try upgrading to a newer version.

| username: Kongdom | Original post link

:yum: Currently, bare TiKV is supported, but bare TiFlash is not yet supported. Looking forward to bare TiFlash.

| username: 搬砖er | Original post link

Business requirements do not need SQL.

| username: 搬砖er | Original post link

Upgrading requires restarting TiKV. Currently, there is an issue where the leader does not migrate back to the restarted node after TiKV restarts, so we are hesitant to upgrade.

| username: xfworld | Original post link

The region leader can be manually specified, and this operation is not complicated. But the question is: why do you have to specify the region leader? Why not let the system automatically schedule it?
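
For reference, a manual leader transfer is a single pd-ctl operator. A sketch with placeholder Region and store IDs, which you would first look up with the `region` and `store` commands:

```shell
# Look up the target Region and the candidate stores (IDs below are placeholders).
pd-ctl -u http://127.0.0.1:2379 region 1000
pd-ctl -u http://127.0.0.1:2379 store

# Transfer the leader of Region 1000 to its peer on store 5.
pd-ctl -u http://127.0.0.1:2379 operator add transfer-leader 1000 5
```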

| username: 搬砖er | Original post link

Without manual intervention, after a TiKV node restarts, the automatic scheduling that should rebalance leaders across the nodes does not seem to take effect.

| username: xfworld | Original post link

Then that needs to be investigated. Another point is that how quickly resources rebalance across nodes depends heavily on the scheduling capability…

You should first confirm whether the scheduling is actually being executed…
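
One quick way to check whether leader balancing is actually being scheduled, again with pd-ctl and a placeholder PD address:

```shell
pd-ctl -u http://127.0.0.1:2379 scheduler show                 # balance-leader-scheduler should appear here
pd-ctl -u http://127.0.0.1:2379 operator show leader           # leader operators PD has generated
pd-ctl -u http://127.0.0.1:2379 config show | grep -i leader   # e.g. leader-schedule-limit
```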

| username: 路在何chu | Original post link

The version is too old; upgrade it. There isn't much else worth investigating.

| username: 路在何chu | Original post link

It's fine to deploy a TiDB node anyway; you can just leave it there unused.

| username: 芮芮是产品 | Original post link

Almost no one understands your requirements.