TiKV Node Memory Keeps Increasing During CDC Synchronization of Tables with Tens of Billions of Rows

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 百亿表CDC同步致TiKV节点内存一直涨问题

| username: jiaxin

Background

A high-TPS TiDB cluster has two tables, one with 22.6 billion rows and the other with 2.1 billion rows, that are synchronized by CDC. On the night of the 6th, archiving the 22.6-billion-row table caused the TiKV machine's memory usage to increase by 5G. The TiKV machine's memory usage has been rising steadily over the past 30 days.

Memory usage of a certain TiKV storage node machine

Analysis

Version Information

TiDB cluster version: v5.1.4

CDC release version: v5.1.4

TiKV machine configuration: 3.7T NVMe disk

Parameter Configuration

  • Each TiKV storage node’s block-cache is set to 12G (64G memory machine)

  • Each storage node machine runs only TiKV, and the TiKV process itself uses 76% of the machine's memory

  • Transparent huge pages on TiKV storage node machines have always been disabled

cat /sys/kernel/mm/transparent_hugepage/enabled

always madvise [never]
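The kernel marks the active transparent huge pages mode with square brackets, so the output above means `never` is in effect. As a minimal sketch for checking this across many storage nodes, the following hypothetical helper (`active_thp_mode` is not part of any TiDB tooling) extracts the bracketed value from the file contents:

```python
import re


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from the contents of
    /sys/kernel/mm/transparent_hugepage/enabled."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


# Sample input copied from the output shown in this post.
print(active_thp_mode("always madvise [never]"))  # never
```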

Characteristics of TiDB CDC Synchronized Tables

  • Streaming-style workload with high-TPS writes in short bursts and far more writes than reads (40,000 to 50,000 rows inserted per second at peak)

  • Batch inserts of 100 to 200 rows each

  • On the 6th, CDC wrote up to about 42,000 changed rows to the downstream; on the 7th, after CDC was migrated to a high-memory machine, it wrote up to 78,000
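For a rough sense of the write pattern, the peak insert rate and the batch size above can be combined into an approximate number of insert statements per second. This is only back-of-the-envelope arithmetic, assuming the 40,000 to 50,000 figure counts rows and that every insert carries 100 to 200 rows; `batches_per_second` is a hypothetical helper name:

```python
def batches_per_second(rows_per_second: int, rows_per_batch: int) -> float:
    """Approximate number of batch inserts per second for a given row rate."""
    return rows_per_second / rows_per_batch


# Lowest estimate: slowest peak rate with the largest batches.
low = batches_per_second(40_000, 200)
# Highest estimate: fastest peak rate with the smallest batches.
high = batches_per_second(50_000, 100)
print(low, high)  # 200.0 500.0
```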

Main Memory Components of TiKV Process

The block_cache is normally the main component, but it is currently 12G, far below the 50G used by the TiKV process.

Grafana monitoring (CDC-TiKV) indicates that other, CDC-related components on the TiKV nodes consume a large portion of memory:

  • process_resident_memory_bytes - tikv_engine_block_cache_size_bytes (about 34G per TiKV node): a large portion of the 50G used by the TiKV process is not block_cache, i.e., it is attributed to the TiKV CDC components

Non-block_cache memory consumption of TiKV nodes
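The subtraction behind this panel can be sketched offline as follows. This is an illustration only: it assumes the block cache metric is reported per column family and must be summed per instance, and the instance name and byte values are made up rather than taken from this cluster.

```python
from collections import defaultdict

GIB = 1024 ** 3


def non_block_cache_bytes(rss_by_instance, block_cache_samples):
    """Subtract the summed block cache size from the resident memory of each instance.

    rss_by_instance: {instance: process_resident_memory_bytes}
    block_cache_samples: {(instance, cf): tikv_engine_block_cache_size_bytes}
    """
    cache_total = defaultdict(int)
    for (instance, _cf), size in block_cache_samples.items():
        cache_total[instance] += size
    return {
        instance: rss - cache_total[instance]
        for instance, rss in rss_by_instance.items()
    }


# Hypothetical samples for a single node (not measurements from the post).
rss = {"tikv-1": 40 * GIB}
cache = {("tikv-1", "default"): 6 * GIB, ("tikv-1", "write"): 4 * GIB}
result = non_block_cache_bytes(rss, cache)
print(result["tikv-1"] // GIB)  # 30
```

Note that this remainder covers everything outside the block cache, not only CDC, so it gives an upper bound on what the CDC components could be using.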

  • tikv_cdc_sink_memory_bytes (very small memory)

Size of the CDC change event cache waiting to be sent in TiKV

  • tikv_cdc_old_value_cache_bytes (very small memory)

Size of the old value cache

Speculation

The continuous memory growth on the TiKV machines is presumably caused by heavy memory consumption in the TiKV CDC components. Two questions follow: why does the memory consumption of the TiKV CDC components keep growing, and is there a way to release that memory online?

Column - TiCDC Architecture and Data Synchronization Link Analysis | TiDB Community (专栏 - TiCDC 架构和数据同步链路解析 | TiDB 社区)

Column - Summary of TiKV Main Memory Structure and OOM Troubleshooting | TiDB Community (专栏 - TiKV主要内存结构和OOM排查总结 | TiDB 社区)

TiCDC Overview | PingCAP Documentation Center (TiCDC 简介 | PingCAP 文档中心)

| username: Lucien-卢西恩

Based on the symptoms so far, the TiCDC component is only suspected of using a large amount of memory. Could you capture a flame graph of TiKV to confirm how much memory TiCDC actually uses?
Method 1: To Tune Performance, First Sharpen Your Tools (2): Flame Graphs | PingCAP (工欲性能调优,必先利其器(2)- 火焰图 | PingCAP)
Method 2: Use the profiling feature of TiDB Dashboard to capture a flame graph of TiKV memory consumption: TiDB Dashboard Instance Profiling Page | PingCAP Documentation Center (TiDB Dashboard 实例性能分析页面 | PingCAP 文档中心)

| username: jiaxin

The TiKV flame graph has been uploaded.
mem

| username: Lucien-卢西恩

According to the flame graph, CDC memory consumption is relatively low, while the unified pool accounts for about 20% of memory usage.

Check for slow queries and troubleshoot them.