TiKV memory keeps growing and eventually OOM

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 内存一直增长 最终oom

| username: TiDBer_ZsnVPQB4

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.0
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
The memory parameter (storage.block-cache.capacity) has been lowered to 16G from the default of 28G; the TiKV node has 64G of system memory.
Watching the CDC memory panel in the TiKV monitoring, the usage keeps increasing. How can we control it?
Reloading the TiCDC cluster does not release the memory occupied by the CDC component inside TiKV.
Why is the CDC memory usage so high? With all the synchronization tasks running, doesn't this affect the stability of the cluster?
The issue appeared after upgrading from version 5.2.1 to 6.5.0, so I suspect it may be related to the version.

| username: yulei7633 | Original post link

Synchronizing data, CDC memory naturally keeps growing.

| username: TiDBer_ZsnVPQB4 | Original post link

Is this reasonable? No reclamation, no release, no cap on the footprint? It occupies 7G at startup and keeps growing to 40G, causing tikv-server to OOM. This essentially means the synchronization software is undermining the stability of the cluster.

| username: Jasper | Original post link

It may be caused by a large wide table in the synchronization task. You can limit the memory usage of a single table through per-table-memory-quota. For details, please refer to TiCDC Overview | PingCAP Documentation Center.
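
For a TiUP-managed deployment, the change would look roughly like this (a sketch only; the cluster name is a placeholder and the quota value, in bytes, is just an example):

tiup cluster edit-config <cluster-name>
# under server_configs -> cdc, add: per-table-memory-quota: 10485760 (10 MiB)
tiup cluster reload <cluster-name> -R cdc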

| username: TiDBer_ZsnVPQB4 | Original post link

Setting it to 10MB and restarting CDC doesn't work. The CDC cluster has 128GB of memory, and each node only uses around 3GB; it is the CDC memory on the TiKV server nodes that is large.

According to the documentation, the main memory consumers on the TiKV server are storage.block-cache.capacity (28GB) and write-buffer-size (default 128MB). The TiKV server has 64GB of memory in total and is not co-deployed with other components. The memory-usage-limit that TiKV computes for itself is 48GB. Even after I lower storage.block-cache.capacity to 16GB, the TiKV server's total memory still climbs to 48GB and eventually OOMs. The CDC cluster's memory usage is very low; it is the TiKV server that is high, and I have no idea why.
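
For reference, the current values can be checked from TiDB (parameter names as in the TiKV configuration docs):

SHOW CONFIG WHERE type = 'tikv' AND name = 'memory-usage-limit';
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';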

| username: TiDBer_ZsnVPQB4 | Original post link

The memory load on the CDC node is very low, with no pressure at all.



However, the memory on the TiKV node keeps increasing.

| username: Jasper | Original post link

If possible, could you please share the TiKV-Details monitoring and TiCDC monitoring? Is there any delay in CDC synchronization now?
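
To check for lag from the command line, something like this should work (placeholders for your PD address and changefeed ID):

tiup ctl:v6.5.0 cdc changefeed list --pd=http://<pd-ip>:2379
tiup ctl:v6.5.0 cdc changefeed query --pd=http://<pd-ip>:2379 --changefeed-id=<changefeed-id>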

| username: TiDBer_ZsnVPQB4 | Original post link

There is no delay in CDC. I checked the TiKV logs, and previously the tikv.log files were very small. Since the upgrade, the tikv.log files have become very large, containing some CDC-related errors.

The current tikv.log is still reporting this error.

Could it be related to the parameters in the command used to create the TiCDC changefeed?

tiup ctl:v6.5.0 cdc changefeed create --pd=http://10.30.30.4:2379 --sink-uri="kafka://node2.prod.com:9092/ticdc?kafka-version=2.13.3&partition-num=3&max-message-bytes=128108864&replication-factor=3&protocol=canal-json&compression.type=lz4" --changefeed-id="ticdc-prd"

After reloading TiKV, the CDC memory drops back to 6-7G, but it keeps increasing again to 34G until TiKV OOMs.

| username: TiDBer_ZsnVPQB4 | Original post link

The TiKV logs are still reporting errors.

| username: xfworld | Original post link

Please provide information about the cluster configuration and deployment, don’t make everyone guess :upside_down_face:

| username: TiDBer_ZsnVPQB4 | Original post link

The TiKV configuration is all default (many parameters have been dynamically adjusted, but it didn’t help). Only a few parameters were set for the TiDB node.


2 TiDB, 5 TiKV, 3 PD, 2 TiCDC cluster, extracting data to Kafka. Currently, the memory of the TiKV-server node keeps increasing, specifically the CDC memory. The TiCDC synchronization checkpoint is progressing normally without delay.

| username: liuis | Original post link

pprof analysis time
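
For example (a sketch; 20180 is the default TiKV status port and 8300 the default TiCDC port, and heap profiling on TiKV itself requires jemalloc profiling to be enabled, so only the CPU profile is shown for TiKV):

curl -o tikv_cpu.prof "http://<tikv-ip>:20180/debug/pprof/profile?seconds=60"
curl -o cdc_heap.prof "http://<cdc-ip>:8300/debug/pprof/heap"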

| username: Jasper | Original post link

Use https://metricstool.pingcap.net/ to export the monitoring data of tikv-details and cdc as a JSON file and take a look.

| username: yilong | Original post link

That memory is not the memory of the CDC cluster itself. Check whether the resolved-ts.enable parameter is set to true, and try setting it to false to see if there is any improvement.
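
You can first confirm the current value from TiDB:

SHOW CONFIG WHERE type = 'tikv' AND name = 'resolved-ts.enable';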

| username: Billmay表妹 | Original post link

Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

Let’s take a look at your configuration.

| username: TiDBer_ZsnVPQB4 | Original post link

Couldn't find tikv. There is a pprof debug interface address.

| username: TiDBer_ZsnVPQB4 | Original post link

tidb-test-TiCDC_2023-05-16T08_10_09.304Z.rar (906.7 KB)
tidb-test-TiKV-Details_2023-05-16T08_00_47.362Z.rar (1.2 MB)

| username: TiDBer_ZsnVPQB4 | Original post link

Yes, it is the CDC memory. The CDC cluster itself uses very little memory; this is the CDC memory indicator in the TiKV server metrics. The parameter cannot be set dynamically. I'll wait for the memory to grow again and try modifying the configuration file and reloading it to see.
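
What I have in mind is roughly this (a sketch for a TiUP-managed cluster; the cluster name is a placeholder):

tiup cluster edit-config <cluster-name>
# under server_configs -> tikv, set: resolved-ts.enable: false
tiup cluster reload <cluster-name> -R tikv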

| username: TiDBer_ZsnVPQB4 | Original post link

(The reply contains only a screenshot; see the original post.)

| username: xfworld | Original post link

Where is ticdc installed?