Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: TiCDC [CDC:ErrSnapshotLostByGC] error; gc-ttl is set to 172800 but gc_safe_point keeps advancing
【TiDB Usage Environment】Production
【TiDB Version】v5.4.0
【Encountered Problem】
At around 3 AM on the 18th, the changefeed hit an oversized-message exception, which was not handled in time.
After the relevant parameters were adjusted at around 2 PM, the task reported the error ErrSnapshotLostByGC and could not continue.
[CDC:ErrSnapshotLostByGC] fail to create or maintain changefeed due to snapshot loss caused by GC. checkpoint-ts 434656888387272706 is earlier than or equal to GC safepoint at 434667331771695104
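For reference, the two TSO values in the error message embed physical timestamps (the bits above the 18-bit logical counter are milliseconds since the Unix epoch), so they can be decoded to see how far the changefeed checkpoint lagged behind the GC safepoint. A minimal sketch, assuming the standard TiDB TSO layout:

```go
package main

import (
	"fmt"
	"time"
)

// physicalFromTSO extracts the physical (wall-clock) part of a TiDB TSO.
// A TSO is the physical timestamp in milliseconds shifted left by 18 bits,
// plus an 18-bit logical counter in the low bits.
func physicalFromTSO(tso uint64) time.Time {
	ms := int64(tso >> 18)
	return time.UnixMilli(ms)
}

func main() {
	checkpointTS := uint64(434656888387272706) // checkpoint-ts from the error
	safepointTS := uint64(434667331771695104)  // GC safepoint from the error

	cp := physicalFromTSO(checkpointTS)
	sp := physicalFromTSO(safepointTS)
	fmt.Println("checkpoint:", cp.UTC())
	fmt.Println("safepoint: ", sp.UTC())
	fmt.Println("gap:       ", sp.Sub(cp))
}
```

Decoded this way, the checkpoint corresponds to roughly 02:20 on the 18th (UTC+8) and the GC safepoint to roughly 13:24 the same day, a gap of about 11 hours, which is well under the configured gc-ttl of 172800 seconds (48 hours).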
【Documentation Description】
The downstream remains abnormal, and TiCDC still fails after multiple retries.
In this scenario, TiCDC saves the task information. Because TiCDC has already set the service GC safepoint in PD, data newer than the replication task's checkpoint is not cleaned by TiKV GC within the validity period of gc-ttl.
gc-ttl: 172800
Why was the data GC’d so quickly, and are there any other parameters to control this?
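For context, the protection described in the documentation works by TiCDC registering a service GC safepoint in PD with a TTL (the gc-ttl value) and refreshing it with its minimum changefeed checkpoint-ts; if the safepoint stops being refreshed for longer than gc-ttl, GC can advance past the checkpoint. Below is a rough sketch of that mechanism using the PD Go client; the import path, PD address, and service ID are illustrative assumptions, not TiCDC's actual code:

```go
package main

import (
	"context"
	"fmt"

	pd "github.com/tikv/pd/client" // assumed import path for the PD client
)

func main() {
	ctx := context.Background()

	// Connect to PD (the address is an assumption for this example).
	cli, err := pd.NewClient([]string{"127.0.0.1:2379"}, pd.SecurityOption{})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// A service GC safepoint asks TiKV GC not to remove data newer than
	// `safePoint` while the TTL is kept alive. TiCDC refreshes this with
	// its minimum changefeed checkpoint-ts.
	const gcTTL = 172800 // seconds, matching the gc-ttl in question
	checkpointTS := uint64(434656888387272706)

	minSafePoint, err := cli.UpdateServiceGCSafePoint(ctx, "ticdc-example", gcTTL, checkpointTS)
	if err != nil {
		panic(err)
	}

	// PD returns the minimum safepoint across all services; if it is already
	// ahead of the requested checkpoint-ts, the snapshot is no longer
	// protected, which is what ErrSnapshotLostByGC reports.
	fmt.Println("minimum service GC safepoint:", minSafePoint)
}
```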
【Reproduction Path】What operations were performed to cause the problem
【Problem Phenomenon and Impact】
【Attachments】
- Relevant logs, configuration files, Grafana monitoring (https://metricstool.pingcap.com/)
- TiUP Cluster information
- TiUP Cluster Edit config information
- TiDB-Overview monitoring
- Corresponding module Grafana monitoring (if any, such as BR, TiDB-binlog, TiCDC, etc.)
- Corresponding module logs (including logs one hour before and after the problem)
20220718TiCDC问题排查确认.txt (14.9 KB)
cdc0718.log.tar.gz (8.4 MB)
If the question is about performance optimization or fault troubleshooting, please download and run the diagnostic script, then select all of the terminal output and paste it here for upload.