TiCDC CDC:ErrSnapshotLostByGC Error, gc-ttl Configuration 172800, gc_safe_point Keeps Advancing

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC CDC:ErrSnapshotLostByGC报错,gc-ttl配置 172800 ,gc_safe_point 一直再推进

| username: Jeff-Ye

【TiDB Usage Environment】Production
【TiDB Version】v5.4.0
【Encountered Problem】
An oversized-message exception occurred at 3 AM on the 18th and was not handled in time.
After adjusting the relevant parameters around 2 PM, the changefeed reported ErrSnapshotLostByGC and could not continue:
[CDC:ErrSnapshotLostByGC] fail to create or maintain changefeed due to snapshot loss caused by GC. checkpoint-ts 434656888387272706 is earlier than or equal to GC safepoint at 434667331771695104

【Documentation Description】
The downstream remains abnormal, and TiCDC still fails after multiple retries.
In this scenario, TiCDC saves the task information. Because TiCDC has already set the service GC safepoint in PD, data after the replication task's checkpoint is not cleaned by TiKV GC within the effective period of gc-ttl.

gc-ttl: 172800

Why was the data GC’d so quickly, and are there any other parameters to control this?
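
For reference, the two TSO values in the error message can be decoded into wall-clock times to see how far the changefeed checkpoint had fallen behind. Below is a minimal Go sketch, assuming the standard TiDB TSO layout (physical Unix milliseconds in the high 46 bits, logical counter in the low 18 bits):

```go
package main

import (
    "fmt"
    "time"
)

// extractPhysical returns the wall-clock part of a TiDB TSO:
// the physical timestamp (Unix milliseconds) is stored in the high 46 bits,
// and a logical counter in the low 18 bits.
func extractPhysical(tso uint64) time.Time {
    return time.UnixMilli(int64(tso >> 18))
}

func main() {
    checkpointTS := uint64(434656888387272706) // changefeed checkpoint-ts from the error
    gcSafepoint := uint64(434667331771695104)  // GC safepoint from the error

    cp := extractPhysical(checkpointTS)
    sp := extractPhysical(gcSafepoint)

    fmt.Println("checkpoint-ts :", cp.UTC())
    fmt.Println("gc safepoint  :", sp.UTC())
    fmt.Println("checkpoint lag:", sp.Sub(cp))
}
```

Decoded this way, the GC safepoint is roughly 11 hours ahead of the checkpoint, which matches the gap between the 3 AM incident and the 2 PM parameter adjustment.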


| username: 箱子NvN | Original post link

The tidb_gc_life_time system variable was introduced starting from v5.0; a quick way to check and adjust it is sketched after the list below.

  • Scope: GLOBAL
  • Persisted to the cluster: Yes
  • Default value: 10m0s
  • Range: [10m0s, 8760h0m0s]
  • This variable is used to specify the retention period for data during each garbage collection (GC). The variable value is in Go’s Duration string format. During each GC, the safe point is determined by subtracting the value of this variable from the current time.
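
For illustration, here is a minimal Go sketch that checks and raises tidb_gc_life_time over the MySQL protocol. The DSN and the 72h value are assumptions to be adjusted for the actual cluster:

```go
package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL protocol
)

func main() {
    // Placeholder DSN: adjust user, password, host, and port (4000 is TiDB's default).
    db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Read the current GC retention window.
    var gcLifeTime string
    if err := db.QueryRow("SELECT @@global.tidb_gc_life_time").Scan(&gcLifeTime); err != nil {
        log.Fatal(err)
    }
    fmt.Println("current tidb_gc_life_time:", gcLifeTime)

    // Temporarily extend the retention window (Go duration string) so the
    // changefeed checkpoint stays ahead of the GC safe point while it catches up.
    if _, err := db.Exec("SET GLOBAL tidb_gc_life_time = '72h'"); err != nil {
        log.Fatal(err)
    }
    fmt.Println("tidb_gc_life_time raised to 72h; remember to lower it again afterwards")
}
```
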
| username: 箱子NvN | Original post link

There are also some other parameters related to GC that you can check in the official documentation. The link is as follows:

After opening it, just search for GC.

| username: songxuecheng | Original post link

During the interruption, the data had already been cleaned up because tidb_gc_life_time was too short, which caused this issue.

| username: neilshen | Original post link

Please provide the PD leader logs to investigate the issue.

| username: Jeff-Ye | Original post link

pd.log.tar.gz (7.4 MB)

| username: Jeff-Ye | Original post link

After adjusting the tidb_gc_life_time parameter, things seem to be back to normal.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.