TiCDC Reports ErrSnapshotLostByGC During a BR Full Backup

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: BR全量备份时,ticdc报错ErrSnapshotLostByGC

| username: wfxxh

【TiDB Usage Environment】Production Environment
【TiDB Version】v6.5.3
【Reproduction Path】Performing a full backup of the cluster using BR

image

image

My TiCDC has been running for several days now. The TSO 442730843081539585 is the one I specified when I first started TiCDC on July 9, 2023.
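
To check which wall-clock time a TSO such as the one above corresponds to, it can be decoded by hand. The following is a minimal sketch, assuming the usual TiDB TSO layout (physical milliseconds since the Unix epoch in the high bits, an 18-bit logical counter in the low bits). The function name `decode_tso` is made up for illustration and is not part of any TiDB tool.

```python
from datetime import datetime, timezone

# Assumption: the low 18 bits of a TiDB TSO hold the logical counter,
# and the remaining high bits hold milliseconds since the Unix epoch.
LOGICAL_BITS = 18


def decode_tso(tso: int) -> tuple[datetime, int]:
    """Split a TSO into (physical time in UTC, logical counter)."""
    physical_ms = tso >> LOGICAL_BITS
    logical = tso & ((1 << LOGICAL_BITS) - 1)
    seconds, millis = divmod(physical_ms, 1000)
    when = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return when.replace(microsecond=millis * 1000), logical


when, logical = decode_tso(442730843081539585)
print(when.isoformat(), logical)
# prints: 2023-07-09T06:28:56.511000+00:00 1
```

The decoded time, 06:28:56 UTC (14:28:56 in UTC+8), matches the July 9, 2023 start date mentioned in the post.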

| username: tidb菜鸟一只 | Original post link

Please send the backup script.

| username: wfxxh | Original post link

| username: 裤衩儿飞上天 | Original post link

Isn’t a GC time of 1 hour too short?
By the time the target side has restored the database and CDC is started, hasn’t the source database’s 1-hour GC time already been exceeded?

| username: wfxxh | Original post link

TiCDC was started a few days ago.

| username: wfxxh | Original post link

Let me add a bit more. The TiCDC on the cluster where I am performing the backup has been running for almost two weeks. After running BR for a while today, TiCDC reported an error.

| username: tidb菜鸟一只 | Original post link

Did the BR backup succeed in the end? Check what the last physical backup time point was.

tiup br validate decode --field="end-version" \
    --storage "s3://backup-101/snapshot-202209081330?access-key=${ACCESS_KEY}&secret-access-key=${SECRET_ACCESS_KEY}" | tail -n1

| username: wfxxh | Original post link

Backup successful, the value is

| username: tidb菜鸟一只 | Original post link

Then your TiCDC error shouldn’t be related to this BR backup…

| username: 裤衩儿飞上天 | Original post link

It’s possible that there was an error when you started CDC.
Check if any data has been synchronized after the time you started it.

| username: wfxxh | Original post link

The thing is, my TiCDC had been running for almost two weeks without any errors.

| username: wfxxh | Original post link

Logs from before TiCDC crashed:


| username: 裤衩儿飞上天 | Original post link

Could you please provide a complete CDC log? The information in your screenshot doesn’t seem to be from the first error occurrence.

| username: wfxxh | Original post link

This is the time point of the first error. :sweat_smile:

| username: wfxxh | Original post link

cdc-20230720.log (1.3 MB)

| username: 裤衩儿飞上天 | Original post link

Check the status of changefeed bak-tidb (see “Manage Changefeed” in the PingCAP Documentation Center).
See where the current checkpoint is, and then check the logs for the corresponding time period.

| username: wfxxh | Original post link

Brother, my first screenshot is the status of changefeed bak-tidb. I also took a screenshot of the checkpoint and uploaded the logs for the relevant time period.

| username: 裤衩儿飞上天 | Original post link

  1. The error is raised because the changefeed tried to use data that has already been garbage collected.
  2. The troubleshooting approach is to find out how such an early checkpoint came about.
    Start from the beginning: check when the task was created,
    and what operations were performed after it was created (look at the logs for those time points).
  3. The information in your screenshots is not complete enough.

You can follow this approach to go through it. If everything is confirmed to be fine, then it might be a bug.
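
The check behind step 1 can be sketched as a comparison of two TSOs: the changefeed checkpoint and the cluster GC safepoint. This is a simplified illustration, not TiCDC's actual code. The helper names are invented, the safepoint value is hypothetical (the thread does not give one), and the exact boundary condition TiCDC uses may differ.

```python
from datetime import timedelta

# Assumption: a TiDB TSO keeps an 18-bit logical counter in its low bits
# and milliseconds since the Unix epoch in the remaining high bits.
LOGICAL_BITS = 18


def tso_physical_ms(tso: int) -> int:
    """Return the physical (millisecond) part of a TSO."""
    return tso >> LOGICAL_BITS


def compose_tso(physical_ms: int, logical: int = 0) -> int:
    """Build a TSO from a millisecond timestamp and a logical counter."""
    return (physical_ms << LOGICAL_BITS) | logical


def snapshot_lost(checkpoint_ts: int, gc_safepoint: int) -> bool:
    """Simplified rule: data older than the GC safepoint may be gone."""
    return checkpoint_ts < gc_safepoint


def lag_behind_safepoint(checkpoint_ts: int, gc_safepoint: int) -> timedelta:
    """How far the checkpoint trails the GC safepoint in wall-clock time."""
    delta_ms = tso_physical_ms(gc_safepoint) - tso_physical_ms(checkpoint_ts)
    return timedelta(milliseconds=delta_ms)


# Checkpoint TSO quoted in the thread.
checkpoint_ts = 442730843081539585
# HYPOTHETICAL safepoint, 11 days after the checkpoint, for illustration only.
gc_safepoint = compose_tso(tso_physical_ms(checkpoint_ts) + 11 * 24 * 3600 * 1000)

print(snapshot_lost(checkpoint_ts, gc_safepoint))
print(lag_behind_safepoint(checkpoint_ts, gc_safepoint))
```

If the checkpoint really is still at the TSO the task was created with, as reported above, any safepoint that has moved past that point would produce this condition, so the useful question is why the checkpoint did not advance or why the safepoint moved.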

| username: wfxxh | Original post link

That checkpoint is the time at which I first started the TiCDC task.

| username: 裤衩儿飞上天 | Original post link

  1. Use this command to find the changefeed information:
    cdc cli changefeed query --server=http://10.0.10.25:8300 --changefeed-id=simple-replication-task

  2. Retrieve the logs corresponding to the checkpoint time, that is, the logs from when the task was created.
    There might have been an issue when this changefeed was created.