Help Needed: Adding a New Sync Task to a Downstream TiDB Database in v7.1.1 Causes TiCDC to Restart Continuously

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 求助:V7.1.1 新增同步任务到下游 TiDB 数据库导致TICDC不断重启

| username: porpoiselxj

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.1.1
[Reproduction Path]

  1. Use br backup db to back up the specified database to external storage, then restore it into the downstream TiDB database (same version).
  2. Create a changefeed with start-ts specified and replicate the incremental data to the downstream (a command sketch follows below).
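
A minimal sketch of this reproduction path, assuming placeholder PD/TiCDC addresses, storage path, and sink credentials (none of these values come from the post); the BackupTS printed by br backup is what would be passed as --start-ts:

```shell
# 1. Back up one database upstream, noting the BackupTS that br prints,
#    then restore it into the downstream cluster of the same version.
tiup br backup db --db mydb \
    --pd "upstream-pd:2379" \
    --storage "s3://backup-bucket/mydb-full"
tiup br restore db --db mydb \
    --pd "downstream-pd:2379" \
    --storage "s3://backup-bucket/mydb-full"

# 2. Create a changefeed starting from the BackupTS so that only incremental
#    data written after the backup is replicated to the downstream TiDB.
tiup cdc cli changefeed create \
    --server=http://ticdc-host:8300 \
    --changefeed-id="mydb-to-downstream" \
    --sink-uri="mysql://user:password@downstream-tidb:4000/" \
    --start-ts=<BackupTS>
```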

[Encountered Problem: Phenomenon and Impact]
TiCDC keeps restarting and occasionally reports errors like “resolved ts should not be less than checkpoint ts”; after a while it recovers on its own.

PS: There was an existing changefeed pushing to Kafka. Adding a new changefeed causes the original changefeed to get stuck and leads to continuous CDC restarts. There is still a batch of tables that haven’t been added, and we don’t dare to add them for fear of causing the original changefeed to get stuck and become unrecoverable.

Initially, we thought it was due to loading too many tables at once (around 700 tables). Later, we divided these 700 tables into batches, but the above problem still occurred.

cdc.log.rar (1.6 MB)

| username: 路在何chu | Original post link

Run cdc cli changefeed query -s and check whether the task reports any errors.
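
For reference, the fuller form of that command looks roughly like the following; the server address and changefeed ID are placeholders, not values from this thread:

```shell
# Query one changefeed's state and most recent error (simplified output).
tiup cdc cli changefeed query -s \
    --server=http://ticdc-host:8300 \
    --changefeed-id="mydb-to-downstream"

# Listing all changefeeds also shows state and checkpoint at a glance.
tiup cdc cli changefeed list --server=http://ticdc-host:8300
```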

| username: porpoiselxj | Original post link

The only error during this period is “resolved ts should not be less than checkpoint ts”. The changefeed status shows error, but it recovers after a while, and then TiCDC keeps restarting.

| username: 路在何chu | Original post link

I think I’ve encountered this before; your TSO is too early.

| username: porpoiselxj | Original post link

The start TSO is set strictly from the backup TS generated by the BR backup, so there is definitely no mistake there. After the new changefeed is created, its TSO does not advance, and occasionally the error above appears, followed by continuous restarts; after restarting several times it suddenly starts working. If it were a wrong TSO setting, it would never have worked at all. Right now, whether it runs properly seems to come down to luck; I had to keep splitting the tables into smaller batches along the way just to barely get it running.

| username: 路在何chu | Original post link

You can log in with tiup ctl:v4.0.13 pd -i -u http://10.xx:2379, check the wall-clock time corresponding to that TSO, and then see whether GC has already cleaned the data up.
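
A sketch of that check, using a placeholder PD address and TSO value and a ctl version matching the cluster (the v4.0.13 above is just from the example command): the tso subcommand decodes a TSO into a wall-clock time, and service-gc-safepoint shows how far GC has advanced.

```shell
# Decode the changefeed start-ts into a wall-clock time.
tiup ctl:v7.1.1 pd -u http://pd-host:2379 tso 444444444444444444

# List the service GC safepoints; the start-ts must be newer than the
# smallest safepoint shown here, otherwise that data has already been GC'd.
tiup ctl:v7.1.1 pd -u http://pd-host:2379 service-gc-safepoint
```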

| username: porpoiselxj | Original post link

GC was turned off before the operation, at least 3 hours earlier than the TSO I backed up at.
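
For completeness, one common way to pause GC during this kind of operation (the TiDB address below is a placeholder, not from the post) is the tidb_gc_enable variable; the mysql.tidb rows show the current GC safepoint and life time:

```shell
# Pause GC before the restore + changefeed setup; re-enable it once the
# changefeed's checkpoint has advanced past the backup TSO.
mysql -h upstream-tidb -P 4000 -u root -p -e \
  "SET GLOBAL tidb_gc_enable = OFF;"

# Verify GC is off and check the current GC safepoint / life time.
mysql -h upstream-tidb -P 4000 -u root -p -e \
  "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc_%';"
```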

| username: 路在何chu | Original post link

However, if your restore takes too long, the data will still get cleaned up. I once had a restore that ran for more than 24 hours, and creating a task afterwards hit this error because GC had already cleaned the data up.

| username: 路在何chu | Original post link

Oh, okay, then let’s look in other directions.

| username: Fly-bird | Original post link

Could you recreate this task using the current default time to see if it works properly?

| username: porpoiselxj | Original post link

With the default time it runs normally, but I can’t start from the current time.

| username: porpoiselxj | Original post link

Can someone help check the logs? Don’t let this sink!

| username: TiDBer_小阿飞 | Original post link

The root cause might be that the resolved ts and the checkpoint ts are calculated in different threads, which can lead to the checkpoint ts ending up greater than the resolved ts.

| username: TiDBer_小阿飞 | Original post link

https://github.com/pingcap/tiflow/pull/9434/files

| username: porpoiselxj | Original post link

It does seem possible that the issue is probabilistic, since it eventually recovers on its own. But what is the reason for TiCDC restarting continuously?

TiCDC logs:

PD logs:

| username: TiDBer_小阿飞 | Original post link

It seems the issue is caused by TS desynchronization, which keeps TiCDC from finding its metadata in etcd when it tries to update it. This is just my personal understanding!

TiCDC uses the etcd within PD to store metadata and updates it periodically. Due to etcd’s Multi-Version Concurrency Control (MVCC) and PD’s default compaction interval of 1 hour, the storage space occupied by TiCDC in PD is proportional to the number of metadata versions within that 1 hour.

| username: TiDBer_小阿飞 | Original post link

Or TiCDC has filled up PD’s etcd storage space.
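
If that theory is worth checking, the embedded etcd’s DB size and alarms can be inspected through PD’s client port; this sketch assumes etcdctl is installed and uses a placeholder PD address:

```shell
# Show the etcd DB size behind PD; a value near the configured quota
# (or an active NOSPACE alarm) would support the "etcd is full" theory.
ETCDCTL_API=3 etcdctl --endpoints=http://pd-host:2379 endpoint status --write-out=table
ETCDCTL_API=3 etcdctl --endpoints=http://pd-host:2379 alarm list
```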

| username: porpoiselxj | Original post link

Are there any parameters that can avoid this problem as much as possible?

| username: TiDBer_小阿飞 | Original post link

  • --start-ts: Specifies the start TSO of the changefeed. The TiCDC cluster will start pulling data from this TSO. The default is the current time.
  • --target-ts: Specifies the target TSO of the changefeed. The TiCDC cluster will pull data until this TSO and then stop. The default is empty, meaning TiCDC will not stop automatically.
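
If memory serves, BR can also decode the exact BackupTS back out of the backup metadata, so --start-ts matches the restore point precisely; the storage path below is a placeholder, and --target-ts is simply left unset so the changefeed keeps running:

```shell
# Read the BackupTS recorded in the BR backup metadata and use it as
# --start-ts when creating the changefeed; leaving --target-ts unset means
# replication does not stop at a fixed TSO.
tiup br validate decode --field="end-version" \
    --storage "s3://backup-bucket/mydb-full" | tail -n1
```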

| username: porpoiselxj | Original post link

--start-ts is already specified, and since the tables are backed up and restored in small batches, the interval is very short, never more than 1 hour.