Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Help: on v7.1.1, adding a new replication task to the downstream TiDB database causes TiCDC to keep restarting
[TiDB Usage Environment] Production Environment
[TiDB Version] v7.1.1
[Reproduction Path]
- Use br backup db to back up a specified database to external storage, then restore it to the downstream TiDB cluster (same version).
- Create a changefeed with --start-ts set to the backup TS and replicate the incremental data to the downstream.
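Roughly, the commands would look something like the following sketch (the database name, PD/CDC/sink addresses, and the storage path are placeholders, and BACKUP_TS stands for the BackupTS printed by BR):

```shell
# Back up one database from the upstream cluster.
tiup br backup db --db app_db \
    --pd "upstream-pd:2379" \
    --storage "s3://backup-bucket/app_db-full"

# Restore it into the downstream TiDB cluster of the same version.
tiup br restore db --db app_db \
    --pd "downstream-pd:2379" \
    --storage "s3://backup-bucket/app_db-full"

# Create a changefeed that replicates incremental changes to the downstream,
# starting from the BackupTS that BR printed when the backup finished.
BACKUP_TS="<the BackupTS printed by BR>"
tiup cdc cli changefeed create \
    --server=http://cdc-server:8300 \
    --sink-uri="mysql://user:password@downstream-tidb:4000/" \
    --changefeed-id="app-db-incremental" \
    --start-ts="${BACKUP_TS}"
```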
[Encountered Problem: Phenomenon and Impact]
TiCDC keeps restarting, occasionally reporting errors like “resolved ts should not be less than checkpoint ts”, and then it automatically recovers after a while.
PS: There was already an existing changefeed pushing to Kafka. Adding a new changefeed causes the original changefeed to get stuck and leads to continuous CDC restarts. There is still a batch of tables that hasn’t been added, and we don’t dare to add them for fear of getting the original changefeed stuck beyond recovery.
Initially, we thought it was due to loading too many tables at once (around 700 tables). Later, we divided these 700 tables into batches, but the above problem still occurred.
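One way to carve each batch out is a per-changefeed filter config, along these lines (the table patterns, addresses, and start TS are placeholders):

```shell
# A changefeed config that only replicates one batch of tables.
cat > batch1.toml <<'EOF'
[filter]
# Only the tables in this batch; everything else is ignored by this changefeed.
rules = ['app_db.orders_*', 'app_db.users']
EOF

# Create the changefeed for this batch, still starting from the backup TS.
tiup cdc cli changefeed create \
    --server=http://cdc-server:8300 \
    --sink-uri="mysql://user:password@downstream-tidb:4000/" \
    --changefeed-id="app-db-batch1" \
    --start-ts="${BACKUP_TS}" \
    --config=batch1.toml
```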
cdc.log.rar (1.6 MB)
Run cdc cli changefeed query -s to check whether the task reports any errors.
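For example, roughly (the server address and changefeed ID are placeholders):

```shell
# Simplified status of one changefeed: state, checkpoint TSO, and any error message.
tiup cdc cli changefeed query -s \
    --server=http://cdc-server:8300 \
    --changefeed-id="app-db-incremental"

# List all changefeeds and their states at a glance.
tiup cdc cli changefeed list --server=http://cdc-server:8300
```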
The only error during this period is: resolved ts should not be less than checkpoint ts. The status is error, but it recovers after a while, and then CDC keeps restarting.
I think I’ve encountered this before; your start TSO is too early.
The TSO is set strictly to the backup TS generated by the BR backup, so there is definitely no mistake there. After creating a new changefeed, the TSO does not advance, and occasionally the above error occurs, followed by continuous restarts. After restarting several times, it suddenly starts working. If it were a TSO setting issue, it should never have worked. Right now, whether it runs properly seems to come down to luck; I had to keep splitting the tables into smaller batches to barely get it running.
You can use tiup ctl:v4.0.13 pd -i -u http://10.xx:2379 to log in to pd-ctl, check the time that the TSO corresponds to, and see whether GC has already cleaned up that data.
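For example, roughly like this (the PD address is a placeholder, the ctl version should match your cluster, and START_TS stands for the changefeed --start-ts value):

```shell
# Decode the start TSO into a wall-clock time with pd-ctl.
START_TS="<the changefeed --start-ts value>"
tiup ctl:v7.1.1 pd -u http://pd-host:2379 tso "${START_TS}"

# Check the service GC safepoints; the start TSO must be newer than the GC safepoint.
tiup ctl:v7.1.1 pd -u http://pd-host:2379 service-gc-safepoint
```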
GC was turned off before the operation, at least 3 hours before the backup TSO was taken.
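A minimal way to double-check that state, assuming access to the upstream TiDB via the MySQL client (host, port, and user are placeholders):

```shell
# Confirm GC is disabled and inspect the GC-related settings and safe point.
mysql -h upstream-tidb -P 4000 -u root -p -e "
  SHOW GLOBAL VARIABLES LIKE 'tidb_gc_enable';
  SHOW GLOBAL VARIABLES LIKE 'tidb_gc_life_time';
  SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc_%';"
```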
However, if your restore takes too long, the data will still be cleaned up. I once had a restore that ran for more than 24 hours, and creating a task afterwards hit this error because GC had already cleaned up the data.
Oh, okay, then let’s look at other directions.
Could you recreate this task using the current default time to see if it works properly?
With the default (current) time it runs normally, but I can’t start from the current time, because I need the incremental data from the backup TS onwards.
Can someone help check the logs? Don’t let this sink!
The root cause might be that the resolved ts and the checkpoint ts are calculated in different threads, which can occasionally lead to the checkpoint ts being greater than the resolved ts.
It does look like a probabilistic issue, since it recovers automatically later on. But what is the reason for the continuous restarts of CDC?
TiCDC logs:
PD logs:
It seems that the issue is caused by TSO desynchronization, which leaves TiCDC unable to find its metadata in etcd when updating data. This is just my personal understanding!
TiCDC uses the etcd within PD to store metadata and updates it periodically. Due to etcd’s Multi-Version Concurrency Control (MVCC) and PD’s default compaction interval of 1 hour, the storage space occupied by TiCDC in PD is proportional to the number of metadata versions within that 1 hour.
Or TiCDC may have used up PD’s etcd storage space.
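If etcd space were really the suspect, one rough way to check (assuming etcdctl is available on the host; the PD endpoint is a placeholder):

```shell
# Inspect PD's embedded etcd: reported DB size and whether any space-quota alarm is active.
ETCDCTL_API=3 etcdctl --endpoints=http://pd-host:2379 endpoint status -w table
ETCDCTL_API=3 etcdctl --endpoints=http://pd-host:2379 alarm list
```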
Are there any parameters that can avoid this problem as much as possible?
--start-ts is specified, and for the small-batch table backup and restore the interval is very short, not exceeding 1 hour.