TICDC Single Table Synchronization Exception, Help Needed!

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TICDC单个表同步异常,求救!!!

| username: jaybing926

【TiDB Usage Environment】Production environment
【TiDB Version】v4.0.9
【Reproduction Path】Operations performed that led to the issue
BR performs a full backup on the 1st of every month and incremental backups every day.
BR first restores the full backup from the 1st to the new TiDB cluster, then restores that day's incremental data.
CDC is then configured to start synchronization from the TSO of that day's incremental backup.
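
For readers, the workflow roughly corresponds to commands of this shape (a sketch only; the PD address, storage paths, and timestamp variables are placeholders, not the actual values used):

```shell
# Monthly full backup on the 1st (placeholder addresses and paths).
br backup full --pd "127.0.0.1:2379" --storage "local:///backups/full-20230301"

# Daily incremental backup: only changes after the previous backup's TSO.
br backup full --pd "127.0.0.1:2379" --storage "local:///backups/inc-20230314" \
    --lastbackupts ${LAST_BACKUP_TS}

# On the new cluster: restore the full backup first, then the incremental one.
br restore full --pd "127.0.0.1:2379" --storage "local:///backups/full-20230301"
br restore full --pd "127.0.0.1:2379" --storage "local:///backups/inc-20230314"
```

The changefeed is then created with its start TSO set to the TSO recorded for that day's incremental backup (see the configuration section further down).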

【Encountered Issue: Symptoms and Impact】
One table (this table is particularly large, with approximately 6 billion rows) has an abnormal synchronization status, while other tables seem to be normal.

The cdc log keeps reporting this kind of information:

[log excerpt not included in the export]

I'd like to ask what is going on here. The incremental start time is midnight today, only about 20 hours ago, so how can the lag be 53.2 years?

| username: tidb狂热爱好者 | Original post link

So the system estimates it will take 52 years to synchronize?

| username: jaybing926 | Original post link

It seems like that's the case. I guess the table might be too large and the old cluster's performance isn't great, but it still shouldn't take 53 years. I can't figure it out and don't know how to handle it now. Even if performance is weak, the old cluster is serving online traffic and its data is less than a day behind, so where does the 53 years come from? If something were wrong, the task should report an error and exit, but its status has stayed normal the whole time.

| username: sdojjy | Original post link

The metrics are probably abnormal. The metric most likely read a timestamp of 0, so the lag is computed from the Unix epoch: 1970 + 53 = 2023.
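
To illustrate the arithmetic (a sketch; the TSO value below is made up): a TiDB TSO stores milliseconds since the Unix epoch in its high bits, so a checkpoint read as 0 maps to 1970-01-01, and the "lag" becomes roughly 2023 − 1970 ≈ 53 years.

```shell
# Convert a TSO to wall-clock time: the physical part is (tso >> 18) milliseconds
# since the Unix epoch (the low 18 bits are the logical counter). Requires GNU date.
tso=440000000000000000            # made-up example value, roughly March 2023
date -u -d @"$(( (tso >> 18) / 1000 ))"

# A checkpoint of 0 maps to the epoch, so the computed lag is ~53 years in 2023.
date -u -d @"$(( (0 >> 18) / 1000 ))"   # 1970-01-01 00:00:00 UTC
```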

| username: jaybing926 | Original post link

How should this be confirmed?

| username: sdojjy | Original post link

The time point to which it has been synchronized is shown in this image.
However, your CDC version is quite old, so there may be other unknown bugs.

| username: jaybing926 | Original post link

This time is the initial time, which is the time I started the incremental recovery. It has always been like this from the beginning and hasn’t changed.

| username: CuteRay | Original post link

Check the time on these machines. Are these machines using the same NTP server?

| username: jaybing926 | Original post link

Yes, both the new and old clusters use chronyd to synchronize with a local NTP server and Alibaba Cloud's NTP servers.
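
For reference, the sync state on each node can be checked with standard chrony commands (nothing TiDB-specific):

```shell
# Show the current offset and the NTP source chronyd is tracking.
chronyc tracking
# List the configured NTP sources and their reachability.
chronyc sources -v
```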

| username: liuis | Original post link

It should be a metric anomaly.

| username: jaybing926 | Original post link

Are you saying the Prometheus monitoring metrics are abnormal? How can we check whether the actual status is normal?
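
One way to cross-check outside Prometheus is to ask the CDC owner directly via the cdc command-line tool (a sketch; the PD address and changefeed ID below are placeholders):

```shell
# List all changefeeds with their state and checkpoint.
cdc cli changefeed list --pd=http://127.0.0.1:2379

# Query one changefeed in detail; the checkpoint reported here comes from
# the CDC owner itself rather than from the Prometheus metrics.
cdc cli changefeed query --pd=http://127.0.0.1:2379 --changefeed-id=my-task
```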

| username: jaybing926 | Original post link

The server is a new machine with a configuration of 48 cores and 256GB of memory. There is still plenty of memory left, and the CDC process is only using 100GB of memory.

Here are some screenshots from Grafana:

[Grafana screenshots not included in the export]
| username: jaybing926 | Original post link

Below are my CDC synchronization configuration file and commands:

[screenshot of the changefeed configuration file and cdc cli command not included in the export]
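
Since the screenshot did not survive the export, for context a TiCDC changefeed configuration and create command typically look roughly like this (every value below is a placeholder, not the poster's actual configuration):

```shell
# A minimal changefeed configuration file (placeholder filter rules).
cat > changefeed.toml <<'EOF'
case-sensitive = true

[filter]
# Table filter rules: which schemas/tables this changefeed replicates.
rules = ['*.*']
EOF

# Create the changefeed, starting from the TSO of the day's incremental backup.
cdc cli changefeed create \
    --pd=http://127.0.0.1:2379 \
    --sink-uri="mysql://user:password@127.0.0.1:4000/" \
    --start-ts=${INCREMENTAL_BACKUP_TSO} \
    --config=changefeed.toml \
    --changefeed-id=restore-sync
```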

| username: CuteRay | Original post link

What does the task status look like when queried with cdcctl?

| username: sdojjy | Original post link

Are there still a lot of warnings in the CDC log now?
You can export the Grafana metrics using the method described in the FAQ post "[FAQ] Grafana Metrics 页面的导出和导入" (Exporting and Importing the Grafana Metrics Page) on the TiDB Q&A community.

| username: jaybing926 | Original post link

It’s still the same as the screenshot above, no change, the status is also normal.

| username: asddongmen | Original post link

  1. Hello, I suggest using filter rules to synchronize this table with a separate changefeed and the remaining tables with another changefeed (a sketch of such filter rules follows after this list). This can help advance the synchronization progress of the other tables and reduce resource competition among multiple tables within the same changefeed.
  2. Please provide the number of Regions in the upstream TiKV cluster.
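
A sketch of what the two changefeed configs could look like (the database and table names are placeholders; check that the '!' exclusion syntax of the table filter is supported by your TiCDC version):

```shell
# Changefeed A: only the very large table.
cat > big-table.toml <<'EOF'
[filter]
rules = ['mydb.big_table']
EOF

# Changefeed B: everything else, excluding the large table.
cat > other-tables.toml <<'EOF'
[filter]
rules = ['*.*', '!mydb.big_table']
EOF
```
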
| username: jaybing926 | Original post link

Yes, there are also a large number of warn logs.

Below is the Grafana export:
test-cluster-CDC_2023-03-15T03_19_02.996Z.json (7.0 MB)

| username: jaybing926 | Original post link

Does point 1 mean stopping the current task and redoing it?

| username: asddongmen | Original post link

Yes, correct. Record the CheckpointTs and create two new tasks.
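
A rough outline of that procedure (changefeed IDs, addresses, and sink URI are placeholders; double-check the exact cdc cli usage against the v4.0 documentation before running):

```shell
PD=http://127.0.0.1:2379

# 1. Record the current checkpoint of the existing changefeed.
cdc cli changefeed query --pd=${PD} --changefeed-id=old-task   # note the checkpoint TSO

# 2. Stop the old task.
cdc cli changefeed pause  --pd=${PD} --changefeed-id=old-task
cdc cli changefeed remove --pd=${PD} --changefeed-id=old-task

# 3. Create two new changefeeds from the recorded checkpoint:
#    one for the large table, one for everything else (see the filter configs above).
cdc cli changefeed create --pd=${PD} --sink-uri="mysql://user:password@127.0.0.1:4000/" \
    --start-ts=${CHECKPOINT_TS} --config=big-table.toml    --changefeed-id=big-table-task
cdc cli changefeed create --pd=${PD} --sink-uri="mysql://user:password@127.0.0.1:4000/" \
    --start-ts=${CHECKPOINT_TS} --config=other-tables.toml --changefeed-id=other-tables-task
```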