Setting up a TiDB standby database: PITR restore first, then errors when using CDC to synchronize incremental data

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 搭建tidb备库,先pitr恢复,再用cdc同步增量数据异常

| username: porpoiselxj

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.1.3
[Reproduction Path]
We are preparing to set up a TiDB standby database. In earlier versions, creating a CDC changefeed with a start point far in the past (high replication lag) would often fail to start, so this time we adopted the following strategy (a command-level sketch follows the list):

  1. Deploy BR log incremental backup on the primary database

  2. Perform a full database backup on the primary database

  3. Perform a PITR full + incremental restore on the standby database, followed by multiple further PITR incremental restores, all of which reported success without any anomalies

  4. Establish CDC on the primary database to synchronize incremental data to the standby database

  5. Various errors occurred after establishing CDC

  6. Continued PITR incremental recovery, which was still successful.
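
For reference, a minimal command-level sketch of steps 1–4, assuming `tiup br`/`tiup cdc` and placeholder PD addresses, storage URIs, and TSOs:

```shell
# 1. Start the BR log (incremental) backup task on the primary cluster
tiup br log start --task-name=pitr-task \
    --pd "primary-pd:2379" --storage "s3://backup/log"

# 2. Take a full backup of the primary cluster
tiup br backup full --pd "primary-pd:2379" --storage "s3://backup/full"

# 3. PITR full + log restore on the standby cluster; on success BR prints
#    a "restore-to" TSO, which later restores (and the changefeed) start from
tiup br restore point --pd "standby-pd:2379" \
    --storage "s3://backup/log" \
    --full-backup-storage "s3://backup/full"

# 4. Create the changefeed on the primary, starting from the last restore-to TSO
tiup cdc cli changefeed create --server=http://primary-ticdc:8300 \
    --sink-uri="mysql://user:password@standby-tidb:4000/" \
    --start-ts=<last restore-to TSO> --config=changefeed.toml
```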

[Encountered Issues: Problem Phenomenon and Impact]

In particular, one error reports a primary key conflict on WIND_TB_OBJECT_6489_OPLOG. This table doesn’t even have a primary key, only a few ordinary indexes, and it is a partitioned table.

| username: 春风十里 | Original post link

I see that the error is “Duplicate entry”, which means there are duplicate values. Although it mentions “PRIMARY” afterwards, that may just be how the error surfaces. TiCDC best practices require tables to have a valid index, which essentially means a primary key or a unique index. Could your table contain duplicate data because it lacks a primary key? You can GROUP BY all columns to check whether any rows are duplicated.
TiCDC Overview | PingCAP Documentation Center
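
For illustration, a duplicate check of that kind might look as follows (col1…col3 stand in for the table’s actual columns):

```sql
-- Group by every column; any group with cnt > 1 is a fully duplicated row
SELECT col1, col2, col3, COUNT(*) AS cnt
FROM WIND_TB_OBJECT_6489_OPLOG
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1;
```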

| username: TIDB-Learner | Original post link

TiDB has an implicit primary key, right? Is it _tidb_rowid?

| username: dba远航 | Original post link

Could it be that CDC did not obtain the correct initial synchronization position from the primary database?

| username: porpoiselxj | Original post link

This table is a partitioned table used for log records, with no primary key and no unique index. The CDC changefeed has force-replicate enabled so that tables without a primary key or unique index can be replicated.
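
For context, this is a single line in the changefeed configuration file passed via `--config` (a sketch; other settings omitted). Note that the TiCDC documentation warns that with force-replicate enabled, consistency is not guaranteed for tables without a valid index, and retried writes can produce duplicate rows downstream:

```toml
# changefeed.toml
# Replicate tables that have neither a primary key nor a unique index.
force-replicate = true
```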

| username: porpoiselxj | Original post link

That seems possible.

| username: porpoiselxj | Original post link

That shouldn’t be the case. The start TSO is set from the restore-to TSO that is output when the downstream restore succeeds.

| username: luancheng | Original post link

What is the next `_tidb_rowid` in the downstream cluster after the restore?

You can check by executing the SQL: `SHOW TABLE test.t NEXT_ROW_ID;`
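
For a table without a primary key, the output shows the allocator for the implicit `_tidb_rowid`; the shape is roughly as follows (values illustrative):

```sql
SHOW TABLE test.t NEXT_ROW_ID;
+---------+------------+-------------+--------------------+-------------+
| DB_NAME | TABLE_NAME | COLUMN_NAME | NEXT_GLOBAL_ROW_ID | ID_TYPE     |
+---------+------------+-------------+--------------------+-------------+
| test    | t          | _tidb_rowid |          958148143 | _TIDB_ROWID |
+---------+------------+-------------+--------------------+-------------+
```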

| username: Soysauce520 | Original post link

There will be issues without a primary key: the _tidb_rowid values might be duplicated, so data accuracy cannot be guaranteed either.

| username: porpoiselxj | Original post link

The returned value is identical on the primary and standby databases: 958148143. That seems rather removed from the primary key error mentioned above.

| username: luancheng | Original post link

Can you see the erroneous DML statement in the CDC log? Is it an insert or an update?

| username: porpoiselxj | Original post link

I didn’t see it in the logs, but for this table, we only have insert operations, no updates or deletions.

| username: luancheng | Original post link

Could you briefly describe the schema and the corresponding recovery steps? We will try to reproduce it.

  1. All component versions are 7.1.3 by default. ✅
  2. Table structure (without primary key and unique index) and insert statements. ✅
  3. What were the steps used to set up the upstream and downstream clusters?
  4. The post describes multiple recoveries; what time points were chosen for each log restore?

| username: kkpeter | Original post link

The “PRIMARY” in the error message has nothing to do with the rowid, right?

| username: 双开门变频冰箱 | Original post link

rowid?

| username: porpoiselxj | Original post link

  1. After testing, this has nothing to do with whether the table is partitioned or has a primary key: after I filtered out the table that reported the error above, other non-partitioned tables with primary keys reported similar errors as well.
  2. There is nothing special about the table structure. Based on the test results, it feels like any table will hit the error.
  3. The upstream and downstream clusters are configured identically, including hardware and system parameters. Previously, on v6.1.1, the primary and standby environments ran for a while; later CDC reported errors and could not be kept up, so the standby database was stopped. This time, the standby database was wiped before the restore.
  4. After each successful PITR restore, a restore-to record gives the TSO of the recovery end point. That TSO is used as the start TSO for the next PITR run (sketched below).
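
For clarity, a sketch of that chaining with `tiup br restore point` (TSOs and paths are placeholders):

```shell
# First restore: full backup + logs; BR reports a restore-to TSO on success
tiup br restore point --pd "standby-pd:2379" \
    --storage "s3://backup/log" \
    --full-backup-storage "s3://backup/full"

# Each subsequent incremental restore resumes from the previous restore-to TSO
tiup br restore point --pd "standby-pd:2379" \
    --storage "s3://backup/log" \
    --start-ts <previous restore-to TSO>
```
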
| username: zhanggame1 | Original post link

Try tso+1

| username: porpoiselxj | Original post link

Still not working.

| username: asddongmen | Original post link

Hello, I’d like to confirm once more: when creating the changefeed, is the procedure to first stop PITR and then use the PITR restore-to time as the startTs of the changefeed?
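
That is, the procedure in question would look roughly like this (addresses and TSO are placeholders):

```shell
# PITR restores on the standby have stopped; start CDC from the last
# successful restore-to TSO
tiup cdc cli changefeed create --server=http://primary-ticdc:8300 \
    --sink-uri="mysql://user:password@standby-tidb:4000/" \
    --changefeed-id="standby-sync" \
    --start-ts=<last restore-to TSO> \
    --config=changefeed.toml
```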

| username: flow-PingCAP | Original post link

@porpoiselxj Hello! Could you put together a complete minimal reproducible example, including table creation, data import, and so on? If we can reproduce this issue internally, we may be able to identify the root cause more quickly.