How to perform full + incremental synchronization using TiCDC?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 使用ticdc如何进行全量+增量同步?

| username: TiDBer_5VobY5Th

What method is used for full synchronization? After completing the full synchronization, how do you determine the start-ts value when creating incremental synchronization? Experts, please advise! Thanks a lot! If the start-ts value is too small, there will be duplicate records; if it is too large, data will be missed.

| username: 啦啦啦啦啦 | Original post link

Whether using Dumpling or BR to export the full data, a timestamp (ts) will be recorded. You can use this to capture the incremental data.
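For the Dumpling case, the recorded ts is the `Pos` field in the `metadata` file written into the export directory. A minimal sketch of extracting it, assuming the metadata format used by recent Dumpling versions (the sample content below mirrors the format shown in the TiDB docs):

```python
import re

# Sample contents of Dumpling's `metadata` file. The `Pos` field under
# "SHOW MASTER STATUS" holds the snapshot TSO of the export.
metadata = """\
Started dump at: 2020-11-10 10:40:19
SHOW MASTER STATUS:
        Log: tidb-binlog
        Pos: 420747102018863124
        GTID:
Finished dump at: 2020-11-10 10:40:20
"""

def snapshot_tso(metadata_text: str) -> int:
    """Extract the snapshot TSO from Dumpling metadata text."""
    match = re.search(r"Pos:\s*(\d+)", metadata_text)
    if match is None:
        raise ValueError("no Pos field found in metadata")
    return int(match.group(1))

tso = snapshot_tso(metadata)
print(tso)  # 420747102018863124
```

This TSO (or TSO + 1, as discussed below) is then used as the changefeed's start-ts.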

| username: TiDBer_5VobY5Th | Original post link

The main goal is to implement full + incremental backup in code. If the data volume is large, Dumpling might not be convenient, and it also involves creating the table schema in the target database.

| username: zhaokede | Original post link

Use another tool for the full load; TiCDC is mainly for the incremental part.

| username: 像风一样的男子 | Original post link

For large amounts of data, BR is generally used for full backups. The BackupTS at the end of the backup log can be used as the start-ts for your CDC incremental backup.
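The exact log format varies by BR version, but the end-of-backup summary line includes a `BackupTS=` field. A hedged sketch of pulling it out (the sample line below is illustrative, not verbatim BR output):

```python
import re

# Illustrative BR summary line; real output differs across versions but
# includes a BackupTS=<tso> field in the final summary.
log_line = ('[2024/01/01 00:00:00.000 +08:00] [INFO] '
            '["Full backup success summary"] '
            '[BackupTS=431773923945709569] [total-take=1m2s]')

def backup_ts(line: str) -> int:
    """Extract the BackupTS value from a BR summary log line."""
    m = re.search(r"BackupTS=(\d+)", line)
    if m is None:
        raise ValueError("BackupTS not found in line")
    return int(m.group(1))

start_ts = backup_ts(log_line) + 1  # start the changefeed just after the backup
print(start_ts)  # 431773923945709570
```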

| username: TiDBer_QYr0vohO | Original post link

First, perform a full backup with BR, then use CDC for incremental backup.

| username: DBAER | Original post link

That is what a script is for: turning the manual process into automation. The specific steps are the same; just add various status checks.

| username: porpoiselxj | Original post link

Correct. The BackupTS of the full backup is used as the start-ts for CDC.

| username: Jellybean | Original post link

  1. Question: What method should be used for full synchronization?
    Answer:
  • If the base data is not large, say at the GB level, you can use the Dumpling concurrent export tool to export it, then use Lightning's physical or logical import mode to write it into the downstream cluster.
  • If the base data is at the TB level, it is recommended to use the BR tool for a physical backup and restore it downstream with BR as well. S3 is needed here as external storage, which should be applied for and prepared in advance.
  2. Question: How to determine the start-ts value for incremental synchronization?
    Answer:
  • If you back up with the BR tool, its backup metadata or output log contains a BackupTS, which represents the final moment of the physical backup. When setting up the synchronization task, you can use this timestamp plus 1 as the starting point.
  • If you export data with the Dumpling tool, the export directory contains a metadata file that records the TSO of the exported data. You can use that TSO plus 1 as the start-ts value.
  3. Question: If the start-ts value is too small there will be duplicate records, and if it is too large data will be missed.
    Answer: Actually, TiCDC data synchronization is reentrant. As long as your table has a unique key (a primary key or an ordinary unique key), TiCDC overwrites existing rows with the latest data when it encounters the same row again. TiCDC controls this behavior through safe_mode: for example, on encountering the same row it converts INSERT or UPDATE into REPLACE statements to make the operation reentrant.
    Therefore, a start-ts that is too small will not cause duplicate records (provided the table has a unique key), but a start-ts that is too large will indeed lose data.

So, as long as the start-ts value is less than or equal to the snapshot timestamp of your exported data, you can normally set up the synchronization flow.
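A TiDB TSO encodes a physical timestamp in its high bits (milliseconds since the epoch, shifted left by 18 logical bits), so you can sanity-check a candidate start-ts by decoding it back to wall-clock time before creating the changefeed. A small sketch:

```python
from datetime import datetime, timezone

LOGICAL_BITS = 18  # the low 18 bits of a TSO are the logical counter

def tso_to_datetime(tso: int) -> datetime:
    """Decode a TiDB TSO into the physical time it was issued (UTC)."""
    physical_ms = tso >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)

def datetime_to_tso(dt: datetime) -> int:
    """Build a TSO (with a zero logical part) from a wall-clock time."""
    return int(dt.timestamp() * 1000) << LOGICAL_BITS

# Example TSO taken from the Dumpling metadata sample in the TiDB docs.
tso = 420747102018863124
print(tso_to_datetime(tso))
```

If the decoded time does not match when your full export or backup actually finished, the start-ts you are about to use is probably wrong.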

| username: 小龙虾爱大龙虾 | Original post link

Just pick a smaller start_ts. TiCDC requires tables to have a primary key, and its operations are idempotent.
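The idempotence argument can be illustrated in miniature: with REPLACE semantics keyed on the primary key (which is what TiCDC's safe mode rewrites INSERT/UPDATE into), applying the same batch of changes twice leaves the downstream table in the same state, so replaying rows from a too-small start-ts is harmless. A toy sketch, with a dict standing in for the downstream table:

```python
def replace_into(table: dict, pk, row: dict) -> None:
    """REPLACE semantics: insert the row, or overwrite the existing row
    that has the same primary key."""
    table[pk] = row

downstream = {}
changes = [(1, {"id": 1, "v": "a"}), (2, {"id": 2, "v": "b"})]

# Apply the same change batch twice, simulating replay from an older start-ts.
for _ in range(2):
    for pk, row in changes:
        replace_into(downstream, pk, row)

print(downstream)  # {1: {'id': 1, 'v': 'a'}, 2: {'id': 2, 'v': 'b'}}
```

Without a unique key to anchor the REPLACE, the second pass would instead append duplicate rows, which is why the unique-key precondition matters.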

| username: Jack-li | Original post link

Your answer is very detailed, expert.

| username: yytest | Original post link

Full Data Synchronization

  1. Using DM for Full Data Synchronization:
    • DM supports full data synchronization from databases like MySQL, MariaDB, Amazon RDS, Aurora, etc., to TiDB.
    • By using DM’s full-mode, you can complete the full data migration.
    • After the full synchronization is completed, DM will automatically switch to incremental synchronization mode.
  2. Using TiCDC for Full Data Synchronization:
    • TiCDC supports synchronizing data from one TiDB cluster to another TiDB cluster or compatible downstream systems.
    • TiCDC does not directly support full synchronization, but you can pause TiDB’s write operations, use TiDB’s BR (Backup & Restore) tool for a full backup, restore the backup data to the target TiDB cluster, and then start TiCDC to synchronize incremental data.

Determining the Start-ts for Incremental Synchronization

The start-ts (start timestamp) for incremental synchronization is a crucial parameter that determines the starting point of incremental synchronization. Choosing the correct start-ts can avoid data duplication or omission.

  1. Using DM:
    • After completing the full synchronization, the DM tool will automatically record a checkpoint, and the timestamp of this checkpoint is the start-ts for incremental synchronization.
    • If you need to manually set the start-ts, you can specify it in the DM task configuration file.
  2. Using TiCDC:
    • When using TiCDC, the start-ts of an existing task can be viewed with the cdc cli tool.
    • The changefeed query command of cdc cli shows information about existing synchronization tasks, including start-ts.
    • To set the start-ts manually, specify the --start-ts parameter when creating a changefeed.

Avoiding Improper Start-ts Settings

  • Avoid setting start-ts too early: This may lead to duplicate data synchronization. Ensure that the start-ts is the timestamp after the full synchronization is completed.
  • Avoid setting start-ts too late: This may lead to data omission. Ensure that the start-ts is not later than the timestamp of the first incremental data after the full synchronization is completed.

Safety Measures

  • Before setting the start-ts, ensure that the full synchronization has been completed and that there are no new write operations.
  • Before starting incremental synchronization, you can back up the target TiDB cluster as a precaution in case a rollback is needed.
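Putting the pieces together, the changefeed is created with `cdc cli changefeed create` and an explicit `--start-ts`. A hedged sketch that assembles the command as an argument list (the PD address, sink URI, and changefeed name are placeholders, and flag spellings should be checked against your TiCDC version):

```python
def changefeed_create_cmd(pd_addr: str, sink_uri: str, start_ts: int,
                          changefeed_id: str) -> list[str]:
    """Assemble a `cdc cli changefeed create` invocation as an argv list.
    Flag names follow the TiCDC docs; verify them against your version."""
    return [
        "cdc", "cli", "changefeed", "create",
        f"--pd={pd_addr}",
        f"--sink-uri={sink_uri}",
        f"--start-ts={start_ts}",
        f"--changefeed-id={changefeed_id}",
    ]

backup_ts = 431773923945709569            # BackupTS reported by BR (example value)
cmd = changefeed_create_cmd(
    "http://127.0.0.1:2379",              # placeholder PD endpoint
    "mysql://root:@127.0.0.1:3306/",      # placeholder downstream sink
    backup_ts + 1,                        # start just after the full backup
    "full-plus-incremental",              # hypothetical changefeed name
)
print(" ".join(cmd))
```

Building the command as a list rather than one string keeps it safe to pass to a process runner without shell quoting issues.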
| username: xfworld | Original post link

After version 8.1, Debezium is supported, so your idea can be implemented…

However, the downstream system or service will need to integrate with Debezium’s event handling.

| username: TiDBer_5VobY5Th | Original post link

I did not see any information related to safe_mode in the ticdc API.

| username: TiDBer_5VobY5Th | Original post link

TiCDC is not recommended for this use:

Why does TiCDC synchronization experience lag or even get stuck after using TiDB Lightning Physical Import Mode and BR to restore data upstream?

Currently, TiCDC has not fully adapted to TiDB Lightning Physical Import Mode and BR. Please avoid using TiDB Lightning Physical Import Mode and BR on tables that are being synchronized by TiCDC. Otherwise, unknown errors may occur, such as TiCDC synchronization getting stuck, significant increase in synchronization delay, or data loss during synchronization.

| username: TiDBer_5VobY5Th | Original post link

Flink-CDC claims to support full replication + incremental replication for TiDB. How does its full + incremental replication work? It seems to use TiDB’s snapshot feature. Does TiDB support snapshots?

| username: xfworld | Original post link

Unreliable…

If you need both full and incremental data, you can try TiDB version 8.1.0 and see how to enable Debezium support in TiCDC.

Debezium supports both full and incremental data by default.