No way to fully synchronize data to Kafka

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 没有可以全量同步数据到kafka

| username: wluckdog

[TiDB Usage Environment] Is there a tool that can fully synchronize data to Kafka?

| username: ffeenn | Original post link

Use TiCDC. Refer to: Column - Best Practices for Synchronizing Data from TiDB to Kafka | TiDB Community

| username: tidb菜鸟一只 | Original post link

No. What does your application do, and why do you need to synchronize the full data to Kafka? How many messages would the downstream have to consume…

| username: ealam_小羽 | Original post link

You can refer to the previous reply and use TiCDC. I’m not sure about your specific scenario, but if this is for initializing a business application, it is better for the application itself to pull the data from TiDB and process it. That way, if anything needs fixing later, such as repairing data or re-running the load, you can handle it directly in the application. The TiCDC-to-Kafka channel, on the other hand, may be shared by several consumers, and if you ever need to re-push data through it, you could affect the other business programs using it.

| username: wluckdog | Original post link

TiCDC can only perform incremental synchronization to Kafka, which is not practical for synchronizing a 1 TB table to other systems. How can data consistency be guaranteed at the handover between the full and the incremental synchronization?

| username: dba-kit | Original post link

You can use the TSO to keep the full and incremental data consistent, since TiCDC supports specifying a start-ts parameter. However, very few components can write the full data to Kafka as of a specific TiDB TSO, so you will have to build that part yourself.
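
A minimal sketch of that idea, assuming the mysql client, the tiup-managed cdc CLI, and Kafka's console producer are available; the host names, the db1.t1 table, and topic1 are placeholders, and flag names may vary slightly between TiCDC versions:

```bash
# Record the current TSO (in TiDB, the Position column of SHOW MASTER STATUS is a TSO).
TSO=$(mysql -h tidb-host -P 4000 -u root -N -e "SHOW MASTER STATUS" | awk '{print $2}')

# "Build it yourself": read the table as it existed at that TSO and push each row to
# Kafka. kafka-console-producer.sh is only a stand-in for a real producer that would
# assemble proper messages.
mysql -h tidb-host -P 4000 -u root -N -e \
  "SET @@tidb_snapshot=${TSO}; SELECT * FROM db1.t1;" \
  | kafka-console-producer.sh --bootstrap-server kafka-host:9092 --topic topic1

# Start incremental replication from exactly the same TSO, so the full and incremental
# streams join up without gaps or duplicates.
tiup cdc cli changefeed create \
  --pd=http://pd-host:2379 \
  --sink-uri="kafka://kafka-host:9092/topic1?protocol=canal-json" \
  --start-ts=${TSO}
```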

| username: dba-kit | Original post link

I have a similar scenario here: I use Dumpling to export TiDB data and import it into MySQL. Then, when creating the changefeed, I use the TSO recorded in Dumpling's metadata file as the start-ts, so incremental replication starts exactly where the export left off.
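
For reference, a rough sketch of that flow under the same assumptions as above (placeholder hosts and paths; Dumpling records the export snapshot's TSO as the Pos field of its metadata file):

```bash
# 1. Full export with Dumpling; the consistent-snapshot TSO is written to /data/dump/metadata.
tiup dumpling -h tidb-host -P 4000 -u root -B db1 -o /data/dump --filetype sql

# 2. Read the snapshot TSO (the "Pos:" line of the metadata file).
START_TS=$(grep -m1 "Pos" /data/dump/metadata | awk '{print $2}')

# 3. Import /data/dump into MySQL with your preferred loader, then start the
#    incremental changefeed from exactly that TSO.
tiup cdc cli changefeed create \
  --pd=http://pd-host:2379 \
  --sink-uri="mysql://user:password@mysql-host:3306/" \
  --start-ts=${START_TS}
```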

| username: wluckdog | Original post link

The target is the big data side, for things like reporting, for example StarRocks. The reason for wanting to go through Kafka is that the data needs to be assembled and then batch-imported into StarRocks; with a plain dump-and-import, that assembly step is not possible.

| username: dba-kit | Original post link

In this scenario, TiCDC alone cannot solve the problem. Here is a method I can think of:

  1. Increase tidb_gc_life_time.
  2. Create a new table t1_new: create table t1_new like t1;.
  3. Use TiCDC to create a changefeed, so that changes in the t1_new table are transmitted to topic1.
  4. Pick a point in time (its TSO being tso1) and use set tidb_snapshot=<tso1> to insert the snapshot of the t1 table at that time into t1_new. This way, the full historical data is transmitted to topic1.
  5. Create another changefeed, specifying --start-ts=<tso1>, to transmit incremental data changes from the t1 table to topic1.

Although this approach is somewhat roundabout, it achieves the desired effect. A rough sketch of the commands is below.
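
A shell sketch of these steps, under the same assumptions as the earlier snippets (placeholder hosts, database db1, the tiup-managed cdc CLI; flag and variable details may vary by version). Step 4's snapshot insert is only outlined in a comment, because some TiDB versions reject write statements while tidb_snapshot is set, in which case exporting the snapshot with dumpling --snapshot and replaying it into t1_new achieves the same effect:

```bash
# Step 1: keep old MVCC versions around long enough for the snapshot read.
mysql -h tidb-host -P 4000 -u root -e "SET GLOBAL tidb_gc_life_time = '24h';"

# Step 2: create the staging table.
mysql -h tidb-host -P 4000 -u root -e "CREATE TABLE db1.t1_new LIKE db1.t1;"

# Step 3: changefeed that replicates only t1_new to topic1
# (table filtering goes through a changefeed config file).
cat > cf-t1-new.toml <<'EOF'
[filter]
rules = ['db1.t1_new']
EOF
tiup cdc cli changefeed create \
  --pd=http://pd-host:2379 \
  --sink-uri="kafka://kafka-host:9092/topic1?protocol=canal-json" \
  --config=cf-t1-new.toml

# Step 4: pick tso1, then insert the snapshot of t1 at tso1 into t1_new so the
# changefeed above pushes the full history to topic1 (per the post: set
# tidb_snapshot=<tso1> followed by INSERT ... SELECT, or a dumpling --snapshot export
# replayed into t1_new if your version disallows writes under tidb_snapshot).
TSO1=$(mysql -h tidb-host -P 4000 -u root -N -e "SHOW MASTER STATUS" | awk '{print $2}')

# Step 5: second changefeed, carrying incremental changes of t1 from tso1 onward.
cat > cf-t1.toml <<'EOF'
[filter]
rules = ['db1.t1']
EOF
tiup cdc cli changefeed create \
  --pd=http://pd-host:2379 \
  --sink-uri="kafka://kafka-host:9092/topic1?protocol=canal-json" \
  --config=cf-t1.toml \
  --start-ts=${TSO1}
```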