Some questions about TiCDC:

  1. TiCDC is an incremental data synchronization tool that captures changes from the TiKV change logs. The deployment configuration file has a data_dir setting, but after starting TiCDC I see that no files are generated in that directory. What is this data directory used for?
  2. TiCDC works through changefeeds, each of which specifies a downstream synchronization target such as a MySQL-compatible database, Kafka, or storage like S3/NFS. When the downstream is a database or Kafka, where does the source data for synchronization come from? Is it obtained directly from the TiDB database, or read from local files? If it is read from local files, where are they stored?
Okay, thank you.

  1. It is used for sorting, and this directory will only have data after the changefeed is running and data synchronization has occurred.
  2. The source of the synchronized data is the TiDB database, specifically the change log in TiKV.
Understood, thank you. I have one more question. Regarding the first question, do the files in the data directory need to be manually cleaned? Also, is the data parsed by changefeed also from this directory?

There is no need for manual cleanup; the files are removed automatically once the data has been synchronized to the downstream.

I'm here to learn as well.

To ensure high availability of CDC, it is generally recommended to deploy more than one CDC instance. The multiple captures elect one owner. How is this owner elected? I could not find an explanation in the official documentation.

It is used for temporary sorting and also acts as a buffer when the downstream sink is not fast enough.
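
To make the "temporary sorting" part concrete, here is a minimal conceptual sketch in Python. It is not TiCDC's actual code: the names ChangeEvent and Sorter are made up for illustration, and the idea of releasing events only up to a resolved_ts watermark is my assumption about how such a sorter behaves, not something stated in this thread.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ChangeEvent:
    """Hypothetical change event; only commit_ts takes part in ordering."""
    commit_ts: int
    key: str = field(compare=False)
    value: str = field(compare=False)


class Sorter:
    """Hypothetical sorter: buffers out-of-order events, emits them in commit order."""

    def __init__(self):
        self._heap = []  # min-heap ordered by commit_ts

    def add(self, event):
        heapq.heappush(self._heap, event)

    def drain(self, resolved_ts):
        """Return every buffered event with commit_ts <= resolved_ts, oldest first."""
        out = []
        while self._heap and self._heap[0].commit_ts <= resolved_ts:
            out.append(heapq.heappop(self._heap))
        return out


if __name__ == "__main__":
    sorter = Sorter()
    # Events may arrive out of order, for example from different regions.
    for ts, key in [(30, "c"), (10, "a"), (20, "b")]:
        sorter.add(ChangeEvent(ts, key, "v"))
    print([e.commit_ts for e in sorter.drain(resolved_ts=20)])
```

Events that are not yet safe to emit stay in the buffer, which is why some temporary storage is needed at all.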

TiCDC is a TiDB data synchronization tool, and its upstream is always TiDB. It captures changes from TiKV to obtain incremental data.

Thank you very much, I learned a lot.

Okay, so regarding the first question: if the downstream consumes fast enough, no files are generated in the CDC data directory. In other words, the produced data is consumed immediately, and data is cached in the data directory only when the downstream falls behind.
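
The behavior summarized in this thread (nothing on disk while the downstream keeps up, files appear when it lags, files are removed once consumed) can be sketched roughly as below. This is only a toy model under those assumptions, not how TiCDC is implemented; SpillBuffer, mem_limit, and the spill file naming are hypothetical.

```python
import os
import pickle
import tempfile
from collections import deque


class SpillBuffer:
    """Hypothetical FIFO buffer that spills to files in data_dir when memory is full."""

    def __init__(self, data_dir, mem_limit=2):
        self.data_dir = data_dir
        self.mem_limit = mem_limit  # max events held in memory (assumed knob)
        self._mem = deque()         # older events, kept in memory
        self._files = deque()       # newer events, one file per event
        self._seq = 0

    def put(self, event):
        # Once anything is on disk, later events must also go to disk to keep FIFO order.
        if not self._files and len(self._mem) < self.mem_limit:
            self._mem.append(event)
            return
        path = os.path.join(self.data_dir, f"spill-{self._seq:08d}.bin")
        self._seq += 1
        with open(path, "wb") as f:
            pickle.dump(event, f)
        self._files.append(path)

    def get(self):
        """Return the oldest event, or None if empty; delete spill files once consumed."""
        if self._mem:
            return self._mem.popleft()
        if self._files:
            path = self._files.popleft()
            with open(path, "rb") as f:
                event = pickle.load(f)
            os.remove(path)  # automatic cleanup, no manual action needed
            return event
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        buf = SpillBuffer(data_dir, mem_limit=2)
        for i in range(3):  # downstream is not consuming, so the third event spills
            buf.put(i)
        print("files while lagging:", os.listdir(data_dir))
        while buf.get() is not None:
            pass
        print("files after catching up:", os.listdir(data_dir))
```

A fast consumer that calls get() after every put() never causes a file to be written, which matches the empty data directory observed in the original question.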