Questions About the TiCDC Configuration File

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于ticdc配置文件的疑问

| username: 胡杨树旁

Some questions about TiCDC:

  1. TiCDC is an incremental data replication tool that captures changes from the TiKV change log. The deployment configuration file has a data_dir setting, but after starting TiCDC I see no files generated in that directory. What is this data directory used for?
  2. TiCDC works through changefeeds, each of which specifies a downstream target such as a MySQL-compatible database, Kafka, or storage such as S3/NFS. When the downstream is a database or Kafka, where does the data being replicated come from? Is it fetched directly from the TiDB cluster, or read from local files? If it is read from local files, where are they stored?

| username: 这里介绍不了我 | Original post link

Check this out, it should be helpful to you: [Resource Compilation] The Most Comprehensive Resources for the TiDB-TiCDC Source Code Interpretation Series (【资源汇总】TiDB-TiCDC 源码解读系列最全资源!!!), Billmay’s Column, TiDB Community

| username: 胡杨树旁 | Original post link

Okay, thank you.

| username: Daniel-W | Original post link

  1. It is used for sorting. The directory only contains data once a changefeed is running and data is actually being replicated.
  2. The source of the replicated data is the TiDB cluster, specifically the change log in TiKV.

| username: 胡杨树旁 | Original post link

Understood, thank you. I have one more question about the first point: do the files in the data directory need to be cleaned up manually? Also, does the changefeed read the data it processes from this directory?

| username: Daniel-W | Original post link

No manual cleanup is needed; the files are removed automatically once the data has been replicated to the downstream.

| username: Daniel-W | Original post link

TiCDC Server Configuration | PingCAP Documentation Center

| username: TiDBer_q2eTrp5h | Original post link

I’m here to learn as well.

| username: TiDBer_q2eTrp5h | Original post link

A First Look at TiCDC 6.0 in TiDB (TiDB 之 TiCDC6.0 初体验) - Zhihu (zhihu.com)

| username: 胡杨树旁 | Original post link

To ensure high availability of TiCDC, it is generally recommended to deploy more than one TiCDC instance. The captures then elect one owner among themselves. How is this owner elected? I could not find an explanation in the official documentation.

| username: Jasper | Original post link

This article explains the specific principles of the election process.
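
The linked article is not reproduced in this thread. As a rough illustration only, the toy model below assumes the owner is chosen through an etcd-style election: each capture registers once, the oldest live registration is the owner, and when that capture exits or its lease expires the next oldest takes over. The class and method names are hypothetical and this is not TiCDC code.

```python
class ToyElection:
    """In-memory stand-in for an etcd-style election (hypothetical sketch)."""

    def __init__(self):
        self.revision = 0
        self.candidates = {}  # capture id -> revision at which it registered

    def campaign(self, capture_id):
        # A capture registers once; earlier registration means a lower revision.
        if capture_id not in self.candidates:
            self.revision += 1
            self.candidates[capture_id] = self.revision

    def resign_or_expire(self, capture_id):
        # Models a capture shutting down or its lease timing out.
        self.candidates.pop(capture_id, None)

    def owner(self):
        # The live candidate with the lowest registration revision is the owner.
        if not self.candidates:
            return None
        return min(self.candidates, key=self.candidates.get)


election = ToyElection()
for capture in ("capture-a", "capture-b", "capture-c"):
    election.campaign(capture)
print(election.owner())                 # capture-a
election.resign_or_expire("capture-a")
print(election.owner())                 # capture-b
```

In this model a capture that rejoins after losing ownership goes to the back of the queue, so ownership does not flap back and forth.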

| username: 小龙虾爱大龙虾 | Original post link

It is used for temporary sorting and also acts as a buffer when the downstream sink is not fast enough.
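
To make "sorting" concrete, here is a minimal sketch. It assumes each change event carries a commit timestamp and that events may only be emitted once a "resolved" watermark guarantees nothing older can still arrive. The names ToySorter, add, and drain are made up for illustration, and everything is kept in memory; this is not how TiCDC implements its sorter.

```python
import heapq


class ToySorter:
    """Orders change events by commit timestamp (illustrative only)."""

    def __init__(self):
        self.heap = []
        self.counter = 0  # tie-breaker so equal timestamps keep arrival order

    def add(self, commit_ts, row):
        heapq.heappush(self.heap, (commit_ts, self.counter, row))
        self.counter += 1

    def drain(self, resolved_ts):
        # Emit every buffered event whose commit_ts is at or below the watermark.
        out = []
        while self.heap and self.heap[0][0] <= resolved_ts:
            commit_ts, _, row = heapq.heappop(self.heap)
            out.append((commit_ts, row))
        return out


sorter = ToySorter()
sorter.add(30, "c")
sorter.add(10, "a")
sorter.add(20, "b")
print(sorter.drain(20))  # [(10, 'a'), (20, 'b')]
print(sorter.drain(40))  # [(30, 'c')]
```

Events that arrive out of order have to be held somewhere until the watermark passes them, which is why a sorter needs temporary storage at all.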

TiCDC is a data replication tool for TiDB, so its upstream is always TiDB. It captures changes from TiKV to obtain the incremental data.

| username: 胡杨树旁 | Original post link

Thank you very much, I learned a lot.

| username: 胡杨树旁 | Original post link

Okay. So on the first question: if the downstream consumes fast enough, no files are generated in the TiCDC data directory, because the data is consumed as soon as it is produced. Only when the downstream lags is some data cached in the data directory.
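
The behaviour summarized above can be mimicked with a toy buffer. This sketch assumes a fixed in-memory limit, one spill file per event, and deletion of each file as soon as it is consumed; the SpillBuffer class, the mem_limit parameter, and the file naming are all hypothetical, and TiCDC's real storage layout is different.

```python
import os
import pickle
import tempfile
from collections import deque


class SpillBuffer:
    """Toy FIFO buffer that spills to data_dir only when memory is full."""

    def __init__(self, data_dir, mem_limit):
        self.data_dir = data_dir
        self.mem_limit = mem_limit
        self.mem = deque()    # events held in memory (always the oldest ones)
        self.files = deque()  # paths of spilled events, oldest first
        self.seq = 0

    def produce(self, event):
        # Once anything is on disk, later events also go to disk to keep order.
        if self.files or len(self.mem) >= self.mem_limit:
            path = os.path.join(self.data_dir, f"{self.seq:08d}.spill")
            with open(path, "wb") as f:
                pickle.dump(event, f)
            self.files.append(path)
        else:
            self.mem.append(event)
        self.seq += 1

    def consume(self):
        if self.mem:
            return self.mem.popleft()
        if self.files:
            path = self.files.popleft()
            with open(path, "rb") as f:
                event = pickle.load(f)
            os.remove(path)  # cleaned up automatically after consumption
            return event
        return None


with tempfile.TemporaryDirectory() as data_dir:
    buf = SpillBuffer(data_dir, mem_limit=2)
    for i in range(3):                # fast downstream: consume right away
        buf.produce(i)
        buf.consume()
    print(len(os.listdir(data_dir)))  # 0
    for i in range(5):                # slow downstream: events pile up
        buf.produce(i)
    print(len(os.listdir(data_dir)))  # 3
    drained = [buf.consume() for _ in range(5)]
    print(drained, len(os.listdir(data_dir)))  # [0, 1, 2, 3, 4] 0
```

With a fast consumer the directory stays empty, and with a slow one files appear and then disappear again, which matches what was observed in this thread.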