Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TICDC日志量统计 (TiCDC log volume statistics)
Version: v7.1.0
Question: When enabling CDC to synchronize to downstream MySQL, I want to measure the size of the change logs (KV change logs) generated during this process.
Sub-question 1: If CDC is started but no changefeed is created, where are the change logs (incremental synchronization data)? Are they in memory? I couldn’t find the corresponding persistent files in cdc/data.
Sub-question 2: CDC is started and a changefeed is created. The operations performed in TiDB can indeed be synchronized to downstream MySQL. At this point, checking cdc/data, I can find the corresponding data files. I found that there are 8 folders in the cdc/data/tmp/sorter directory: 0000~0007. What do these directories represent? Also, I noticed that only 0, 2, 4, and 6 are increasing in size, while the other 4 remain unchanged. Why is this? Why is there only a sorter directory and no puller, mounter, etc.?
Sub-question 3: If I want to measure the size of the change logs generated over a certain period, I found a metric on Prometheus: ticdc_sorter_db_write_bytes_sum. Can this value represent the amount of change logs currently generated?
Sub-question 1: If only CDC is started and no changefeed is created, there is no replication task, so TiCDC does not pull change logs from TiKV at all. Nothing is buffered in memory or written to cdc/data, which is why you could not find any persistent files there. Change data only starts flowing into CDC once a changefeed is created.
Sub-question 2: After CDC is started and a changefeed is created, CDC writes the change data it pulls from TiKV to temporary files under the sorter directory, where the events are sorted by commit timestamp before being passed on towards the downstream. The 0000~0007 folders are separate sorter storage instances, and the change data is distributed across them. As far as I know, the number of folders is set by the debug.db.count option in the TiCDC server configuration file, which defaults to 8. The files in these folders are temporary: they grow as change data arrives and are cleaned up once the data has been replicated to the downstream. The storage engine may also merge small files into larger ones in the background, but the files themselves are never sent to the downstream. Only a sorter directory exists because the other components of CDC (such as the puller and mounter) process data in memory and do not need to store it locally.
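To check which of the sorter folders are actually growing, you can sum the file sizes per folder and compare two runs. The following is only a minimal sketch: the path cdc/data/tmp/sorter is taken from the question, and the function name dir_sizes is made up for illustration.

```python
import os


def dir_sizes(sorter_root):
    """Return {subdirectory name: total bytes of the files beneath it}."""
    sizes = {}
    for entry in sorted(os.listdir(sorter_root)):
        path = os.path.join(sorter_root, entry)
        if not os.path.isdir(path):
            continue  # ignore stray files at the top level
        total = 0
        for dirpath, _dirnames, filenames in os.walk(path):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    # Temporary files can disappear while we are scanning.
                    pass
        sizes[entry] = total
    return sizes


if __name__ == "__main__":
    # Path from the question; adjust it to your own data directory.
    root = "cdc/data/tmp/sorter"
    if os.path.isdir(root):
        for name, size in dir_sizes(root).items():
            print(f"{name}\t{size} bytes")
```

Running it before and after a batch of writes in TiDB shows the per-folder growth. Keep in mind that this is disk usage at one moment, not the total volume of change logs, because the temporary files are cleaned up over time.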
Sub-question 3: The metric ticdc_sorter_db_write_bytes_sum is the cumulative number of bytes written by the sorter to its on-disk storage. To calculate the size of the change logs written over a certain period, combine it with a PromQL range function. For example, to query the amount of data written in the past hour, you can use the following query:
sum(increase(ticdc_sorter_db_write_bytes_sum[1h]))
This query returns the total number of bytes written in the past hour. Note that rate() over the same window would instead return the average number of bytes written per second.
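If you want to collect this number from a script instead of the Prometheus UI, the standard Prometheus HTTP API endpoint /api/v1/query can evaluate the expression. Below is a minimal sketch using increase(), which gives the total growth over the window (rate() would give a per-second average). The helper names and the Prometheus address are assumptions for illustration; use the address of your own deployment.

```python
import json
import urllib.parse
import urllib.request


def build_increase_query(metric, window):
    """Build a PromQL expression for the total growth of `metric` over `window`."""
    return f"sum(increase({metric}[{window}]))"


def parse_instant_vector(payload):
    """Extract the single float value from a Prometheus instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if not result:
        return 0.0  # no samples in the window
    # Each sample has the form [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def query_written_bytes(prometheus_url, metric, window):
    """Ask Prometheus how many bytes `metric` grew by over `window`."""
    query = build_increase_query(metric, window)
    url = f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, query_written_bytes("http://127.0.0.1:9090", "ticdc_sorter_db_write_bytes_sum", "1h") would return the bytes written by the sorter in the past hour, assuming Prometheus listens on that address.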
Thank you very much, the answer is excellent and has resolved many of my confusions. However, I still have some minor doubts. The metric ticdc_sorter_db_write_bytes_sum calculates the change logs from TiKV to TiCDC, but it does not represent the data ultimately sent by CDC to downstream systems like MySQL. If I want to count the logs sent by CDC to downstream MySQL, how should I do it?
(P.S.: It might be related to the output data protocol. I deployed it using tiup playground --ticdc 1, and the changefeed was created using `cdc cli changefeed create --server=http://ip:8300 --sink-uri="mysql://root:xxxxxx@127.0.0.1:3306" --changefeed-id="simple-replication-task"`. Without specifying the protocol, which output protocol is used by default?)