Frequent message loss from TiCDC to Kafka: is there a way to record data on the TiCDC source side to check whether incremental replication dropped anything?

translator_bot · June 22, 2024, 7:18am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc到kafka那边经常丢消息，有没有什么办法，从ticdc源头侧做一个数据记录，判断增量数据的传输过程中是否出现问题。

| username: vcdog

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.0
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Phenomenon and Impact] Messages are frequently lost between TiCDC and Kafka. Is there a way to record data on the TiCDC source side, so we can tell whether something went wrong while the incremental data was being transmitted?

We recently ran into a rather awkward situation:

The GC time set for the main TiDB cluster is 72 hours.
The message retention time set for the Kafka cluster is 24 hours.
About a week after the last full + incremental sync, the R&D team reported missing data.
By then, the TSO corresponding to the time of the loss was already older than the GC time, so the incremental replication could not be redone from that point.
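To make the timing problem concrete, here is a minimal sketch of the arithmetic. It relies on the TSO layout (physical time in milliseconds in the high bits, an 18-bit logical counter in the low bits). The dates and the helper names `datetime_to_tso` and `can_replay_from` are made up for illustration, and comparing against "now minus GC time" only approximates the real GC safe point.

```python
from datetime import datetime, timedelta, timezone

# A TSO stores the physical time (ms since the Unix epoch) in its high bits
# and an 18-bit logical counter in its low bits.
LOGICAL_BITS = 18


def datetime_to_tso(moment: datetime) -> int:
    """Return the TSO for `moment` with the logical counter set to 0."""
    physical_ms = int(moment.timestamp() * 1000)
    return physical_ms << LOGICAL_BITS


def can_replay_from(loss_time: datetime, now: datetime, gc_life_time: timedelta) -> bool:
    """Approximate check: is `loss_time` still newer than now - gc_life_time?"""
    return loss_time > now - gc_life_time


# Hypothetical values matching the scenario: loss noticed a week later, GC time 72h.
now = datetime(2024, 6, 22, 7, 0, tzinfo=timezone.utc)
loss_time = now - timedelta(days=7)
gc_life_time = timedelta(hours=72)

print("start TSO for the replay:", datetime_to_tso(loss_time))
print("still inside the GC window:", can_replay_from(loss_time, now, gc_life_time))
```

With a 72-hour GC time, any loss reported more than three days after the fact falls outside the window, and the 24-hour Kafka retention means the messages themselves are gone even sooner.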

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: tidb狂热爱好者

Try CloudCanal.

| username: redgame

Please provide the relevant logs.

| username: ljluestc

Verify the TiCDC Configuration: Check the changefeed configuration carefully. Make sure the TiCDC instance is running and connected to the TiDB cluster, and confirm that the databases and tables you expect are actually covered by the changefeed.

Monitor TiCDC Metrics: Watch the TiCDC metrics for sink errors, replication lag (how far the changefeed checkpoint is behind the current time), and other abnormal behavior. This helps you spot bottlenecks or errors that could be behind the data loss. Prometheus and Grafana are the usual tools for this.
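As a rough illustration of the lag check, the sketch below derives the lag from a changefeed's checkpoint TSO (for example, the value shown by `cdc cli changefeed query`). The function names and the 600-second threshold are assumptions for this example, not TiCDC defaults.

```python
import time

# The physical time (ms since the Unix epoch) sits above the 18 logical bits of a TSO.
LOGICAL_BITS = 18


def tso_physical_ms(tso: int) -> int:
    """Extract the physical millisecond timestamp from a TSO."""
    return tso >> LOGICAL_BITS


def checkpoint_lag_seconds(checkpoint_tso: int, now_ms: int) -> float:
    """Seconds between the checkpoint's physical time and `now_ms`."""
    return (now_ms - tso_physical_ms(checkpoint_tso)) / 1000.0


def is_lagging(checkpoint_tso: int, now_ms: int, threshold_seconds: float = 600.0) -> bool:
    """True if the checkpoint is more than `threshold_seconds` behind `now_ms`."""
    return checkpoint_lag_seconds(checkpoint_tso, now_ms) > threshold_seconds


# Hypothetical checkpoint that is 15 minutes behind the current wall clock.
now_ms = int(time.time() * 1000)
checkpoint_tso = (now_ms - 15 * 60 * 1000) << LOGICAL_BITS
print("lag in seconds:", checkpoint_lag_seconds(checkpoint_tso, now_ms))
print("lagging:", is_lagging(checkpoint_tso, now_ms))
```

Alerting on this lag well before it approaches the 24-hour Kafka retention or the 72-hour GC time leaves room to react while the data can still be replayed.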

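Keep a Source-Side Record: As far as I can tell, TiCDC has no documented `ticdc.enable-debug-mode` option, so do not rely on a "debug mode" to dump row data. What it does have is a log level setting (the `--log-level` flag of `cdc server`, which can be set to `debug`), and this gives more detail about what the sink is doing. Debug logs are verbose and are not a reliable record of every row, though. For a durable record on the source side, it may be worth checking whether a second changefeed over the same tables, writing to a storage service such as S3 or NFS (which the TiCDC documentation describes for v6.5.0 and later), fits your setup. That copy can then be compared against what arrived in Kafka.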
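Whatever is used to capture the source-side record, the comparison step could look roughly like the sketch below. The `ChangeKey` shape (table, primary key, commit TS) and the `find_missing` helper are hypothetical; how the two lists get populated depends on your own capture and consumer code.

```python
from typing import Iterable, List, NamedTuple


class ChangeKey(NamedTuple):
    """Hypothetical identity of one row change event."""
    table: str
    primary_key: str
    commit_ts: int


def find_missing(source: Iterable[ChangeKey], delivered: Iterable[ChangeKey]) -> List[ChangeKey]:
    """Return the events recorded at the source that never showed up downstream,
    ordered by commit_ts so the earliest gap comes first."""
    delivered_set = set(delivered)
    missing = {key for key in source if key not in delivered_set}
    return sorted(missing, key=lambda k: (k.commit_ts, k.table, k.primary_key))


# Made-up sample data: three events recorded at the source, two seen in Kafka.
source_side = [
    ChangeKey("orders", "1001", 450000000000000001),
    ChangeKey("orders", "1002", 450000000000000002),
    ChangeKey("users", "42", 450000000000000003),
]
kafka_side = [
    ChangeKey("orders", "1001", 450000000000000001),
    ChangeKey("users", "42", 450000000000000003),
]

for key in find_missing(source_side, kafka_side):
    print("missing downstream:", key)
```

The smallest `commit_ts` among the missing events is a natural candidate for where to restart the incremental replication, provided it is still inside the GC window.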

Verify Kafka Configuration: Carefully check your Kafka configuration to ensure it is properly set up to handle the expected load and message retention time. Ensure that the Kafka cluster has sufficient resources (storage, memory, etc.) and that the retention time is set appropriately to avoid premature deletion of messages.

Check Network Connectivity: Verify the network path between the TiCDC instances and the Kafka brokers. Make sure no network faults or firewall rules are blocking traffic between them.
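A quick way to test reachability from a TiCDC host is a plain TCP connection attempt, sketched below. `can_connect` is a helper name chosen for this example. A successful connection only shows that the port is reachable, not that the broker is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, calling `can_connect("kafka-1.example.internal", 9092)` for each broker in your sink URI (the hostname here is a placeholder) shows which brokers are unreachable from that machine.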

Consider Other Replication Tools: Depending on your requirements, you could look at another replication path, such as TiDB Binlog, whose Drainer component can also write to Kafka. Note that TiDB Data Migration (DM) replicates from MySQL-compatible sources into TiDB, so it does not help with getting data out of TiDB into Kafka.

| username: system

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.