Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: How to enable this new feature: TiCDC synchronizes data to Kafka, throughput increased from 4,000 rows per second to 35,000 rows per second, and replication latency reduced to 2 seconds.
[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.0
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Phenomenon and Impact] After upgrading the cluster from v5.4.0 to v6.5.0, I found that the performance of TiCDC tasks synchronizing data to downstream Kafka did not improve. When batch transactions involve large tables, the task gets stuck and synchronization to downstream Kafka becomes very slow.
I would like to ask: which parameters need to be enabled or configured to improve this throughput? Thank you!
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
I went through the official configuration documentation but could not find the relevant parameter.
To enable the new feature of TiCDC synchronizing data to Kafka, you need to follow these steps:
- Ensure your TiCDC version is v4.0.9 or higher, as this feature was introduced in that version.
- Follow the instructions in the official TiCDC documentation to create a changefeed that synchronizes data to Kafka; see Replicate Data to Kafka for detailed steps (a minimal example command is sketched after this list).
- When creating the changefeed, you can control the throughput of TiCDC synchronizing data to Kafka by configuring the `sink-uri` parameter. Specifically, you can add the following configuration items to `sink-uri`:

```
kafka.producer.config.bootstrap.servers=<kafka-broker-list>
kafka.producer.config.max.request.size=<max-request-size>
```

Here, `<kafka-broker-list>` is the list of brokers in your Kafka cluster, and `<max-request-size>` is the maximum size of a single Kafka message. By adjusting these two parameters appropriately, you can increase the throughput of TiCDC synchronizing data to Kafka.
Note that if you encounter issues while using TiCDC to synchronize data to Kafka, you can refer to the Troubleshoot TiCDC section of the documentation.
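For reference, a minimal sketch of creating such a changefeed with the cdc cli; the server address, changefeed ID, topic name, and the 10 MB `max-message-bytes` value are placeholder assumptions, not values taken from this thread:

```shell
# Sketch: create a changefeed that writes to Kafka using the canal-json protocol.
# Assumed placeholders: TiCDC server at 127.0.0.1:8300, topic "test-topic",
# changefeed ID "kafka-task".
cdc cli changefeed create \
  --server=http://127.0.0.1:8300 \
  --changefeed-id="kafka-task" \
  --sink-uri="kafka://127.0.0.1:9092/test-topic?protocol=canal-json&max-message-bytes=10485760"
```

Keep in mind that `max-message-bytes` cannot usefully exceed the `message.max.bytes` limit configured on the Kafka brokers, since the brokers will reject larger messages.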
Why is the default value of this parameter 10 MB in v6.5.0? Isn't it the case that the larger the value, the higher the throughput?
I added the following to the configuration file:

```
kafka.producer.config.max.request.size=1048576
```

When I then updated the task, an error occurred:

```
Error: component TiCDC changefeed's config file ./kafka-to-tianjin-pro-01.toml contained unknown configuration options: sink.kafka.producer.config.max.request.size
```
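For comparison, a changefeed configuration file that v6.5 does accept looks roughly like the sketch below; the schema has no `sink.kafka.producer.*` keys, which is why the option above is reported as unknown. The dispatcher rule shown is only an illustrative assumption:

```toml
# Sketch of a v6.5 changefeed config file; there is no kafka.producer.* section.
[sink]
protocol = "canal-json"
# Assumed example rule: match all tables, partitioning rows by index value.
dispatchers = [{ matcher = ["*.*"], partition = "index-value" }]
```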
The higher efficiency of the v6.5 Kafka sink comes from a design change: previously a single sink writing to the downstream would slow down the entire pipeline, whereas now multiple concurrent writes are used by default. You can test the efficiency directly.
TiCDC made many performance optimizations to the Kafka sink in version 6.5, and these optimizations do not need to be enabled manually.
Regarding your error: these parameters currently need to be written into the sink-uri, for example `--sink-uri "kafka://127.0.0.1:9092/test?max-message-bytes=671088&protocol=canal-json"`. Work is underway to support setting these parameters directly in the configuration file.
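As a rough sketch of that workflow (the server address and changefeed ID are placeholder assumptions; a changefeed must be paused before its sink-uri can be changed, then resumed):

```shell
# Sketch: move the message-size setting into the sink-uri of an existing changefeed.
# Assumed placeholders: TiCDC server at 127.0.0.1:8300, changefeed ID "kafka-task".
cdc cli changefeed pause  --server=http://127.0.0.1:8300 -c kafka-task
cdc cli changefeed update --server=http://127.0.0.1:8300 -c kafka-task \
  --sink-uri="kafka://127.0.0.1:9092/test?max-message-bytes=671088&protocol=canal-json"
cdc cli changefeed resume --server=http://127.0.0.1:8300 -c kafka-task
```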