Issues with TiDB's CDC to Kafka

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB的CDC到kafka的问题

| username: TiDBer_8rWAgqMU

I found that when TiDB’s CDC sinks to Kafka, there are some issues with which partition it sends to. You can’t choose a specific field of the table as the partition strategy; you can only use the primary key or unique index. This causes problems, as follows:

Assume the transaction table has the following data:
| Transaction ID | Customer ID | Amount Spent | Time Spent |
| 001 | CNT01 | 10 | 1s |
| 002 | CNT01 | 6 | 2s |
| 003 | CNT01 | 5 | 3s |
| 004 | CNT01 | 3 | 4s |

The requirement is to calculate the total amount spent by the customer in real-time. Clearly, the amounts calculated in chronological order are: 10, 16, 21, 24.
However, if the number of partitions in the sink Kafka is 3 and the data is output to the topic based on the primary key (i.e., transaction ID), and assuming partition2 is slow, the data from partition3 might reach Kafka’s calculation logic first. In this case, the amounts calculated in order would be: 10, 15, 21, 24.
This is obviously incorrect.
If we could specify the partition strategy when sinking to Kafka, for example, by specifying the customer ID as the partition strategy, then these 4 pieces of data would arrive in the same partition in chronological order. The consistency within the same partition in Kafka would ensure that the amounts calculated in chronological order match our expectations.

| username: Billmay表妹 | Original post link

Which version is it?

| username: TiDBer_8rWAgqMU | Original post link

v6.1.0
Additionally, I think this is unrelated to the version. It is only related to the scenario and Kafka’s partition. I just hope TiDB’s CDC can adapt to it.

| username: yilong | Original post link

Can custom distribution rules meet the requirements?
https://docs.pingcap.com/zh/tidb/stable/manage-ticdc#custom-distribution-rules-for-kafka-sink-topics-and-partitions

| username: TiDBer_8rWAgqMU | Original post link

It doesn’t meet the requirements. The link you provided is customized to the table level, schema.table. My suggestion is to be more detailed than the table level, down to the column level, schema.table.column.