Issues with Syncing VARBINARY and MEDIUMBLOB to Kafka Using TiCDC

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC同步varbinary以及mediumblob至kafka的问题

| username: TIDB救我狗命

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 6.0.1
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]
The data written to Kafka is garbled.

| username: TIDB救我狗命 | Original post link

The Kafka message value is a string; it seems to have been cast directly from the binary data. Is there any solution?

| username: nongfushanquan | Original post link

What protocol are you using? Is it canal-json? We have a documentation PR describing this issue, but it hasn’t been merged yet.

| username: TIDB救我狗命 | Original post link

I am using the protobuf protocol.

| username: redgame | Original post link

You can try using a byte array to represent the message.
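
To illustrate why a byte array helps, here is a minimal Python sketch. It is generic code, not TiCDC or Kafka client code; the sample payload and the helper names `lossy_text_round_trip` and `base64_round_trip` are made up for this example. It shows that decoding arbitrary binary data as UTF-8 text loses information, while keeping the raw bytes (or a reversible text encoding such as Base64) preserves the value.

```python
import base64

# Hypothetical VARBINARY/MEDIUMBLOB value; real column data will differ.
payload = bytes([0x00, 0xFF, 0xFE, 0x80, 0x41, 0x42])


def lossy_text_round_trip(data: bytes) -> bytes:
    """Decode as UTF-8 text (invalid bytes become U+FFFD), then re-encode."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


def base64_round_trip(data: bytes) -> bytes:
    """Encode to Base64 text for transport, then decode back to bytes."""
    text = base64.b64encode(data).decode("ascii")
    return base64.b64decode(text)


# The string cast cannot be reversed; the Base64 form can.
print(lossy_text_round_trip(payload) == payload)  # False
print(base64_round_trip(payload) == payload)      # True
```

On the consumer side, this roughly corresponds to reading the Kafka message value as raw bytes (for example, with a byte-array deserializer instead of a string deserializer) and letting the protocol decoder interpret it. Whether that alone resolves the garbling in this thread depends on where the string conversion happens, which the thread does not confirm.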

| username: Anna | Original post link

character-set = "auto"

Specifies the character set of the source data file. Lightning will convert the source file from the specified character set to UTF-8 encoding during the import process.

This configuration item is currently only used to specify the character set of CSV files. The following options are supported:

- utf8mb4: The source data file uses UTF-8 encoding.

- GB18030: The source data file uses GB-18030 encoding.

- GBK: The source data file uses GBK encoding (GBK encoding is an extension of the GB-2312 character set, also known as Code Page 936).

- binary: Do not attempt to convert encoding (default).

Leaving this configuration empty will default to “binary”, meaning no attempt to convert encoding.

It is important to note that Lightning does not make assumptions about the character set of the source data file and will only transcode and import data based on this configuration.

If the character set setting does not match the actual encoding of the source data file, it may result in import failure, missing data, or garbled data.
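
The mismatch described above can be reproduced with plain Python codecs. This is only a sketch: it does not invoke Lightning, and the sample string is an assumption chosen for illustration.

```python
# Sample text chosen for illustration; other non-ASCII Chinese text behaves similarly.
original = "数据"

# Bytes as they would appear in a GBK-encoded source file.
gbk_bytes = original.encode("gbk")

# Wrong setting: treating GBK bytes as UTF-8 yields replacement characters.
misread = gbk_bytes.decode("utf-8", errors="replace")

# Matching setting: decode with the actual source encoding, then store as UTF-8.
transcoded = gbk_bytes.decode("gbk").encode("utf-8")

print(misread == original)                     # False
print(transcoded.decode("utf-8") == original)  # True
```

With strict error handling instead of `errors="replace"`, the wrong decode raises `UnicodeDecodeError`, which is comparable to the import failure case mentioned above.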