How to Parse Binary Data in TiCDC Canal-JSON Format

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: How to parse binary data in the TiCDC canal-json format

| username: TiDBer_o6NpEZY9

【TiDB Usage Environment】Test
【TiDB Version】v7.1.2
【Reproduction Path】What operations were performed to encounter the issue

【Encountered Issue: Phenomenon and Impact】
When replicating data to Kafka through TiCDC and then loading the raw Kafka messages directly into HDFS, I found that binary-typed columns came out garbled. How should a downstream consumer parse binary data in canal-json when syncing through Kafka? Alternatively, is there a way to configure TiCDC so that binary data in canal-json is base64-encoded?

【Resource Configuration】
【Attachment: Screenshot/Log/Monitoring】
Table creation statement:

CREATE TABLE `all_field_test` (
  `id` bigint(20) NOT NULL,
  `col1` datetime DEFAULT NULL,
  `col2` datetime(6) DEFAULT NULL,
  `col3` time DEFAULT NULL,
  `col4` decimal(20,5) DEFAULT NULL,
  `col5` timestamp(6) NULL DEFAULT NULL,
  `col6` date DEFAULT NULL,
  `col7` year(4) DEFAULT NULL,
  `col8` binary(16) DEFAULT NULL,
  `col9` blob DEFAULT NULL,
  `col10` smallint(6) DEFAULT NULL,
  `col11` float DEFAULT NULL,
  `col12` double DEFAULT NULL,
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin

Test data in the database:

Original canal-json text data: The data in col8 and col9 has garbled characters. How should the downstream parse it?

{
    "id":0,
    "database":"test",
    "table":"all_field_test",
    "pkNames":[
        "id"
    ],
    "isDdl":false,
    "type":"UPDATE",
    "es":1700623634900,
    "ts":1700623635892,
    "sql":"",
    "sqlType":{
        "id":-5,
        "col1":93,
        "col2":93,
        "col3":92,
        "col4":3,
        "col5":93,
        "col6":91,
        "col7":12,
        "col8":2004,
        "col9":2004,
        "col10":5,
        "col11":7,
        "col12":8
    },
    "mysqlType":{
        "col8":"binary",
        "col12":"double",
        "col1":"datetime",
        "col2":"datetime",
        "col3":"time",
        "col6":"date",
        "col7":"year",
        "col11":"float",
        "id":"bigint",
        "col4":"decimal",
        "col5":"timestamp",
        "col9":"blob",
        "col10":"smallint"
    },
    "old":[
        {
            "id":"4",
            "col1":"2023-11-21 11:48:13",
            "col2":"2023-11-21 11:48:16.000000",
            "col3":"02:10:00",
            "col4":"12345.67890",
            "col5":"2023-11-21 11:48:40.000000",
            "col6":"2023-11-21",
            "col7":"2023",
            "col8":"00\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
            "col9":"abcd",
            "col10":"2",
            "col11":"3.14",
            "col12":"6.28"
        }
    ],
    "data":[
        {
            "id":"4",
            "col1":"2023-11-21 11:48:13",
            "col2":"2023-11-22 11:48:16.000000",
            "col3":"02:10:00",
            "col4":"12345.67890",
            "col5":"2023-11-21 11:48:40.000000",
            "col6":"2023-11-21",
            "col7":"2023",
            "col8":"00\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
            "col9":"abcd",
            "col10":"2",
            "col11":"3.14",
            "col12":"6.28"
        }
    ],
    "_tidb":{
        "commitTs":445808282147225603
    }
}
| username: dba远航 | Original post link

Consider whether it is caused by incorrect character encoding.

| username: 像风一样的男子 | Original post link

Binary-type columns store raw bytes, so they need to be converted for display, for example with hex(column_name).

| username: 有猫万事足 | Original post link

From the TiCDC code, if the field type is BinaryMySQLType, the value is encoded into a string using ISO8859_1.
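
For reference, here is a minimal Go sketch of reversing that mapping. Since ISO-8859-1 maps each byte 0x00–0xFF to the Unicode code point of the same value, a downstream consumer can recover the original bytes by walking the JSON string rune by rune. The latin1Bytes helper and the trimmed message are illustrative, not part of TiCDC:

package main

import (
	"encoding/json"
	"fmt"
)

// latin1Bytes reverses the ISO-8859-1 mapping: every rune in the
// string corresponds to exactly one original byte of the same value.
func latin1Bytes(s string) ([]byte, error) {
	buf := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("rune %U is outside the ISO-8859-1 range", r)
		}
		buf = append(buf, byte(r))
	}
	return buf, nil
}

func main() {
	// A trimmed "data" entry from the message above; only the binary columns.
	msg := `{"col8":"00\u0000\u0000\u0000\u0000","col9":"abcd"}`
	var row map[string]string
	if err := json.Unmarshal([]byte(msg), &row); err != nil {
		panic(err)
	}
	for _, col := range []string{"col8", "col9"} {
		raw, err := latin1Bytes(row[col])
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s = %x\n", col, raw) // col8 = 303000000000, col9 = 61626364
	}
}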

| username: TiDBer_o6NpEZY9 | Original post link

In the case above, the data of col9 decoded using ISO_8859_1 is \uFEFFabcd, which includes a zero-width no-break space (ZWNBSP, U+FEFF in Unicode). I am not sure whether this character was added by TiDB or by the SQL client :joy:
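
One way to narrow down where the U+FEFF came from: recover the raw blob bytes (for example with the illustrative latin1Bytes helper from the previous reply) and check whether the bytes themselves start with the UTF-8 byte order mark EF BB BF. If they do, the BOM was stored in the blob before TiCDC ever read the row. A small sketch under that assumption:

package main

import (
	"bytes"
	"fmt"
)

func main() {
	// Suppose raw was recovered via the latin1Bytes sketch above; these
	// bytes would correspond to a blob containing a UTF-8 BOM plus "abcd".
	raw := []byte{0xEF, 0xBB, 0xBF, 'a', 'b', 'c', 'd'}

	// EF BB BF is the UTF-8 encoding of U+FEFF. If it is present in the
	// recovered bytes, the BOM was written into the blob upstream of TiCDC.
	if bytes.HasPrefix(raw, []byte{0xEF, 0xBB, 0xBF}) {
		fmt.Println("blob itself starts with a UTF-8 BOM")
	}
}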

| username: 有猫万事足 | Original post link

It's hard to say; you have to check from the source. Before importing into HDFS, decode the corresponding fields using ISO_8859_1 and print them. If they still match the original data at that point, then the change happened during the HDFS import.

Actually, the corresponding fields have already been encoded as ISO_8859_1 strings, so when importing into HDFS you can simply declare these two columns as varchar and set the table's character set to ISO_8859_1.
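
Since no canal-json base64 setting was identified in this thread, another workaround is for the consumer itself to re-encode the recovered bytes as base64 before writing to HDFS; base64 text is plain ASCII, which sidesteps charset issues entirely. A minimal sketch, reusing the illustrative latin1Bytes helper from the earlier reply (not a TiCDC API):

package main

import (
	"encoding/base64"
	"fmt"
)

// latin1Bytes is the same illustrative helper as in the earlier sketch.
func latin1Bytes(s string) ([]byte, error) {
	buf := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("rune %U is outside the ISO-8859-1 range", r)
		}
		buf = append(buf, byte(r))
	}
	return buf, nil
}

func main() {
	raw, err := latin1Bytes("abcd") // col9's value from the message above
	if err != nil {
		panic(err)
	}
	// Base64 output is pure ASCII, so it survives any downstream charset.
	fmt.Println(base64.StdEncoding.EncodeToString(raw)) // YWJjZA==
}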