How to Perform Hash Calculation Based on Schema Name and Table Name When Data is Sunk to Kafka and Dispatcher is Set to "table"?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 数据Sink到Kafka, 当dispatcher ="table"时,如何根据Scheme名和table名做Hash计算的?

| username: ShengFeng

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v5.4.0
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact]
When sinking data to Kafka using the open protocol with the dispatcher configured as “table”, table-level DDLs are broadcast to every partition. (My understanding is that broadcasting is reasonable when the dispatcher is not “table”, but when it is set to “table”, table-level DDLs should be sent to the same partition as the table’s data.) To guarantee the DML > DDL > DML execution order, and to ensure that a DDL is executed only by the consumer of the partition that carries the table’s data, I need to know the hash algorithm that computes the partition from the schema name and table name when the dispatcher is configured as “table”.

| username: Billmay表妹 | Original post link

When the dispatcher is configured as dispatcher = "table", the algorithm for calculating the hash based on the Schema name and Table name is as follows:

  1. Concatenate the Schema name and Table name into a single string, separated by a dot (.), for example, testdb.testtable.

  2. Perform a hash calculation on the concatenated string to obtain a hash value.

  3. Take the modulo of the hash value with the number of partitions to get a partition number.

  4. Write the data to the corresponding partition.
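The four steps above can be sketched as follows. Note that this is only an illustration of the described procedure: CRC32 over the dotted string is an assumption, not necessarily the hash function TiCDC actually uses, so verify against the TiCDC source for your version before relying on the computed partition.

```python
import zlib

def partition_for_table(schema: str, table: str, partition_num: int) -> int:
    """Illustrative sketch of the steps above: hash "schema.table" and
    take it modulo the partition count. CRC32 is an assumption here --
    TiCDC's real hash function may differ."""
    key = f"{schema}.{table}"            # step 1: concatenate with a dot
    h = zlib.crc32(key.encode("utf-8"))  # step 2: hash (non-negative in Python 3)
    return h % partition_num             # step 3: modulo the partition count
                                         # step 4: write to that partition

print(partition_for_table("testdb", "testtable", 24))
```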

Therefore, when using dispatcher = "table", a table-level DDL is broadcast to every partition, but the intent is that only the consumer of the partition that carries the table’s data executes it. In TiCDC, DDL events are captured from the upstream cluster’s change stream and then written to Kafka; before writing, TiCDC distributes events to partitions according to the dispatcher configuration. Having only that one consumer execute the DDL preserves the DML > DDL > DML order and avoids running the DDL where it is not needed, which improves processing efficiency.

| username: ShengFeng | Original post link

Hello, according to the provided algorithm, the calculated Partition is incorrect. Please help check JH_150167.at_product_cha_act (format: dbName.tableName), with 24 partitions. To which partition will the message be sent? Additionally, how should negative Hash values be handled?
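On the negative-hash question: whether a hash can come out negative depends on the language. A signed 32-bit hash (as in Java or Go’s int32) can be negative, and the common fix is to clear the sign bit before taking the modulo; this is, for example, what Kafka’s own Java DefaultPartitioner does with its murmur2 result. A minimal Python illustration (Python’s `%` already returns a non-negative result for a positive modulus, so the mask is shown for parity with signed-integer languages):

```python
def partition_from_signed_hash(h: int, partition_num: int) -> int:
    """Map a possibly-negative signed 32-bit hash to a partition by
    clearing the sign bit (h & 0x7FFFFFFF), the convention used by
    Kafka's Java DefaultPartitioner."""
    return (h & 0x7FFFFFFF) % partition_num

print(partition_from_signed_hash(-123456789, 24))  # -> 11
```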

| username: ljluestc | Original post link

You seem to be describing an issue related to the configuration of TiDB and its integration with Kafka as a data sink. Specifically, you mention that when the dispatcher is configured as dispatcher = “table”, table-level DDL (Data Definition Language) statements are being broadcast to every partition, which is not the expected behavior.

To ensure that DML (Data Manipulation Language) operations are executed in order before DDL operations and to restrict the execution of DDL statements to the specific partition handling the data, it is necessary to understand the hash calculation algorithm used when the dispatcher is set to “table”. It appears that this algorithm is based on the schema name and table name.

If you have any specific questions or need assistance with this issue, please provide more details or clarify your concerns.

| username: ShengFeng | Original post link

Yes, the effect I hope to achieve is that when the dispatcher is configured as dispatcher = “table”, table-level DDL statements are executed only in the specific partition that processes the table’s data, not in every partition. That is why I want to understand the hash algorithm used when the dispatcher is set to “table”: when a consumer receives a DDL from a partition, it would recompute the hash to obtain the partition number; if it matches the partition that processes the data, the DDL is executed, otherwise it is skipped.
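The consumer-side filtering described above can be sketched as follows. The helper names here are hypothetical, and `demo_hash` is a made-up stand-in: in practice `compute_partition` must be exactly the same hash TiCDC applies to the table’s DML, taken from the TiCDC source for your version.

```python
def should_execute_ddl(schema: str, table: str,
                       my_partition: int, partition_num: int,
                       compute_partition) -> bool:
    """Execute a broadcast DDL only when this consumer owns the
    partition that the table's DML rows are dispatched to.
    compute_partition must match the hash TiCDC uses for DML."""
    return compute_partition(schema, table, partition_num) == my_partition

# demo_hash is a made-up stand-in, NOT TiCDC's real hash.
def demo_hash(schema: str, table: str, n: int) -> int:
    return sum(f"{schema}.{table}".encode()) % n

print(should_execute_ddl("testdb", "testtable", 3, 24, demo_hash))
```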

| username: ljluestc | Original post link

In TiDB, when the dispatcher is configured as dispatcher = “table”, the calculation used to determine the partition for DDL statements depends on the partitioning method used by the table. TiDB supports several partitioning methods, including range, hash, and list partitioning.

Range partitioning: the partition is not determined by a hash at all; it is chosen from the value ranges defined for each partition. A DDL statement applies only to the partitions whose value ranges match the condition in the statement.

Hash partitioning: the partition is determined by a hash calculation over the partition key referenced in the DDL statement. The DDL is executed if the computed partition matches the partition currently being processed; otherwise it is skipped.

List partitioning: the partition is determined by the explicit value lists defined for each partition, not by a hash. The DDL applies only to the partitions whose value lists match the condition in the statement.
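For context on the hash-partitioning case above: MySQL/TiDB-style `PARTITION BY HASH(expr) PARTITIONS N` assigns a row to partition `MOD(expr, N)`. A minimal sketch of that assignment rule (table partitioning, which is separate from Kafka partition dispatching):

```python
def hash_partition(expr_value: int, num_partitions: int) -> int:
    """MySQL/TiDB PARTITION BY HASH semantics: the partition is the
    partitioning expression's value modulo the partition count."""
    return expr_value % num_partitions

print(hash_partition(103, 4))  # row with expr value 103 -> partition 3
```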