High Concurrency [Batch on Duplicate Key Update]: AUTO_RANDOM(5) Generates Duplicates

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 高并发【批量on duplicate key update】时:AUTO_RANDOM(5) 生成重复

| username: meiliangdeng

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 6.5.3 (3 TiKV machines)
[Reproduction Path] Not yet reproducible
[Encountered Problem: Problem Phenomenon and Impact]:
I have a table in TiDB 4 (the table structure is identical in TiDB 6.5.3):

CREATE TABLE `ads_test` (
  `id` bigint(20) NOT NULL /*T![auto_rand] AUTO_RANDOM(5) */ COMMENT 'Auto-increment Id',
  `index_dt` datetime(3) DEFAULT '1970-01-01 00:00:00.000',
  `store_id` varchar(64) DEFAULT NULL,
  `visit_date_hour` datetime(3) DEFAULT NULL,
  `visit_date` varchar(64) DEFAULT NULL,
  `visit_hour` varchar(64) DEFAULT NULL,
  `add_payment_info_cnt` bigint(20) DEFAULT NULL,
  `etl_datetime` datetime(3) DEFAULT NULL,
  `dt` varchar(64) DEFAULT '1970-01-01',
  `hr` varchar(64) DEFAULT '00',
  PRIMARY KEY (`id`),
  KEY `dws_index_dt` (`index_dt`,`dt`,`hr`),
  UNIQUE KEY `uk_store_id_visit_date_visit_hr` (`store_id`,`visit_date_hour`,`index_dt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin /*T![auto_rand_base] AUTO_RANDOM_BASE=2011348235 */ COMMENT='Test Table' 

I use TiDB 4's TiCDC to synchronize data to TiDB 6.5.3 through the following pipeline:
TiDB 4 -> TiCDC -> Kafka -> Java code (JDBC write) -> TiDB 6.5.3

When writing to TiDB 6 via JDBC, I use the insert into … on duplicate key update … syntax and insert data in batches (each batch is 50 to 100 rows). The id column is not included in the statement; I expect TiDB 6 to generate a new id automatically via AUTO_RANDOM.
The data volume is over 90 million rows. After synchronization, I found that some rows were missing from TiDB 6. I log every synchronized row at the Java layer; the Java logs show these rows were synchronized, yet they cannot be found in TiDB 6.

Currently, my suspicion is that the rows were written to TiDB 6, but later inserts generated duplicate ids via AUTO_RANDOM, so the earlier rows were overwritten through the on duplicate key update mechanism.
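For reference, here is a hedged sketch of the shape of the batch upsert; the column list, sample values, and update list are illustrative, not the exact production SQL:

-- id is omitted, so TiDB allocates it via AUTO_RANDOM; a conflict on ANY unique key
-- (the PRIMARY KEY or uk_store_id_visit_date_visit_hr) turns the insert into an
-- update of the existing row instead of raising an error
INSERT INTO ads_test
  (index_dt, store_id, visit_date_hour, visit_date, visit_hour,
   add_payment_info_cnt, etl_datetime, dt, hr)
VALUES
  ('2023-08-01 00:00:00.000', 'store_001', '2023-08-01 10:00:00.000', '2023-08-01', '10', 3, NOW(3), '2023-08-01', '10'),
  ('2023-08-01 00:00:00.000', 'store_002', '2023-08-01 10:00:00.000', '2023-08-01', '10', 5, NOW(3), '2023-08-01', '10')
ON DUPLICATE KEY UPDATE
  add_payment_info_cnt = VALUES(add_payment_info_cnt),
  etl_datetime         = VALUES(etl_datetime);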

| username: RenlySir | Original post link

You can check the official documentation: AUTO_RANDOM | PingCAP archived documentation site. The relevant note on choosing the shard bits (S) and the value range (R):

Note

Selection of shard bit length (S):

  • Since the total bit length is fixed at 64 bits, the number of shard bits will affect the number of auto-increment bits: when the number of shard bits increases, the number of auto-increment bits decreases, and vice versa. Therefore, you need to balance “randomness of automatically assigned values” and “available space.”
  • The best practice is to set the shard bits to log(2, x), where x is the number of storage engines in the current cluster. For example, if there are 16 TiKV nodes in a TiDB cluster, the shard bits can be set to log(2, 16), which is 4. After all Regions are evenly scheduled to each TiKV, the bulk write load can be evenly distributed to different TiKV nodes to maximize resource utilization.

Selection of value range length (R):

  • Typically, the R parameter needs to be set when the application’s numeric type cannot represent the full 64-bit integer.
  • For example: the JSON number type range is [-(2^53)+1, (2^53)-1], and TiDB can easily assign integers beyond this range to the AUTO_RANDOM(5) column, causing exceptions when the application reads the column’s value. In this case, you can use AUTO_RANDOM(5, 54) instead of AUTO_RANDOM(5) to ensure that TiDB does not assign integers greater than 9007199254740991 (2^53-1).
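A sketch of what the two forms described in the note look like as DDL (table and column names here are illustrative):

-- default: 5 shard bits, values drawn from the full signed 64-bit range
CREATE TABLE t_full_range (
  id BIGINT PRIMARY KEY AUTO_RANDOM(5),
  c  VARCHAR(64)
);

-- 5 shard bits, but allocated values never exceed 9007199254740991 (2^53-1),
-- so they remain exactly representable as a JSON/JavaScript number
CREATE TABLE t_js_safe (
  id BIGINT PRIMARY KEY AUTO_RANDOM(5, 54),
  c  VARCHAR(64)
);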
| username: 有猫万事足 | Original post link

Is it possible that the conflict was on the unique key rather than on the primary key? To support your suspicion, you still need to rule out the impact of the unique key; one way to do that is sketched below.
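A hedged check, assuming the TiDB 4 source table carries the same columns: look for rows in the source that collide on the unique key defined in TiDB 6, because ON DUPLICATE KEY UPDATE will collapse each such group into a single row, making the others look "missing" even though nothing was lost.

-- run against the TiDB 4 source table (or the Kafka payload loaded into a staging table)
SELECT store_id, visit_date_hour, index_dt, COUNT(*) AS cnt
FROM ads_test
GROUP BY store_id, visit_date_hour, index_dt
HAVING COUNT(*) > 1;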

| username: zhanggame1 | Original post link

It should be fine. The documentation only says that reading extremely large IDs may cause problems on the application side; writing them to the database is definitely fine.

| username: zhanggame1 | Original post link

An error will occur when a unique index conflict occurs during a Java insert.

| username: Hacker_4BgeX58z | Original post link

With only 90-odd million rows of data, AUTO_RANDOM producing duplicates would be a major bug, which is unlikely.
You can modify the code to drop the "insert into … on duplicate key update" syntax and use a plain "insert into" instead. If duplicates really do occur, the resulting errors will make the issue much easier to troubleshoot. You can also print the generated ID after each insert to confirm whether AUTO_RANDOM is actually producing duplicates; see the sketch below.
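A minimal sketch of that approach in SQL terms (values are illustrative; in the Java code the generated key could typically be read back via JDBC's Statement.getGeneratedKeys()):

-- plain insert: a duplicate on the primary key or the unique key now raises
-- a duplicate-entry error (1062) instead of being silently merged
INSERT INTO ads_test
  (index_dt, store_id, visit_date_hour, visit_date, visit_hour,
   add_payment_info_cnt, etl_datetime, dt, hr)
VALUES
  ('2023-08-01 00:00:00.000', 'store_001', '2023-08-01 10:00:00.000',
   '2023-08-01', '10', 3, NOW(3), '2023-08-01', '10');

-- log the id TiDB just allocated (per the AUTO_RANDOM documentation,
-- implicitly allocated values affect LAST_INSERT_ID())
SELECT LAST_INSERT_ID();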

| username: redgame | Original post link

No, check the Java logs.

| username: meiliangdeng | Original post link

Please look at the key point in my post: when I use "insert into … on duplicate key update," I do not include the id column, so the id is generated by TiDB itself. If the unique key of an inserted row duplicates an existing row, "on duplicate key update" merges the inserted row into the existing one; that is simply how "on duplicate key update" works.

| username: meiliangdeng | Original post link

The "on duplicate key update" syntax will not report an error; only with a plain "insert into" would a duplicate id produce an error.
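A minimal demonstration of that behavior on a hypothetical toy table (names are illustrative):

-- toy table with an AUTO_RANDOM primary key and a separate unique key
CREATE TABLE odku_demo (
  id BIGINT PRIMARY KEY AUTO_RANDOM(5),
  k  VARCHAR(64),
  v  BIGINT,
  UNIQUE KEY uk_k (k)
);

INSERT INTO odku_demo (k, v) VALUES ('a', 1);

-- plain insert: the unique-key duplicate is reported as an error (1062)
INSERT INTO odku_demo (k, v) VALUES ('a', 2);

-- on duplicate key update: no error, the new row is merged into the existing one
INSERT INTO odku_demo (k, v) VALUES ('a', 2)
ON DUPLICATE KEY UPDATE v = VALUES(v);

SELECT k, v FROM odku_demo;   -- still a single row for k = 'a', now with v = 2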

| username: meiliangdeng | Original post link

Well, I have been repeatedly testing it, and I haven’t been able to reproduce the issue yet, which is why I posted yesterday to seek advice. I will continue to try and verify it.

| username: meiliangdeng | Original post link

Because the data sent by TiCDC to Kafka is still retained, I have replayed this batch of data [90+ million] several times, but it has not been reproduced so far.

| username: meiliangdeng | Original post link

At the moment, it’s just a guess. Besides manual deletion and AUTO_RANDOM(5) ID duplication, I can’t think of any other scenario that would cause this data to be lost. There are currently relatively few resources on TiDB’s ON DUPLICATE KEY UPDATE and AUTO_RANDOM(5).

| username: meiliangdeng | Original post link

Throughout the whole process from TiDB 4 to TiDB 6, the TiDB 4 id of every inserted row is printed. Because in TiDB 6 I let AUTO_RANDOM generate the id, the TiDB 6 id is not printed.

| username: knull | Original post link

With AUTO_RANDOM(5) on a signed bigint, the layout is 1 sign bit + 5 shard bits + 58 auto-increment bits, so the auto-increment space alone holds 2^58 values, an extremely large range.
A simple verification: check MAX(id & 0x03ffffffffffffff), i.e. the low 58 auto-increment bits, to see how far the allocation has actually advanced.
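A ready-to-run form of that check, assuming the signed-bigint layout above (the mask keeps only the 58 auto-increment bits and strips the shard bits):

-- highest auto-increment part allocated so far, next to the current row count
SELECT MAX(id & 0x03FFFFFFFFFFFFFF) AS max_increment_part,
       COUNT(*)                     AS total_rows
FROM ads_test;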

Secondly, if id generation really did wrap around and start over, your 90 million rows is nowhere near that scale (even 2^32 is about 4.3 billion, let alone 2^58).
So either there was already data in the table?
Or, as the other poster mentioned, it was a unique key (uk) duplicate?