Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Tidb id暴增问题 (TiDB ID surge problem)
[TiDB Usage Environment] Production, Testing Environment
[TiDB Version] v5.1.0
[Problem Encountered]
An operations colleague wrote test data into the production database, causing the production IDs to jump irregularly. The IDs of rows subsequently inserted by the application continued from those large values, which makes it hard to determine start and end IDs when exporting data (exporting too much at once would crash the database; previously each export task covered 1-2 billion rows). Does TiDB support reordering the production table's IDs so that they increment from 1 again?
How the IDs changed over time:
2022-04-11 17:53:21 1461225468
2022-04-11 17:53:21 33326912000827430
2022-04-11 17:53:21 33326912000827431
2022-06-30 20:25:16 1774796495
2022-06-30 20:25:17 3644757138153474
2022-07-06 10:40:03 3645241416307060
2022-07-06 10:40:03 5672171301314802
2022-07-26 00:00:00 5730052744046899
2022-07-29 17:20:00 5730052763127046
2022-07-29 17:20:00 14107755464138770
2022-08-14 06:40:12 14107755543038090
2022-08-14 06:40:12 19744916779638790
2022-08-17 16:00:04 19747433398195963
2022-08-17 16:00:04 20973009007616032
2022-08-17 17:38:51 20993124881286978
2022-08-17 17:38:51 21000691401433170
2022-08-26 10:36:26 21000691450716350
2022-08-26 10:36:26 24169068668137543
2022-09-01 00:00:00 24189201369196762
…
2022-10-03 10:11:15 34295779513113914
Currently this is not supported, and even if it were, the cost would be very high: changing the primary key means migrating a large amount of data. For the export problem, you can reuse the earlier data-comparison approach (I don't know exactly how you are doing it now) and add a binary-search step in shell or Python: take the minimum and maximum IDs, split at the midpoint, and count the rows from the minimum to the midpoint. If the count is under 1 billion, export that range directly; if it is over, halve the range again and repeat. This guarantees each export stays under 1 billion rows; to export everything, just loop over the remaining ranges.
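The halving procedure above can be sketched in Python. This is a minimal illustration, not production code: a sorted in-memory ID list stands in for the table, `count_ids` stands in for the `SELECT COUNT(*)` range query, and the cap of 400 rows is illustrative (the real one would be 1 billion).

```python
import bisect

def count_ids(ids, lo, hi):
    """Stand-in for: SELECT COUNT(id) FROM t WHERE id >= lo AND id < hi."""
    return bisect.bisect_left(ids, hi) - bisect.bisect_left(ids, lo)

def export_ranges(ids, cap):
    """Yield contiguous [start, end) ID ranges holding at most `cap` rows each."""
    lo, maxid = ids[0], ids[-1]
    while lo <= maxid:
        hi = maxid + 1
        # Halve the candidate range until its row count fits within the cap.
        while count_ids(ids, lo, hi) > cap:
            hi = lo + (hi - lo) // 2
        yield (lo, hi)
        lo = hi  # next range starts where this one ended

# Sparse IDs with a huge jump, like the production table above.
ids = list(range(1, 1001)) + list(range(10**15, 10**15 + 500))
ranges = list(export_ranges(ids, 400))
# Every range fits the cap, and together they cover every row exactly once.
assert all(count_ids(ids, a, b) <= 400 for a, b in ranges)
assert sum(count_ids(ids, a, b) for a, b in ranges) == len(ids)
```

Note that huge ID gaps cost only extra halving steps (and the occasional empty range), never a missed or duplicated row, since the ranges are contiguous.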
My current method is quite time-consuming. Since the IDs keep increasing, whenever I hit a skipped ID I keep incrementing the ID value and checking the creation date at each candidate, which tells me where the gap ends. I then pick a start and end ID whose range holds a bounded number of rows, for example about 500 million, and export that range in one pass.
select count(id) from xxx where id > 5730055060275318 and id <= 5730055600000000;
-- Result: 460307296
select created_time from xx where id = 5730055060275318;
-- Result: 2022-07-19 04:50:39
select created_time from produce_param where id = 5730055600000000;
-- Result: 2022-07-21 06:58:31
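Those three probe queries can be wrapped in a small script. A minimal Python sketch, with a dict standing in for the table (the IDs and timestamps below are toy values, not the real ones above):

```python
def probe_range(created_time_by_id, start_id, end_id):
    """Mimic the probe queries above: row count in (start_id, end_id]
    plus the creation timestamps at both boundary IDs."""
    count = sum(1 for i in created_time_by_id if start_id < i <= end_id)
    return (count,
            created_time_by_id.get(start_id),   # None if the ID is a gap
            created_time_by_id.get(end_id))

# Toy table: IDs with a gap between 101 and 500.
rows = {100: "2022-07-19", 101: "2022-07-19", 500: "2022-07-21"}
count, t0, t1 = probe_range(rows, 100, 500)
# count == 2 (IDs 101 and 500); t0 and t1 bracket the export window.
```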
The official documentation (AUTO_INCREMENT | PingCAP Docs) also explains the ID cache, but in our production environment the jumps were caused by test personnel loading test data into production.
The official position is that applications should not rely on auto-increment IDs being continuous; there is no way around that. If you want continuous auto-increment IDs, you can only run a single TiDB server at the front end.
How should we understand "there can only be one TiDB"? In my environment all tables are written by the same account, although there are many factories. Initially there were no restrictions on test data, so some test rows ended up with very large IDs, and as more data was inserted the IDs kept growing from there. After this repeated several times, we reached the situation described above. We have now restricted test data in production, but I am still wondering whether there is a good way to reset the table's IDs so that they are consecutive again.
Auto-increment IDs in TiDB are not continuous by default. If you need them to be continuous, the only option is to put a single TiDB server in front (for example, a cluster with 1 TiDB, 3 PD, and 5 TiKV nodes).
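The reason multiple TiDB servers produce non-consecutive IDs is that each server caches a whole batch of auto-increment values (30,000 by default) leased from a shared allocator. A toy simulation of that behavior (batch size 3 for readability; `BatchAllocator` is an illustrative model, not TiDB's actual implementation):

```python
class BatchAllocator:
    """Toy model of TiDB's per-server AUTO_INCREMENT cache:
    each server leases a whole batch of IDs from a shared counter."""
    def __init__(self, shared, batch=3):
        self.shared = shared          # mutable [next_global_id], shared by all servers
        self.batch = batch
        self.next_id = self.end = 0   # empty local cache

    def alloc(self):
        if self.next_id >= self.end:  # local cache exhausted: lease a fresh batch
            self.next_id = self.shared[0]
            self.end = self.next_id + self.batch
            self.shared[0] = self.end
        nid = self.next_id
        self.next_id += 1
        return nid

shared = [1]
tidb_a, tidb_b = BatchAllocator(shared), BatchAllocator(shared)
# Inserts alternating between two TiDB servers:
ids = [tidb_a.alloc(), tidb_b.alloc(), tidb_a.alloc(), tidb_b.alloc()]
# → [1, 4, 2, 5]: unique and roughly increasing, but not consecutive
```

With a single server there is only one cache, so (absent rollbacks or restarts) the IDs come out consecutive, which is the reasoning behind "only one TiDB at the front end".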
You can only drop the table and re-import the data to bring the IDs back down.
Consider using Snowflake IDs or another ID generator. Auto-increment IDs can also easily create write hotspots, and under frequent inserts conflicts may occur (I have hit this issue myself, though the exact cause is unclear).
Yes: a very large ID was inserted in the middle, and the auto-increment allocator continues from the largest existing value. There is currently no in-place fix; as noted above, you can only export and reload the data once.
Here is a range-comparison script I adapted at the time; see if it helps with your current issue:
#!/bin/bash
# Fill in connection details for your environment.
# -N -s: print bare values with no column headers, so the results
# can be used directly in arithmetic.
conn="-h127.0.0.1 -uroot -pXXXX -N -s"
db="test"
tbl="t"
batch=10     # max rows per exported range; raise to your real batch size

q() { mysql $conn -e "$1" "$db"; }

minid=$(q "select min(id) from $tbl")
maxid=$(q "select max(id) from $tbl")
ksid=$minid                          # ksid: start id of the current range
zjid=$(( (maxid - ksid) / 2 ))       # zjid: half-width of the current range
jzid=$(( ksid + zjid ))              # jzid: candidate end id

while true; do
    sycount=$(q "select count(*) from $tbl where id >= $ksid and id <= $maxid")
    if [[ "$sycount" -gt "$batch" ]]; then
        count=$(q "select count(*) from $tbl where id >= $ksid and id < $jzid")
        while true; do
            if [[ "$count" -gt "$batch" ]]; then
                # Too many rows: halve the range and re-count.
                zjid=$(( zjid / 2 ))
                jzid=$(( ksid + zjid ))
                count=$(q "select count(*) from $tbl where id >= $ksid and id < $jzid")
            else
                echo "select * from $tbl where id >= $ksid and id < $jzid"
                # Advance to the next range and restart the halving.
                ksid=$jzid
                zjid=$(( (maxid - ksid) / 2 ))
                jzid=$(( ksid + zjid ))
                break
            fi
        done
    else
        # Remaining rows fit in one batch: export up to maxid and stop.
        echo "select * from $tbl where id >= $ksid and id <= $maxid"
        exit 0
    fi
done