Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 大小写敏感问题
[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version]
[Reproduction Path]
Data exported from 5.4.3 using dumpling and migrated to 7.5.1. When importing through lightning, an error “Duplicate entry” occurs, indicating duplicate data.
The table’s collation is COLLATE=utf8_general_ci.
In version 5.4.3, it is case-sensitive. Below is a screenshot of the old table’s data:
However, in version 7.5.1, it is case-insensitive and cannot be inserted. What is the issue here?
What are the table validation rules in the new 7.5.1 environment?
Starting from version 4.0, TiDB supports collation rules, but they are disabled by default. The corresponding parameter is new_collation_enabled
.
I checked that the old version has new_collation_enabled
set to false by default, and the collation is utf8_general_ci
, so the collation does not take effect and it defaults to case sensitivity.
In the new cluster, which is version 7.5.1, new_collation_enabled
is true by default, and the collation is utf8_general_ci
, so the collation takes effect, meaning it is case insensitive.
I changed the switch to default case sensitivity, but it still reports an error when importing.
update mysql.tidb set VARIABLE_VALUE = ‘False’ where VARIABLE_NAME = ‘new_collation_enabled’ limit 1;
It seems that this takes effect upon initialization, and any changes made afterward are ineffective. You can only reinitialize the cluster, or try pausing and then modifying the collation of the corresponding downstream tables before restarting?
Mark: Precautions for Version Upgrade
Should we communicate with the business about this? The actual MAC address of the network card does not have lowercase letters, right? Also, can the issue of duplicate values in TiDB Lightning be resolved through conflict data detection configuration?
Then we can only reinstall the database.
I don’t understand, is it caused by this parameter? Can you provide a test example?
It should be the character set rules, will the collation also have an impact?
The parameter new_collations_enabled_on_first_bootstrap
is set during the first initialization of the cluster. In older versions, the default is false, which means that case sensitivity is not supported for modifying the collation of a table. In the new version, new_collations_enabled_on_first_bootstrap
is true by default, allowing you to set the collation rules for a table, meaning you can specify whether a table is case-sensitive or not. For my table, I set it to utf8_general_ci
, which is case-insensitive. This issue arises because our data is disordered.
Collation determines whether case sensitivity is applied.
You can do this by directly modifying the backup content of Dumpling and changing the abnormal character set to utf8_bin.
I did this when synchronizing MySQL 8 with DM 
Then you just need to change all the tables to the _bin collation. 
When it comes to migration, character sets and collation parameters are very important 