Case Sensitivity Issue

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 大小写敏感问题

| username: 像风一样的男子

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version]
[Reproduction Path]
Data exported from 5.4.3 using dumpling and migrated to 7.5.1. When importing through lightning, an error “Duplicate entry” occurs, indicating duplicate data.
The table’s collation is COLLATE=utf8_general_ci.
In version 5.4.3, it is case-sensitive. Below is a screenshot of the old table’s data:


However, in version 7.5.1, it is case-insensitive and cannot be inserted. What is the issue here?

| username: 小龙虾爱大龙虾 | Original post link

What are the table validation rules in the new 7.5.1 environment?

| username: 像风一样的男子 | Original post link

Starting from version 4.0, TiDB supports collation rules, but they are disabled by default. The corresponding parameter is new_collation_enabled.

I checked that the old version has new_collation_enabled set to false by default, and the collation is utf8_general_ci, so the collation does not take effect and it defaults to case sensitivity.
In the new cluster, which is version 7.5.1, new_collation_enabled is true by default, and the collation is utf8_general_ci, so the collation takes effect, meaning it is case insensitive.

| username: 像风一样的男子 | Original post link

I changed the switch to default case sensitivity, but it still reports an error when importing.
update mysql.tidb set VARIABLE_VALUE = ‘False’ where VARIABLE_NAME = ‘new_collation_enabled’ limit 1;

| username: 林夕一指 | Original post link

It seems that this takes effect upon initialization, and any changes made afterward are ineffective. You can only reinitialize the cluster, or try pausing and then modifying the collation of the corresponding downstream tables before restarting?

| username: MrSylar | Original post link

Mark: Precautions for Version Upgrade

| username: tony5413 | Original post link

Should we communicate with the business about this? The actual MAC address of the network card does not have lowercase letters, right? Also, can the issue of duplicate values in TiDB Lightning be resolved through conflict data detection configuration?

| username: 像风一样的男子 | Original post link

Then we can only reinstall the database.

| username: 小龙虾爱大龙虾 | Original post link

:thinking: I don’t understand, is it caused by this parameter? Can you provide a test example?

| username: zhaokede | Original post link

It should be the character set rules, will the collation also have an impact?

| username: 像风一样的男子 | Original post link

The parameter new_collations_enabled_on_first_bootstrap is set during the first initialization of the cluster. In older versions, the default is false, which means that case sensitivity is not supported for modifying the collation of a table. In the new version, new_collations_enabled_on_first_bootstrap is true by default, allowing you to set the collation rules for a table, meaning you can specify whether a table is case-sensitive or not. For my table, I set it to utf8_general_ci, which is case-insensitive. This issue arises because our data is disordered.

| username: 像风一样的男子 | Original post link

Collation determines whether case sensitivity is applied.

| username: 林夕一指 | Original post link

You can do this by directly modifying the backup content of Dumpling and changing the abnormal character set to utf8_bin.
I did this when synchronizing MySQL 8 with DM :sunglasses:

| username: 小龙虾爱大龙虾 | Original post link

Then you just need to change all the tables to the _bin collation. :joy_cat:

| username: 濱崎悟空 | Original post link

When it comes to migration, character sets and collation parameters are very important :grin: