When importing CSV with Lightning, it cannot correctly distinguish between empty strings and null values ("" and NULL)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: lightning导入CSV时,无法正确区分处理空字符串和空值(""和NULL)

| username: Prest13

[TiDB Usage Environment] Production Environment / Testing / Poc
Data Migration Import
[TiDB Version]
v7.1.0, v7.1.1, and v7.1.2
[Reproduction Path] Operations performed to reproduce the issue
Follow the standard process to import the CSV file, as shown in the CSV and Lightning configuration files below:
light.toml (429 bytes)
lightningtest.NTEST.csv (60 bytes)
[Encountered Issue: Problem Description and Impact]
In the CSV file, a comma is used as the field separator, and double quotes are used as the delimiter.
The table structure for testing is as follows:
create table NTEST(id bigint primary key, name varchar(32), remark varchar(128));

The expected result is that the CSV originates from another non-MYSQL database, where the same column contains both empty strings (i.e., “”) and NULL values. After import, “” should be treated as an empty string in TiDB, and NULL values should be treated as NULL.

The actual result is that both empty strings and NULL values are treated as NULL values and stored in the TiDB database, which does not match the expectation.
image

[Resource Configuration]
Not applicable
[Attachments: Screenshots/Logs/Monitoring]

| username: 胡杨树旁 | Original post link

In TiDB, null and ‘’ are not distinguished; both are considered null.

| username: 小龙虾爱大龙虾 | Original post link

The Lightning tool can distinguish between null values and empty strings. You can try these two parameters:

Whether the CSV file contains NULL.

If not-null = true, no columns in the CSV can be parsed as NULL.

not-null = false

If not-null = false (i.e., the CSV can contain NULL),

fields with the following value will be parsed as NULL.

null = ‘\N’

Whether to escape "" within fields

| username: Prest13 | Original post link

This is definitely unreasonable. In the end, we also used the method of replacing and then updating to write ‘’ into TiDB.

| username: Prest13 | Original post link

In the scenario I simulated, in the lightning configuration file:

Regarding the not-null parameter:
not-null is set to false because the CSV actually comes from another non-MySQL ecosystem database, and the NULL values need to be migrated to TiDB.

Regarding the null=‘\N’ parameter:
During our testing, since the source database exports NULL values to CSV as two consecutive delimiters, i.e., two commas ,,, if null=‘\N’, these NULL values cannot be processed and imported into TiDB. Therefore, null needs to be set to ‘’.

In summary, this import is limited by the CSV exported from the source database, where the exported CSV can clearly distinguish between empty strings and NULL values (,“”, and ,). We hope that lightning can also distinguish and process these values into TiDB according to this logic, but we have not found a feasible configuration solution.

Due to the project’s time constraints, we bypassed this issue by using sed for batch replacement and then performing a batch update after importing into TiDB.

| username: dba远航 | Original post link

We have encountered the same issue.

| username: 像风一样的男子 | Original post link

It is indeed a functional deficiency. If the data volume is large, manual replacement becomes very troublesome.

| username: Jellybean | Original post link

You can post an issue in the question section. If there are indeed optimization points in terms of functionality, it should be optimized in the future.

| username: 小龙虾爱大龙虾 | Original post link

It is necessary to recognize empty strings and null values according to the settings of Lightning.

| username: zhanggame1 | Original post link

It should be improved.

| username: zhaokede | Original post link

I’ve encountered this issue before. If the quantity is small, it’s generally better to fill the null values with a special string at the source and then process it after importing.

| username: oceanzhang | Original post link

I’ve encountered this before; you can only handle it manually for now.

| username: okenJiang | Original post link

Lightning indeed handles “”, and , in the same way :smiling_face_with_tear:

| username: andone | Original post link

This is quite troublesome to handle, it seems it can only be solved manually :thinking:

| username: Prest13 | Original post link

I would like to ask if there are any plans to solve this problem in this scenario in the future, or how to submit a request.

It seems that solving it at the code technical level shouldn’t be too complicated.

| username: okenJiang | Original post link

There is a solution for MySQL data sources; you can use Dumpling to specify the exported null values during the export process. Dumpling Overview | PingCAP Docs

For non-MySQL data sources, there doesn’t seem to be a plan at the moment. You can try submitting an issue at Sign in to GitHub · GitHub.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.