TiDB Lightning: Data Size Differs Significantly Between the Two Import Modes. What Causes This?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: lighting 两种数据导入模式,导入后数据差异很大,什么原因导致的啊

| username: xingzhenxiang

【TiDB Usage Environment】Testing

Logical Import
(screenshot not preserved)
Physical Import


The row counts are the same.

However, the data size and the Region data volume differ significantly. What is the reason for this?

| username: come_true | Original post link

Logical import only includes data, while physical import includes objects and other elements, making the physical import larger.

| username: come_true | Original post link

I misread. Are you saying the data is larger after the logical import?

| username: xingzhenxiang | Original post link

So, I’m looking for the reason.

| username: xingzhenxiang | Original post link

Yes, logical import is larger than physical import, so I want to ask what the reason is.

| username: come_true | Original post link

That might mean data loss.

| username: xingzhenxiang | Original post link

It’s possible. I’m more interested in knowing whether the physical import is compressed.

| username: come_true | Original post link

Compression is controlled by specific parameters. If you exported and imported in the test environment, check whether any compression parameters were added to the command.

| username: xingzhenxiang | Original post link

No compression parameters were added.

| username: zhanggame1 | Original post link

First, check the data volume.

| username: xingzhenxiang | Original post link

I’m running batch statistics now. One large table has identical data records in both imports.

| username: xingzhenxiang | Original post link

The statistics are done, and the md5sum of the result files matches. That makes this even stranger.

| username: okenJiang | Original post link

TiKV itself compresses data, and the physical mode of Lightning sorts the data before importing it into TiKV, so the compression effect is better than the logical mode.

If the checksum is successful, you don’t need to worry about data loss. It’s normal to have discrepancies in the final storage size in TiKV with different insertion methods.
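
The effect of ordering on compression can be illustrated with a small Python sketch. This is not how TiKV or RocksDB compresses data: it uses `zlib` from the standard library as a stand-in, and the row layout in `make_rows` and the `BLOCK_SIZE` value are made up for the demo. It compresses the same rows in independent fixed-size blocks, once in key order and once shuffled, and prints the totals.

```python
import random
import zlib

BLOCK_SIZE = 16 * 1024  # hypothetical block size, chosen only for this demo


def make_rows(n):
    """Build n made-up key/value rows; the layout is an assumption."""
    return [f"t10_r{i:012d}:user-{i // 50:08d}\n".encode() for i in range(n)]


def compressed_size(rows):
    """Concatenate rows, compress each fixed-size block on its own, return total bytes."""
    data = b"".join(rows)
    total = 0
    for start in range(0, len(data), BLOCK_SIZE):
        total += len(zlib.compress(data[start:start + BLOCK_SIZE]))
    return total


rows = make_rows(20000)

# Same rows, but in a random (seeded, reproducible) order.
shuffled = rows[:]
random.Random(42).shuffle(shuffled)

raw_size = sum(len(r) for r in rows)
sorted_size = compressed_size(sorted(rows))
shuffled_size = compressed_size(shuffled)

print(f"raw bytes:            {raw_size}")
print(f"compressed, sorted:   {sorted_size}")
print(f"compressed, shuffled: {shuffled_size}")
```

The content is identical in both cases, so any checksum over the rows would match; only the compressed size changes, because neighbouring rows in sorted order share long prefixes that the compressor can reuse.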

| username: xingzhenxiang | Original post link

Could you please provide the specific documentation? Thank you.

| username: tidb菜鸟一只 | Original post link

The physical mode writes directly to TiKV, while the logical mode goes through the TiDB server, which then writes to TiKV. By comparison, the physical mode is faster and occupies less space, but it also has more restrictions.

| username: xingzhenxiang | Original post link

I see the restrictions. That’s why there are both logical and physical modes, right? The physical mode takes up less space. Is there a specific document explaining this? Thanks.

| username: h5n1 | Original post link

Physical import ingests sorted SST files without duplicates directly, so the data is denser and there are fewer Regions, because each Region holds more data. Logical import writes through the LSM tree, where data is merged from the upper levels down. Ideally, if both methods only insert data, the difference shouldn’t be significant. However, Region splits occur during the import, and not every Region ends up with the same amount of data, which leads to more Regions. My guess is that Region splits might also leave redundant SST files behind (even ones with no data). You can try compacting the logically imported cluster to see how much the size drops (it may take several executions):

tikv-ctl --host xxxx:20160 compact -c write -d kv --bottommost force --threads 8
tikv-ctl --host xxxx:20160 compact -c default -d kv --bottommost force --threads 8
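
Since the compaction has to be repeated for each TiKV store and for both column families, a small Python helper can generate the commands. This is a minimal sketch: the flags are copied from the two commands above, the host names are placeholders, and `build_compact_commands` is a hypothetical helper name. It only prints the commands (a dry run) and does not execute anything.

```python
import shlex

# Column families taken from the two commands above.
COLUMN_FAMILIES = ("write", "default")


def build_compact_commands(hosts, threads=8):
    """Return one tikv-ctl compact command (as an argv list) per host and column family."""
    commands = []
    for host in hosts:
        for cf in COLUMN_FAMILIES:
            commands.append([
                "tikv-ctl", "--host", host, "compact",
                "-c", cf, "-d", "kv",
                "--bottommost", "force",
                "--threads", str(threads),
            ])
    return commands


# Hypothetical TiKV addresses; replace them with the real store addresses.
hosts = ["tikv-1:20160", "tikv-2:20160"]
for argv in build_compact_commands(hosts):
    # Dry run: print the command instead of executing it.
    print(shlex.join(argv))
```

To actually run a command, each argv list could be passed to `subprocess.run(argv, check=True)`; review the printed commands first, because a forced bottommost compaction adds I/O load to the store.
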

| username: xingzhenxiang | Original post link

The logical import was done on the same cluster, and that data has already been cleaned up.

| username: 随缘天空 | Original post link

Actually, there’s no need to worry about this. As long as the total amount of data and the data content are consistent, it’s fine. Different import modes may work on different principles or use different compression methods, so discrepancies in size are normal. Even with the same import method, the stored size might differ from one run to the next. You can use the sync-diff-inspector tool to compare the data.

| username: WalterWj | Original post link

Lightning’s local (physical) mode imports data that has already been compacted, whereas data imported in logical mode has to wait for the cluster to compact and compress it.