How to export CSV to a single file using Dumpling

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: dumpling导出csv文件怎么输出到一个文件 (How to make Dumpling output an exported CSV to a single file)

| username: 普罗米修斯

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.2.4
[Encountered Problem: Problem Description and Impact]
How can I export CSV with Dumpling so that it writes a single CSV file instead of many small CSV files?

| username: tidb菜鸟一只 | Original post link

The -F option is used to specify the maximum size of a single file; the default unit is MiB, and it accepts inputs like 5GiB or 8KB. If you want to use TiDB Lightning to load this file into a TiDB instance, it is recommended to keep the value of the -F option at 256 MiB or below.

| username: 像风一样的男子 | Original post link

The -F option is used to specify the maximum size of a single file; the default unit is MiB, and it accepts inputs like 5GiB or 8KB. If you want to use TiDB Lightning to load this file into a TiDB instance, it is recommended to keep the value of the -F option at 256 MiB or below.

If you want to write to a file directly, you can increase this parameter, but doing so will degrade write performance.
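Putting that advice together, here is a minimal sketch of an export aimed at few, large files; the connection flags, table name, and paths are placeholders, and tiup is assumed as the launcher:

```shell
# Sketch only: host/port/user, the table name, and the output path are placeholders.
# --filetype csv   write CSV instead of SQL
# -F 2GiB          let each output file grow to 2 GiB before rolling over
# (no -r)          leave intra-table chunking disabled so the table is not split
tiup dumpling \
  -u root -P 4000 -h 127.0.0.1 \
  --filetype csv \
  -T test.big_table \
  -F 2GiB \
  -o /tmp/dump
```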

| username: 普罗米修斯 | Original post link

I have already specified -F.

| username: 普罗米修斯 | Original post link

I specified -F 256M, but the output files are only around 100 KB each.

| username: zhanggame1 | Original post link

Try removing -r 10000 and setting -t 1.
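In command form, that suggestion looks roughly like this (connection flags and names are placeholders):

```shell
# Sketch: no -r (no intra-table chunking) and a single export thread.
tiup dumpling \
  -u root -P 4000 -h 127.0.0.1 \
  --filetype csv \
  -T test.big_table \
  -t 1 \
  -o /tmp/dump
```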

| username: 普罗米修斯 | Original post link

-t is the number of export threads. Is the export concurrency related to the many small files being produced?

| username: tidb菜鸟一只 | Original post link

I only exported one file…

| username: Fly-bird | Original post link

Set the rows-per-file and maximum file size to larger values.

| username: 普罗米修斯 | Original post link

I tested it. With the export parameters unchanged but different filter targets, some exports produce one large file while others produce dozens of small files.

| username: 普罗米修斯 | Original post link

For the export conditions that produce small files, doubling the -r and -F values still yields small files.

| username: 像风一样的男子 | Original post link

Each table is exported to its own file; a small table cannot share a file with other tables.

| username: 普罗米修斯 | Original post link

Both exports were of the same table.

| username: 有猫万事足 | Original post link

Set -F to a large value, and either remove -r or set it to 0.

If you want intra-table concurrency, the export necessarily writes multiple files; the chunking that -r enables is exactly what divides the output into smaller files. Writing everything to a single file would defeat the concurrency, right?

* -r is used to enable intra-table concurrency to speed up the export. The default value is 0, which means it is disabled; a value greater than 0 means it is enabled, and the value is of INT type. When the data source is TiDB, setting -r to a value greater than 0 makes Dumpling use TiDB region information to divide the chunks, which also reduces memory usage; in that case the specific value does not affect the division algorithm. For scenarios where the data source is MySQL and the table's primary key is of type INT, this parameter also enables intra-table concurrency.
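For contrast, a hedged sketch of an export with intra-table concurrency enabled; the values below are illustrative, not tuned:

```shell
# With -r > 0 against a TiDB source, Dumpling splits the table into chunks by
# region, so multiple output files per table are expected; -t sets the threads.
tiup dumpling \
  -u root -P 4000 -h 127.0.0.1 \
  --filetype csv \
  -T test.big_table \
  -r 200000 -t 8 \
  -F 256MiB \
  -o /tmp/dump
```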

| username: 普罗米修斯 | Original post link

This table is very large; without intra-table concurrency the export cannot finish. With the -r parameter it completes in 1 minute.

| username: 普罗米修斯 | Original post link

Here is the result after I removed the -r parameter and set -F to 2G for the export.

| username: 有猫万事足 | Original post link

It’s better to increase concurrency. After all, this kind of text output will most likely end up packed into a zip/tar anyway. :joy:
Multiple files also allow more concurrency when importing.

| username: 普罗米修斯 | Original post link

The CSV is exported for the product team to use, and they want it in a single file. It looks like some exports can be made into one file, but others still come out as many small files. I won’t research it further; after the export they can handle the merging themselves.
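If the consumers insist on one file, the chunks can be concatenated after the export. A sketch, assuming each chunk starts with the default CSV header row; the filename glob and paths are illustrative:

```shell
# Merge Dumpling CSV chunks into a single file, keeping only the first header.
first=1
for f in /tmp/dump/test.big_table.*.csv; do
  if [ "$first" -eq 1 ]; then
    cat "$f" > /tmp/dump/big_table_merged.csv   # keep header from the first chunk
    first=0
  else
    tail -n +2 "$f" >> /tmp/dump/big_table_merged.csv   # skip the header line
  fi
done
```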

| username: Soysauce520 | Original post link

You can increase the rows-per-file limit, though splitting into multiple files does make imports faster.

| username: tidb菜鸟一只 | Original post link

The main factor is the -r parameter. If your table spans many regions, the export will be divided into multiple files according to the region distribution.
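To gauge how many chunks region-based splitting might produce, the region count of the table can be checked first; the connection flags and table name below are placeholders:

```shell
# Roughly count the regions a table spans; more regions generally means more
# chunk files when -r is enabled (the output includes one header line).
mysql -h 127.0.0.1 -P 4000 -u root -e "SHOW TABLE test.big_table REGIONS;" | wc -l
```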