The -r parameter does not work when using Dumpling for export

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 使用 Dumpling 导出-r参数参数不生效 (The -r parameter does not take effect when exporting with Dumpling)

| username: 麻烦是朋友

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] V6.5.2
[Reproduction Path] What operations were performed to encounter the issue
[Encountered Issue: Issue Phenomenon and Impact]
The -r parameter of Dumpling export is not effective
Export command parameter settings:
--filetype csv --sql "select ****
WHERE
  b.accounts_time BETWEEN concat(date_add(curdate() - DAY(curdate()) + 1, INTERVAL -1 MONTH), ' 09:00:00')
  AND concat(date_add(curdate(), INTERVAL -DAY(curdate()) + 1 DAY), ' 08:59:59')
  AND b.active = 1;
" -r 500 --output-filename-template="bill_center_send_$(date '+%Y-%m-%d').{{.Index}}"
The intent was to limit each exported file to 500 rows with -r 500.
The official documentation gives two explanations:
-r is used to enable concurrent acceleration within the table for export. The default value is 0, which means it is not enabled. A value greater than 0 means it is enabled, and the value is of INT type. When the data source is TiDB, setting the -r parameter greater than 0 means using TiDB region information to divide intervals, while also reducing memory usage. The specific value does not affect the division algorithm. For MySQL data sources where the table’s primary key is INT, this parameter also has an intra-table concurrency effect.
The -F option is used to specify the maximum size of a single file, in MiB, and can accept inputs like 5GiB or 8KB. If you want to use TiDB Lightning to load this file into a TiDB instance, it is recommended to keep the value of the -F option at 256 MiB or below.
Note
If the size of the exported single table exceeds 10 GB, it is strongly recommended to use the -r and -F parameters.
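For reference, a minimal invocation sketch that combines -r and -F as the documentation recommends (the connection flags and output directory below are placeholder assumptions, not from the original post):

  # A minimal sketch, assuming a local TiDB at 127.0.0.1:4000 and /tmp/dump
  # as the output directory (all placeholders):
  #   -t 8        number of export threads
  #   -r 200000   any value > 0 enables intra-table concurrency on a TiDB source
  #   -F 256MiB   cap each output file at 256 MiB
  dumpling -u root -P 4000 -h 127.0.0.1 -o /tmp/dump -t 8 -r 200000 -F 256MiB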

Improve Dumpling’s export efficiency through concurrency

By default, the exported files will be stored in the ./export-<current local time> directory. Common options are as follows:

  • -t is used to specify the number of threads for export. Increasing the number of threads will increase Dumpling’s concurrency and improve export speed, but it will also increase database memory consumption, so it should not be set too high.
  • The -r option is used to specify the maximum number of records in a single file, that is, the number of database rows per file. Once enabled, Dumpling enables intra-table concurrency to speed up the export of large tables. When the upstream is TiDB v3.0 or later, setting -r greater than 0 makes Dumpling divide the work using TiDB region information, and the specific value does not affect the division algorithm. For a MySQL upstream where the table's primary key is INT, this parameter also provides intra-table concurrency.
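A quick way to check what those options actually produced (reusing the placeholder output directory from the sketch above):

  # Inspect how the export was split across files.
  ls -lh /tmp/dump/        # with -r > 0 on TiDB, expect roughly one file per region
  wc -l /tmp/dump/*.csv    # row count per exported CSV file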

Experimental test results:
[2024/01/04 18:00:44.433 +08:00] [INFO] [collector.go:255] ["backup success summary"] [total-ranges=1] [ranges-succeed=1] [ranges-failed=0] [total-take=948.487739ms] [total-rows=1343] [total-kv-size=487.2kB] [average-speed=513.7kB/s]
[2024/01/04 18:00:44.434 +08:00] [INFO] [client.go:779] ["[pd] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2024/01/04 18:00:44.434 +08:00] [INFO] [client.go:716] ["[pd] exit tso dispatcher"] [dc-location=global]
[2024/01/04 18:00:44.434 +08:00] [INFO] [main.go:81] ["dump data successfully, dumpling will exit now"]
Exported 1343 rows and found that the file was not split.
total 488
-rw-rw-r-- 1 tidb tidb 487197 Jan 4 18:00 bill_center_send_2024-01-04.000000000.csv
The number of rows in the single file, viewed in vim with :set nu (the last line):
1344 "444433330189"
The instructor in the tutorial said that -r sets the maximum number of rows in a single file and -F sets the maximum file size, and whichever limit is reached first takes effect.
How should we understand the -r parameter here, and how can we limit the number of rows in an exported file?

| username: 江湖故人 | Original post link

The rows here refer to rows in the source database, not rows in the export file. I understand it as similar to using LIMIT in a SELECT statement to reduce the memory usage of a single task; combined with -t, it can increase concurrency.

| username: Kongdom | Original post link

It seems fine. The command specifies -r as 200,000, but the actual exported file has 1,343 lines, which is less than 200,000 lines.

| username: FutureDB | Original post link

I tried using the -r parameter, but it didn’t work. However, using the -F parameter works.

| username: dba-kit | Original post link

Note this sentence: when exporting data from TiDB, as long as -r is greater than 0 the specific value does not matter; Dumpling partitions by region information. That is, when -r is in use, the values of both -r and -F are effectively ignored and will not affect the size of the generated files.

PS: Only when -F is used alone does Dumpling behave like mydumper, splitting the SQL file at the set size. However, each table is then exported by only one thread, and intra-table concurrency is not enabled, so for large tables the export efficiency is much lower than with -r.
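For illustration, a size-only split with -F alone might look like the following sketch (connection flags, output directory, and the 64 MiB threshold are placeholder assumptions):

  # -F alone splits output files at the size threshold, mydumper-style,
  # but each table is then exported by a single thread (no intra-table
  # concurrency), so large tables export more slowly than with -r.
  dumpling -u root -P 4000 -h 127.0.0.1 -o /tmp/dump -F 64MiB --filetype csv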

| username: Jellybean | Original post link

The documentation is indeed not accurate. From the actual effect, -r is mainly used to enable concurrent acceleration within the table for export, functioning as a 0/1 switch.

| username: FutureDB | Original post link

In that case, the actual effect of the -r parameter is to enable concurrent acceleration within the table for export, while the -F parameter is the one that can truly control the size of each exported file. So, in this sense, the -r parameter and the -t parameter have somewhat overlapping functions, as the -t parameter can also specify the exact number of export threads.

| username: Kongdom | Original post link

:thinking: Here’s another documentation issue discovered, progress +1

| username: dba-kit | Original post link

It's different; see my answer above. When the export source is TiDB, as long as -r is enabled, the value of -F is also ignored. -t is the number of threads, i.e., how many regions are exported concurrently.

| username: dba-kit | Original post link

Actually, the explanation in the dumpling help is quite clear: "For TiDB v3.0+, specifying this will make dumpling split the table with each file corresponding to one TiDB region (regardless of the number of rows)."
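One way to sanity-check that behavior is to compare the table's region count with the number of exported files (a sketch; db.tbl, the connection flags, and the output path are placeholders):

  # Count the regions backing the table; with -r > 0 the export should
  # yield roughly one file per region. Note that wc -l also counts the
  # one-line column header of the result set.
  mysql -h 127.0.0.1 -P 4000 -u root -e "SHOW TABLE db.tbl REGIONS;" | wc -l
  ls /tmp/dump/*.csv | wc -l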

| username: 路在何chu | Original post link

I encountered this problem before as well, but didn’t take it seriously.

| username: 麻烦是朋友 | Original post link

There are 1,343 rows and I set -r to 500: shouldn't it have split into 3 files for me?

| username: 麻烦是朋友 | Original post link

You can watch the tutorial video; it mentions controlling the exported file size by the number of rows.

| username: 烂番薯0 | Original post link

-r seems to be the maximum number of SQL statements imported at one time.

| username: 麻烦是朋友 | Original post link

The -r/--rows option not working is a Dumpling product bug, but it does not affect normal use. If you need file splitting, try the -F option.

| username: zhanggame1 | Original post link

Learned.

| username: dba-kit | Original post link

Your statement is actually incorrect. It is explained in the documentation that -r has a special meaning for TiDB. When exporting data from MySQL, -r is still effective.

| username: dba-kit | Original post link

Your choice of this reply as a solution is potentially misleading.

| username: 哈喽沃德 | Original post link

--threads/-r followed by an integer value

| username: Kongdom | Original post link

:joy: Can anyone who has tested it share the real results?