Low import speed using TiSpark. How can we optimize cluster performance and increase throughput?

TiDB version:



Low speed of data import using TiSpark. Suboptimal cluster configuration

Resource allocation:

4 hosts. Characteristics of each:

  • 2 × Intel Xeon 2.2 GHz CPUs (40 cores, 80 threads)
  • 768 GB RAM
  • 10 HDDs, 5.5 TB each
  • 2 network interface cards, 10 Gbit/s each

Cluster Configuration:
Host 1: 3 PD
Host 2: 1 TiDB, 5 TiKV, 1 TiFlash
Host 3: 1 TiDB, 4 TiKV, 1 TiFlash
Host 4: 1 TiDB, 4 TiKV, 1 TiFlash

The topology of the deployed TiDB cluster is given in the cluster_template.yaml file.
cluster_template.yaml (8.8 KB)

Detailed description of the problem:

We are trying to import data from another system using TiSpark, following the instructions, but the import is slow.

For example: the customer table from the TPC-DS test suite occupies about 30 GB in text format and contains 65,000,000 rows. With the current TiDB cluster configuration and 3 Spark executors configured, importing this table takes about 30 minutes, i.e. roughly 17 MB/s and 36,000 rows/s.

How can I increase this speed? Perhaps it is worth optimizing the cluster configuration?

Based on the information provided, it seems that the cluster configuration is not the bottleneck for the slow import speed. The hardware resources allocated to the cluster are more than sufficient for the import task.

One possible reason for the slow import speed could be the default configuration of TiSpark. By default, TiSpark uses a small number of Spark partitions, which may not be enough to fully utilize the available cluster resources.

To increase the import speed, you can try adjusting the following TiSpark configuration parameters:

  • spark.sql.shuffle.partitions: This parameter controls the number of Spark partitions used for shuffling data during the import process. Increasing this value can help improve the import speed. You can try setting it to a value that is equal to or greater than the number of TiKV regions in your cluster.

  • spark.default.parallelism: This parameter controls the default number of Spark partitions used for parallel processing. You can try setting it to a value that is equal to or greater than the number of CPU cores in your cluster.

You can set these parameters in the spark-defaults.conf file or pass them as command-line arguments when submitting the TiSpark job.
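For illustration only (the right values depend on your TiKV region count and total executor cores, which are not given in the post, so the numbers below are placeholders), the two settings could look like this in spark-defaults.conf:

```
# spark-defaults.conf -- example values only; tune to your cluster
# at least the number of TiKV regions holding the target data
spark.sql.shuffle.partitions   512
# roughly the total number of executor cores, or a small multiple of it
spark.default.parallelism      240
```

The same settings can be passed at submit time, e.g. `spark-submit --conf spark.sql.shuffle.partitions=512 --conf spark.default.parallelism=240 ...`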

Additionally, you can try using TiDB Lightning for data import instead of TiSpark. TiDB Lightning is a standalone tool that can import data into TiDB at a much faster speed than TiSpark. You can refer to the documentation for more information on how to use TiDB Lightning.
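As a rough sketch of what a TiDB Lightning run might look like (all paths and addresses below are placeholders, not taken from your topology file), a minimal tidb-lightning.toml using the local backend could be:

```toml
# tidb-lightning.toml -- minimal sketch; paths and addresses are placeholders
[tikv-importer]
# the "local" backend encodes rows into SST files and ingests them into TiKV
# directly, bypassing the SQL layer, which is why it is much faster
backend = "local"
sorted-kv-dir = "/path/to/sorted-kv-dir"

[mydumper]
# directory containing the source data files (CSV or SQL dumps)
data-source-dir = "/path/to/data"

[tidb]
host = "tidb-host"
port = 4000
user = "root"
status-port = 10080
pd-addr = "pd-host:2379"
```

It would then be started with `tidb-lightning --config tidb-lightning.toml`. Note that the local backend requires the target tables to be empty and takes the cluster's ingest path, so it is intended for initial bulk import rather than incremental loads.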