How to Improve the Slow Write Speed of TiSpark?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tispark的写入速度很慢,如何提升?

| username: 钟鸣泽pk

Version

TiSpark 3.3_2.12-3.1.5
TiDB 7.1
Spark 3.3.1

TiDB Deployment Topology

[Image: deployment topology screenshot]

Problem Description

While writing data with TiSpark recently, I found the speed to be very slow. Has the community evaluated what a normal write speed should be, or is there a reference value? The job has been running for over an hour and the write is still not complete. The data source is the TPC-DS 100 store_sales table, with 287,997,024 rows. Is there any configuration that needs to be enabled? The speed is far below what I expected.

Spark submission command

bin/spark-shell --master yarn --executor-cores 2 --executor-memory 6g --num-executors 10 

Scala code

    // tidbOptions holds the TiDB connection options and is defined elsewhere in the session
    df.write.
      format("tidb").
      option("database", "test").
      option("table", "store_sales").
      options(tidbOptions).
      mode("append").
      save()
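
The snippet above passes `tidbOptions` without showing how it is defined. A minimal sketch of what such a map might look like follows, assuming the connection keys `tidb.addr`, `tidb.port`, `tidb.user`, and `tidb.password` from the TiSpark data source documentation. The values, the `TiDBWriteOptions` object, and the `missingKeys` helper are hypothetical and are not taken from the original post.

```scala
object TiDBWriteOptions {
  // Hypothetical connection values; replace them with those of your own cluster.
  val tidbOptions: Map[String, String] = Map(
    "tidb.addr"     -> "127.0.0.1", // TiDB server host (assumed)
    "tidb.port"     -> "4000",      // TiDB server port (assumed)
    "tidb.user"     -> "root",      // TiDB user (assumed)
    "tidb.password" -> ""           // TiDB password (assumed empty)
  )

  // Keys this sketch treats as required before attempting a write.
  val requiredKeys: Set[String] =
    Set("tidb.addr", "tidb.port", "tidb.user", "tidb.password")

  // Returns the required keys that are absent from the given options.
  def missingKeys(options: Map[String, String]): Set[String] =
    requiredKeys.diff(options.keySet)
}
```

In a spark-shell session, `TiDBWriteOptions.missingKeys(TiDBWriteOptions.tidbOptions)` should be empty before the map is passed to `options(...)`.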

In the end, errors were reported, but the data was written successfully. The write speed was approximately 50,000 rows/s.
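
As a rough sanity check, the row count and the observed rate reported in this post can be combined into an expected duration. The sketch below only does that arithmetic; the `WriteThroughput` object and its member names are made up for illustration. At 50,000 rows/s, 287,997,024 rows take roughly 96 minutes, which matches the "over an hour" observation.

```scala
object WriteThroughput {
  // Figures taken from the post: row count of store_sales and the observed write rate.
  val totalRows: Long = 287997024L
  val rowsPerSecond: Double = 50000.0

  // Estimated wall-clock seconds needed to write `rows` at a steady `rate` (rows/s).
  def estimatedSeconds(rows: Long, rate: Double): Double = {
    require(rate > 0, "rate must be positive")
    rows / rate
  }

  // Same estimate expressed in minutes.
  val estimatedMinutes: Double = estimatedSeconds(totalRows, rowsPerSecond) / 60.0
}
```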


      24/02/02 15:16:28 WARN KVErrorHandler: Stale Epoch encountered for region [{Region[4502900364] ConfVer[34877] Version[65155] Store[402369509] KeyRange[t\200\000\000\000\000\0019E]:[t\200\000\377\377\377\377\377\374_r\200\000\000\000\000\000\351\305]}]
24/02/02 15:16:28 WARN KVErrorHandler: Failed to send notification back to driver since CacheInvalidateCallBack is null in executor node.


13.shade.io.grpc.StatusRuntimeException: UNKNOWN: region 4502900364 is hot
14.shade.io.grpc.StatusRuntimeException: UNKNOWN: region 4502900364 is hot
15.shade.io.grpc.StatusRuntimeException: UNKNOWN: region 4502900364 is hot
24/02/02 15:16:27 WARN TiSession: failed to scatter region: 4502900364
com.pingcap.tikv.exception.GrpcException: retry is exhausted.

| username: ShawnYan | Original post link

Please also share the table structure and the number of TiDB nodes.

| username: zhang_2023 | Original post link

What is the architecture like?

| username: 钟鸣泽pk | Original post link

CREATE TABLE store_sales (
ss_sold_date_sk INT,
ss_sold_time_sk INT,
ss_item_sk INT,
ss_customer_sk INT,
ss_cdemo_sk INT,
ss_hdemo_sk INT,
ss_addr_sk INT,
ss_store_sk INT,
ss_promo_sk INT,
ss_ticket_number BIGINT,
ss_quantity INT,
ss_wholesale_cost DECIMAL(10,2),
ss_list_price DECIMAL(10,2),
ss_sales_price DECIMAL(10,2),
ss_ext_discount_amt DECIMAL(10,2),
ss_ext_sales_price DECIMAL(10,2),
ss_ext_wholesale_cost DECIMAL(10,2),
ss_ext_list_price DECIMAL(10,2),
ss_ext_tax DECIMAL(10,2),
ss_coupon_amt DECIMAL(10,2),
ss_net_paid DECIMAL(10,2),
ss_net_paid_inc_tax DECIMAL(10,2),
ss_net_profit DECIMAL(10,2)
);

| username: 钟鸣泽pk | Original post link

I have added the requested information.

| username: wfxxh | Original post link

Could you please share the disk specifications and the TiDB configuration parameters?

| username: TiDBer_jYQINSnf | Original post link

Check the monitoring, specifically the TiKV monitoring, to see where the bottleneck is. How many TiKV nodes are there, and how are they configured?

| username: zhaokede | Original post link

It still depends on the architecture and the hardware. For example, on HDDs with limited I/O, writes will definitely be slow.

| username: dba远航 | Original post link

It feels okay, not that slow.

| username: xfworld | Original post link

With hardware this good, why use a mixed deployment… :rofl:

Plan it properly~

The table structure does not handle data skew well, which leads to hotspots:

13.shade.io.grpc.StatusRuntimeException: UNKNOWN: region 4502900364 is hot
14.shade.io.grpc.StatusRuntimeException: UNKNOWN: region 4502900364 is hot
15.shade.io.grpc.StatusRuntimeException: UNKNOWN: region 4502900364 is hot

These need to be adjusted one by one, which is quite troublesome.
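
One commonly suggested adjustment for tables without a primary key, such as the `store_sales` definition posted above, is to scatter the implicit row IDs with TiDB's `SHARD_ROW_ID_BITS` table option. The sketch below only builds the corresponding `ALTER TABLE` statement as a string. The `HotspotDdl` object, the `shardRowIdDdl` helper, and the value 4 are illustrative assumptions, and whether this relieves the hot region on TiSpark's write path would need to be verified on the actual cluster.

```scala
object HotspotDdl {
  // Builds a TiDB statement that sets SHARD_ROW_ID_BITS on an existing table.
  // `table` may be qualified, e.g. "test.store_sales"; `shardBits` must be positive.
  def shardRowIdDdl(table: String, shardBits: Int): String = {
    require(table.nonEmpty, "table name must not be empty")
    require(shardBits > 0, "shardBits must be positive")
    s"ALTER TABLE $table SHARD_ROW_ID_BITS = $shardBits;"
  }
}
```

For example, `HotspotDdl.shardRowIdDdl("test.store_sales", 4)` yields `ALTER TABLE test.store_sales SHARD_ROW_ID_BITS = 4;`, which would be run from a SQL client before the write starts.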