Very poor performance when writing from one TiDB table to another using PySpark

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 用pytispark写入从tidb的一个表写到另一个表性能很差 (Writing from one TiDB table to another with pytispark performs very poorly)

| username: TiDBer_luFyExXZ

As shown in the screenshot, running a simple data copy like `INSERT INTO t1 SELECT * FROM s1` through PySpark takes about two hours for 40 million records. Are there any parameters that can be tuned?
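
For reference, a minimal sketch of the kind of TiSpark job the post describes, assuming TiSpark 3.x is registered as a Spark catalog named `tidb_catalog`; the PD address, database, and table names here are placeholders, not values from the original post:

```python
from pyspark.sql import SparkSession

# Hypothetical setup: TiSpark registered as a Spark SQL catalog.
# The PD address, database, and table names are placeholders.
spark = (
    SparkSession.builder
    .appName("tidb-table-copy")
    .config("spark.sql.extensions", "org.apache.spark.sql.TiExtensions")
    .config("spark.sql.catalog.tidb_catalog",
            "org.apache.spark.sql.catalyst.catalog.TiCatalog")
    .config("spark.sql.catalog.tidb_catalog.pd.addresses", "pd0:2379")
    .config("spark.tispark.pd.addresses", "pd0:2379")
    .getOrCreate()
)

# Copy every row of s1 into t1 through the TiSpark catalog.
spark.sql(
    "INSERT INTO tidb_catalog.test.t1 SELECT * FROM tidb_catalog.test.s1"
)
```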

| username: TiDBer_小阿飞 | Original post link

The link provided leads to an article on CSDN about “pytispark.”

| username: shiyuhang0 | Original post link

  1. The reads that happen during the write are used for conflict detection and related checks.
  2. In earlier benchmarks, writing 40 million records with TiSpark or Spark JDBC took only minutes. If you do not need a global transaction, Spark JDBC is recommended (see the sketch after this list).
  3. What is your current concurrency? The benchmark used 32. If yours is lower, increase the number of executors/cores to raise the concurrency.
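
A sketch of the Spark JDBC write path recommended in point 2, with the concurrency tuning from point 3. The JDBC URL, credentials, and table name are placeholders, and the partition count, `batchsize`, and `rewriteBatchedStatements` are common starting points to tune, not figures from the benchmark:

```python
# Sketch of a parallel JDBC write to TiDB. Assumes the source DataFrame
# `df` was already read; URL, user, password, and table are placeholders.
(
    df.repartition(32)  # 32 concurrent write tasks, matching the benchmark
      .write
      .format("jdbc")
      .option("url",
              "jdbc:mysql://tidb-host:4000/test"
              "?rewriteBatchedStatements=true")  # send INSERTs in real batches
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "t1")
      .option("user", "root")
      .option("password", "")
      .option("isolationLevel", "NONE")  # no per-partition transaction wrapping
      .option("batchsize", 10000)        # rows per JDBC batch
      .mode("append")
      .save()
)
```

Note that `isolationLevel` set to `NONE` gives up transactional wrapping per partition, which is consistent with point 2's caveat that Spark JDBC does not provide a global transaction.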

| username: 数据小黑 | Original post link

Do you have the specific code? Let's see which write method you are using.

| username: xfworld | Original post link

Just use JDBC directly.
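
To make "JDBC directly" concrete on the read side as well, a sketch of a partitioned JDBC read; the partition column and its bounds are placeholders and must match an indexed numeric column of the real table:

```python
# Sketch of a partitioned JDBC read from TiDB; `spark` is an existing
# SparkSession. partitionColumn/lowerBound/upperBound are placeholders.
df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://tidb-host:4000/test")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "s1")
    .option("user", "root")
    .option("password", "")
    .option("partitionColumn", "id")  # numeric column used to split the scan
    .option("lowerBound", 1)
    .option("upperBound", 40000000)
    .option("numPartitions", 32)      # 32 concurrent read tasks
    .load()
)
```

The resulting `df` can then be written out with the JDBC write sketch shown earlier in the thread.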