TiSpark reports an error after importing data for a period of time

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tispark 导入数据时,导入一段时间tispark会报错。

| username: TiDBer_lei_xin

TiDB is version 5.0.3. When importing data into TiKV with TiSpark, the import fails after running for a while, roughly 2 hours, with the following error:

TiSpark error log:

com.pingcap.tikv.exception.TiBatchWriteException: Execution exception met.
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeys(TwoPhaseCommitter.java:304)
	at com.pingcap.tikv.TwoPhaseCommitter.prewriteSecondaryKeys(TwoPhaseCommitter.java:257)
	at com.pingcap.tispark.write.TiBatchWrite$$anonfun$doWrite$7.apply(TiBatchWrite.scala:305)
	at com.pingcap.tispark.write.TiBatchWrite$$anonfun$doWrite$7.apply(TiBatchWrite.scala:288)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.TiBatchWriteException: > max retry number 64, oldRegion={Region[501763] ConfVer[1691] Version[29004] Store[9] KeyRange[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000{\371J(\003\224\000\000\000\001\001\325\
\000]:[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000|1u\024\003\330\000\000\000\005\006\364\030\000]}, currentRegion={Region[510827] ConfVer[1691] Version[29007] Store[9] KeyRange[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000{\247|\021\003\254\000\000\000\002\361{2\000]:[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000|\002m+\003\354\000\000\000\f\314\200\351\000]}
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeys(TwoPhaseCommitter.java:283)
	... 15 more
Caused by: com.pingcap.tikv.exception.TiBatchWriteException: > max retry number 64, oldRegion={Region[501763] ConfVer[1691] Version[29004] Store[9] KeyRange[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000{\371J(\003\224\000\000\000\001\001\325\
\000]:[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000|1u\024\003\330\000\000\000\005\006\364\030\000]}, currentRegion={Region[510827] ConfVer[1691] Version[29007] Store[9] KeyRange[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000{\247|\021\003\254\000\000\000\002\361{2\000]:[t\200\000\000\000\000\000\002\022_i\200\000\000\000\000\000\000\001\003\200\000\000\000|\002m+\003\354\000\000\000\f\314\200\351\000]}
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:358)
	at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
	at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
	at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
	at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
	at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
	at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
	at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)
	at com.pingcap.tikv.TwoPhaseCommitter.doPrewriteSecondaryKeysInBatchesWithRetry(TwoPhaseCommitter.java:366)
	at com.pingcap.tikv.TwoPhaseCommitter.retryPrewriteBatch(TwoPhaseCommitter.java:387)

| username: 数据小黑 | Original post link

What are the respective versions of TiSpark and Spark? Additionally, during the failure period, were there any error-level logs on TiKV and TiDB?

| username: ShawnYan | Original post link

+1. You can also run select ti_version() to check the TiSpark version information, which will help pinpoint the issue.
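
For example, from a Spark shell where TiSpark is already loaded, the check might look like this (a minimal sketch; it assumes the spark session already has the TiSpark extension enabled):

```scala
// ti_version() is the function TiSpark registers to report its build information.
spark.sql("select ti_version()").show(false)
```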

| username: TiDBer_lei_xin | Original post link

TiSpark 2.4.1 and Spark 2.4.0. TiDB is not used; the TiSpark JAR communicates directly with TiKV.
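
For reference, a setup like this (no TiDB server in the data path) is normally wired up through the PD addresses and the TiSpark SQL extension. A minimal sketch, with a placeholder PD address rather than anything from this cluster:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of a SparkSession configured for TiSpark 2.4.x on Spark 2.4.
// TiSpark reads region metadata from PD and reads/writes row data directly on
// TiKV; the application name and PD address below are placeholders.
val spark = SparkSession.builder()
  .appName("tispark-import")
  .config("spark.sql.extensions", "org.apache.spark.sql.TiExtensions")
  .config("spark.tispark.pd.addresses", "pd0:2379")
  .getOrCreate()
```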

| username: TiDBer_lei_xin | Original post link

There are quite a few TiKV nodes; I’ll check if there are any errors.

| username: TiDBer_lei_xin | Original post link

The error log for TiKV during that time period is as follows:
[2022/06/23 15:44:39.568 +08:00] [WARN] [endpoint.rs:639] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000000002125F72820000001AC399FF lock_version: 434102246260080641 key: 7480000000000002125F698000000000000001038000000135D43A1803FE0000001BB4D1BD lock_ttl: 20000 txn_size: 16 min_commit_ts: 434102246260080641"]

| username: TiDBer_lei_xin | Original post link

Once it has been running for a while, about 2 hours, it keeps reporting errors.

| username: 数据小黑 | Original post link

Based on my current understanding, here is my reply:
The TiSpark write process is: lock table → prewrite pk → prewrite sk → commit pk → release table lock → commit sk.
From the logs you posted, the error occurred during the prewrite sk stage and was retried. It looks as though some SQL statement locked the data being written, and the write failed once the optimistic-lock retry limit was exceeded. Whether that SQL was a query or the retries were triggered by region splitting cannot be determined yet; we will consult an expert and get back to you. If you have any suspicious slow SQL or other logs, please post them as well.
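
For context, a typical TiSpark batch-write call that goes through this lock table → prewrite → commit sequence looks roughly like the sketch below; the whole flow runs inside the single save() call, and the prewrite sk retries in the stack trace correspond to TwoPhaseCommitter.prewriteSecondaryKeys. This is only a sketch of the documented DataSource-style write; a lower-level write path would look different but still goes through the same TwoPhaseCommitter flow. All names and option values are placeholders, not taken from this thread:

```scala
// Rough sketch of a TiSpark (2.4.x) batch write. The lock table -> prewrite pk ->
// prewrite sk -> commit pk -> release table lock -> commit sk sequence described
// above happens inside this one save() call. Option values are placeholders; in
// the documented DataSource writer the tidb.* options are used for authentication
// and table metadata even though the row data is written straight to TiKV.
val df = spark.sql("select * from source_db.source_table")

df.write
  .format("tidb")
  .option("tidb.addr", "tidb-host")
  .option("tidb.port", "4000")
  .option("tidb.user", "root")
  .option("tidb.password", "")
  .option("database", "test")
  .option("table", "target_table")
  .mode("append")
  .save()
```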

| username: TiDBer_lei_xin | Original post link

The application started reporting errors at around 13:50 and kept failing after that.

[2022/06/23 12:33:19.885 +08:00] [WARN] [peer.rs:3154] ["failed to propose confchange"] [error="Other("[components/raftstore/src/store/peer.rs:2442]: [region 446061] 446296 unsafe to perform conf change [peer { id: 501424 store_id: 5 }, peer { id: 501423 store_id: 8 }, change_type: AddLearnerNode peer { id: 465879 store_id: 6 role: Learner }, change_type: AddLearnerNode peer { id: 457835 store_id: 9 role: Learner }], before: voters: 457835 voters: 446296 voters: 465879 learners: 501424 learners: 501423, after: voters: 446296 voters: 501424 voters: 501423 voters_outgoing: 446296 voters_outgoing: 465879 voters_outgoing: 457835 learners_next: 457835 learners_next: 465879, truncated index 11569, promoted commit index 0")"]
[2022/06/23 12:35:56.923 +08:00] [WARN] [peer.rs:2963] ["skip proposal"] [error_code=KV:Raftstore:Unknown] [err="Other("[components/raftstore/src/store/peer.rs:2855]: [region 476655] 476656 log gap contains conf change, skip merging.")"] [peer_id=476656] [region_id=476655]
[2022/06/23 13:50:17.105 +08:00] [WARN] [peer.rs:3154] ["failed to propose confchange"] [error="Other("[components/raftstore/src/store/peer.rs:2442]: [region 503053] 503055 unsafe to perform conf change [peer { id: 511610 store_id: 8 }, change_type: AddLearnerNode peer { id: 503054 store_id: 9 role: Learner }], before: voters: 503056 voters: 503054 voters: 503055 learners: 511610, after: voters: 503056 voters: 511610 voters: 503055 voters_outgoing: 503056 voters_outgoing: 503054 voters_outgoing: 503055 learners_next: 503054, truncated index 353, promoted commit index 0")"]
[2022/06/23 13:50:17.109 +08:00] [WARN] [peer.rs:2963] ["skip proposal"] [error_code=KV:Raftstore:Unknown] [err="Other("[components/raftstore/src/store/peer.rs:2855]: [region 503053] 503055 log gap contains conf change, skip merging.")"] [peer_id=503055] [region_id=503053]
[2022/06/23 13:51:56.117 +08:00] [WARN] [peer.rs:3154] ["failed to propose confchange"] [error="Other("[components/raftstore/src/store/peer.rs:2442]: [region 506587] 511380 unsafe to perform conf change [peer { id: 511656 store_id: 6 }, change_type: AddLearnerNode peer { id: 506588 store_id: 4 role: Learner }], before: voters: 510503 voters: 506588 voters: 511380 learners: 511656, after: voters: 511656 voters: 510503 voters: 511380 voters_outgoing: 506588 voters_outgoing: 511380 voters_outgoing: 510503 learners_next: 506588, truncated index 2112, promoted commit index 0")"]
[2022/06/23 13:51:56.117 +08:00] [WARN] [peer.rs:3154] ["failed to propose confchange"] [error="Other("[components/raftstore/src/store/peer.rs:2442]: [region 506587] 511380 unsafe to perform conf change [peer { id: 511656 store_id: 6 }, change_type: AddLearnerNode peer { id: 506588 store_id: 4 role: Learner }], before: voters: 510503 voters: 506588 voters: 511380 learners: 511656, after: voters: 511656 voters: 510503 voters: 511380 voters_outgoing: 506588 voters_outgoing: 511380 voters_outgoing: 510503 learners_next: 506588, truncated index 2112, promoted commit index 0")"]
[2022/06/23 13:58:52.921 +08:00] [WARN] [endpoint.rs:639] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000000002125F72820000000D61D20B lock_version: 434100894789795841 key: 7480000000000002125F69800000000000000103800000000BD3934C03C40000000D82DD2C lock_ttl: 20000 txn_size: 16 min_commit_ts: 434100894789795841"]
[2022/06/23 13:58:52.921 +08:00] [WARN] [endpoint.rs:639] [error-response] [err="Key is locked (will clean up) primary_lock: 7480000000000002125F72820000000D61D20B lock_version: 434100894789795841 key: 7480000000000002125F698000000000000001038000000011F64EDE03920000000E1FB493 lock_ttl: 20000 txn_size: 16 min_commit_ts: 434100894789795841"]

| username: Vain | Original post link

Does this exception only show retry information, with nothing about why it is retrying? Retries are generally triggered by some underlying error. Are there any error messages?

| username: TiDBer_lei_xin | Original post link

That’s all the information I have. I didn’t see any error messages.

| username: yilong | Original post link

  1. Have you checked the TiKV monitoring? Are there any resource bottlenecks?
  2. Can you try importing in smaller batches to see whether it succeeds? (A rough sketch follows this list.)
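
If it helps, here is a rough sketch of what writing in smaller batches could look like; the split count, database/table names, and connection options are all illustrative placeholders, and the spark session is assumed to be configured for TiSpark as in the earlier sketch:

```scala
// Split the source DataFrame into smaller pieces so that each TiSpark
// batch-write transaction is smaller. Every name and option value below is a
// placeholder.
val sourceDf = spark.sql("select * from source_db.source_table")
val batches  = sourceDf.randomSplit(Array.fill(10)(1.0)) // ten roughly equal parts

batches.zipWithIndex.foreach { case (batch, i) =>
  println(s"Writing batch $i ...")
  batch.write
    .format("tidb")
    .option("tidb.addr", "tidb-host")
    .option("tidb.port", "4000")
    .option("tidb.user", "root")
    .option("tidb.password", "")
    .option("database", "test")
    .option("table", "target_table")
    .mode("append")
    .save()
}
```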

| username: Vain | Original post link

I was referring to the error message from TiSpark, which wasn’t fully captured. There should be more below.