Error Reading Data on k8s with Spark 3.2.1, TiSpark 3.0.0, and TiDB 6.1.0

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: spark 3.2.1 tispark 3.0.0 tidb 6.1.0 on k8s data reading error

| username: 数据小黑

Spark 3.2.1 + TiSpark 3.0.0 + TiDB 6.1.0 on k8s throws an error when reading data; after switching to Spark 3.0.3 there is no error.
The error with Spark 3.2.1 + TiSpark 3.0.0 + TiDB 6.1.0 on k8s is described as follows:
Environment:
Spark 3.2.1 + TiSpark 3.0.0
TiDB deployed on k8s: 3 PD + 1 TiDB + 3 TiKV

CREATE TABLE `sbtest_t_t` (
  `id` int(11) NOT NULL,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */,
  KEY `k_1` (`k`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin

The Spark code is as follows:

        String pd_addr = "basic-pd.tidb-cluster:2379";
        String tidb_addr = "basic-tidb.tidb-cluster";

        SparkConf conf = new SparkConf()
                .set("spark.sql.extensions", "org.apache.spark.sql.TiExtensions")
                .set("spark.sql.catalog.tidb_catalog", "org.apache.spark.sql.catalyst.catalog.TiCatalog")
                .set("spark.sql.catalog.tidb_catalog.pd.addresses", pd_addr)
                .set("spark.tispark.pd.addresses", pd_addr);
        SparkSession spark = SparkSession
                .builder()
                .appName("RdbToRdbProcess")
                .config(conf)
                .getOrCreate();

        // Use TiSpark to batch write DataFrame into TiDB
        Map<String, String> tiOptionMap = new HashMap<String, String>();
        tiOptionMap.put("tidb.addr", tidb_addr);
        tiOptionMap.put("tidb.port", "4000");
        tiOptionMap.put("tidb.user", username);
        tiOptionMap.put("tidb.password", password);
        tiOptionMap.put("replace", "true");
        tiOptionMap.put("spark.tispark.pd.addresses", pd_addr);

        spark.sql("use tidb_catalog.sbtest2");
        // Get current timestamp
        long ttl = System.currentTimeMillis();

        spark.sql("select * from sbtest_t_t where id = 100").show();

Running the read above produces the following error at runtime:

22/06/21 05:47:39 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (172.26.2.55 executor 1): com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
	at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:190)
	at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:167)
	at com.pingcap.tikv.operation.iterator.DAGIterator.hasNext(DAGIterator.java:113)
	at org.apache.spark.sql.execution.ColumnarRegionTaskExec$$anon$2.proceedNextBatchTask$1(CoprocessorRDD.scala:359)
	at org.apache.spark.sql.execution.ColumnarRegionTaskExec$$anon$2.hasNext(CoprocessorRDD.scala:369)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
	at java.base/java.util.concurrent.FutureTask.report(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.get(Unknown Source)
	at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:185)
	... 23 more
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
	at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:233)
	at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:91)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	... 3 more
Caused by: com.pingcap.tikv.exception.GrpcException: Request range exceeds bound, request range:[7480000000000022FF015F728000000000FF08744A0000000000FA, 7480000000000022FF015F728000000000FF08744C0000000000FA), physical bound:[7480000000000022FF015F728000000000FF0513960000000000FB, 7480000000000022FF015F728000000000FF08744B0000000000FB)
	at com.pingcap.tikv.region.RegionStoreClient.handleCopResponse(RegionStoreClient.java:733)
	at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:680)
	at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:220)
	... 7 more

TiDB logs:

# Time: 2022-06-21T05:50:41.645798217Z
# Txn_start_ts: 434055580189720578
# User@Host: root[root] @ 172.26.0.0 [172.26.0.0]
# Conn_ID: 4847320156352928759
# Query_time: 6.759797825
# Parse_time: 0.000074124
# Compile_time: 0.000347601
# Rewrite_time: 0.000242356
# Optimize_time: 0.000041402
# Wait_TS: 0.000357856
# Cop_time: 0.315299844 Process_time: 0.862 Wait_time: 0.003 Request_count: 3 Process_keys: 551276 Total_keys: 551279 Rocksdb_key_skipped_count: 551276 Rocksdb_block_cache_hit_count: 2021
# DB: sbtest2
# Is_internal: false
# Digest: 00f78d8dc447bf40093b4e5a2b0e92099ea1c4745b8f59e14973f4bd18e91550
# Stats: sbtest_t:pseudo
# Num_cop_tasks: 3
# Cop_proc_avg: 0.287333333 Cop_proc_p90: 0.344 Cop_proc_max: 0.344 Cop_proc_addr: basic-tikv-0.basic-tikv-peer.tidb-cluster.svc:20160
# Cop_wait_avg: 0.001 Cop_wait_p90: 0.001 Cop_wait_max: 0.001 Cop_wait_addr: basic-tikv-0.basic-tikv-peer.tidb-cluster.svc:20160
# Mem_max: 270091409
# Prepared: false
# Plan_from_cache: false
# Plan_from_binding: false
# Has_more_results: false
# KV_total: 2.798632842
# PD_total: 0.000348634
# Backoff_total: 0.002
# Write_sql_response_total: 0
# Result_rows: 0
# Succ: false
# IsExplicitTxn: false
# Plan: tidb_decode_plan('8wXweTAJMjdfMQkwCTAJTi9BCTAJdGltZTo2LjQ3cywgbG9vcHM6MSwgcHJlcGFyZTogMS4xMXMsIGluc2VydDo1LjM2cwkxMDAuNiBNQglOL0EKMQkzMV83CTAJMTAwMDAJZGF0YTpUYWJsZUZ1bGxTY2FuXzYJNDEzNTU4CWoUMzE3LjltFWx8NDA1LCBjb3BfdGFzazoge251bTogMywgbWF4OiA1ODIBKiRtaW46IDMxNS4xAQ4kYXZnOiA0MzYuNgEOCHA5NRkoUGF4X3Byb2Nfa2V5czogMjE4NTgyLAEjThcACHRvdAUXDDogODYFZwERGHdhaXQ6IDMBWgxycGNfEY4BDCUoFCAxLjMxcwWyfHJfY2FjaGVfaGl0X3JhdGlvOiAwLjAwfQkyMDAuMyBNKR8oMgk0M182CTFfMAkpIfBAdGFibGU6c2J0ZXN0X3QsIGtlZXAgb3JkZXI6ZmFsc2UsIHN0YXRzOnBzZXVkbwk1NTEyNzYJdGlrdl90YXNrOnsB4iUfBDIyJRAhHggxMjcBswhwODARFiEYDSEoaXRlcnM6NTUyLCABQmBzOjN9LCBzY2FuX2RldGFpbDoge3RvdGFsJQ4IZXNzLT8JegAsIRc6HAAoX3NpemU6IDEyMzQhYgA0ESQpdwU4oDksIHJvY2tzZGI6IHtkZWxldGVfc2tpcHBlZF9jb3VudDogMCwga2V5PhYABT4YNiwgYmxvY0EQOWQNNyAyMDIxLCByZWEuSQAFD2BieXRlOiAwIEJ5dGVzfX19CU4vQQlOL0EK')
# Plan_digest: 008eb1fb01becb5754e1b45518519660d20ae1ee6f7671d9b403ba347d5af606
/* ApplicationName=DBeaver 21.1.3 - SQLEditor <Script-176.sql> */ insert into sbtest2.sbtest_t_t select * from sbtest2.sbtest_t;

TiKV logs:

[2022/06/21 05:50:04.497 +00:00] [INFO] [apply.rs:1395] ["execute admin command"] [command="cmd_type: BatchSplit splits { requests { split_key: 7480000000000022FF2300000000000000F8 new_region_id: 724041 new_peer_ids: 724042 new_peer_ids: 724043 new_peer_ids: 724044 } right_derive: true }"] [index=8] [term=6] [peer_id=724018] [region_id=724017]
[2022/06/21 05:50:04.498 +00:00] [INFO] [apply.rs:2238] ["split region"] [keys="key 7480000000000022FF2300000000000000F8"] [region="id: 724017 start_key: 7480000000000022FF015F728000000000FF0BCA210000000000FB region_epoch { conf_ver: 5 version: 33407 } peers { id: 724018 store_id: 1 } peers { id: 724019 store_id: 6001 } peers { id: 724020 store_id: 6002 }"] [peer_id=724018] [region_id=724017]
[2022/06/21 05:50:04.502 +00:00] [INFO] [peer.rs:3561] ["moving 0 locks to new regions"] [region_id=724017]
[2022/06/21 05:50:04.502 +00:00] [INFO] [peer.rs:3656] ["insert new region"] [region="id: 724041 start_key: 7480000000000022FF015F728000000000FF0BCA210000000000FB end_key: 7480000000000022FF2300000000000000F8 region_epoch { conf_ver: 5 version: 33408 } peers { id: 724042 store_id: 1 } peers { id: 724043 store_id: 6001 } peers { id: 724044 store_id: 6002 }"] [region_id=724041]
[2022/06/21 05:50:04.502 +00:00] [INFO] [peer.rs:251] ["create peer"] [peer_id=724042] [region_id=724041]
[2022/06/21 05:50:04.502 +00:00] [INFO] [raft.rs:2646] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {724044, 724042, 724043} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=724042] [region_id=724041]
[2022/06/21 05:50:04.502 +00:00] [INFO] [raft.rs:1120] ["became follower at term 5"] [term=5] [raft_id=724042] [region_id=724041]
[2022/06/21 05:50:04.502 +00:00] [INFO] [raft.rs:384] [newRaft] [peers="Configuration { incoming: Configuration { voters: {724044, 724042, 724043} }, outgoing: Configuration { voters: {} } }"] ["last term"=5] ["last index"=5] [applied=5] [commit=5] [term=5] [raft_id=724042] [region_id=724041]
[2022/06/21 05:50:04.502 +00:00] [INFO] [raw_node.rs:315] ["RawNode created with id 724042."] [id=724042] [raft_id=724042] [region_id=724041]
[2022/06/21 05:50:04.506 +00:00] [INFO] [raft.rs:1565] ["[logterm: 5, index: 5, vote: 0] cast vote for 724044 [logterm: 5, index: 5] at term 5"] ["msg type"=MsgRequestPreVote] [term=5] [msg_index=5] [msg_term=5] [from=724044] [vote=0] [log_index=5] [log_term=5] [raft_id=724042] [region_id=724041]
[2022/06/21 05:50:04.511 +00:00] [INFO] [raft.rs:1364] ["received a message with higher term from 724044"] ["msg type"=MsgRequestVote] [message_term=6] [term=5] [from=724044] [raft_id=724042] [region_id=724041]
[2022/06/21 05:50:04.511 +00:00] [INFO] [raft.rs:1120] ["became follower at term 6"] [term=6] [raft_id=724042] [region_id=724041]
[2022/06/21 05:50:04.511 +00:00] [INFO] [raft.rs:1565] ["[logterm: 5, index: 5, vote: 0] cast vote for 724044 [logterm: 5, index: 5] at term 6"] ["msg type"=MsgRequestVote] [term=6] [msg_index=5] [msg_term=5] [from=724044] [vote=0] [log_index=5] [log_term=5] [raft_id=724042] [region_id=724041]
[2022/06/21 05:51:09.719 +00:00] [INFO] [kv.rs:1117] ["call CheckLeader failed"] [address=ipv4:172.26.2.39:52190] [err=Grpc(RemoteStopped)]
| username: Vain | Original post link

Is it related to the Spark version? No issues with 3.0, but errors occur with 3.2?

| username: shiyuhang0 | Original post link

Have you tried running it outside of k8s to see if there are any issues?

| username: tidb狂热爱好者 | Original post link

Is it resolved?

| username: Vain | Original post link

I verified it locally, and there is no problem in non-k8s mode.

| username: 数据小黑 | Original post link

Did you verify the combination of Spark 3.2.1, TiSpark 3.0.0, and TiDB 6.1.0? If your verification passed, that matches my understanding, and my initial suspicion is also that it is an issue with running on Kubernetes. The main problem right now is that there is no clear troubleshooting approach: in the same environment, downgrading Spark to 3.0.3 works without issues, so I'm really not sure how to proceed.

| username: shiyuhang0 | Original post link

Is it stably reproducible on k8s? You could first open an issue on GitHub; it's best to describe the environment clearly.

| username: 数据小黑 | Original post link

Okay, I’ll raise an issue and contact the official team.

| username: yilong | Original post link

Has this issue been submitted? What is the link? Thanks.

| username: Vain | Original post link

I am using TiSpark 3.0.1.
Can this issue be consistently reproduced?

| username: 数据小黑 | Original post link

It can be stably reproduced with Spark 3.2.1, TiSpark 3.0.0, and TiDB 6.1.0 on Kubernetes.

| username: shiyuhang0 | Original post link

Could you please help test whether Spark 3.2.1, TiSpark 3.0.1, and TiDB 6.1.0 on Kubernetes also have similar issues?

| username: 数据小黑 | Original post link

There are no issues with Spark 3.2.1, TiSpark 3.0.1, and TiDB 6.1.0 on Kubernetes. Several features have been tested and passed.

| username: 数据小黑 | Original post link

This topic was automatically closed 1 minute after the last reply. No new replies are allowed.