TiSpark 2.5.0 can access TiDB, but TiSpark 3.0 and 3.1 fail with errors

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tispark2.5.0访问tidb可行,tispark3.1及3.0 访问出错

| username: TiDBer_oJBjP6il

Hi, TiDB engineers,

While using TiDB, I ran into an issue where only one specific version of TiSpark can access it successfully. The details are as follows:

Environment:

  • Scala 2.12.10
  • Spark 3.1.1
  • Hadoop 2.7.3
  • TiDB 4.0.16
  • tispark-assembly-2.5.0.jar or tispark-assembly-3.1_2.12-3.1.1.jar

Configuration:

--conf spark.tispark.pd.addresses=${TIDB_PD_ADDRESSES}
--conf spark.sql.extensions=org.apache.spark.sql.TiExtensions
--conf spark.tispark.isolation_read_engines=tikv
--conf spark.sql.catalog.tidb_catalog=org.apache.spark.sql.catalyst.catalog.TiCatalog
--conf spark.sql.catalog.tidb_catalog.pd.addresses=${TIDB_PD_ADDRESSES}

I start a Spark job that connects to TiDB and runs a Spark SQL query. The query selects certain fields from the whole table, similar to:

spark.sql("select fieldList from tidb_catalog.database.table");

When using tispark-assembly-2.5.0.jar, the Spark job runs successfully without errors. However, when using tispark-assembly-3.1_2.12-3.1.1.jar, the Spark job reports an error:

22/11/16 05:57:33 ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.SparkException: Job aborted.
...
Caused by: com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
...
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
...
Caused by: com.pingcap.tikv.exception.GrpcException: retry is exhausted.
...
Caused by: com.pingcap.tikv.exception.GrpcException: message: "region 12189882 is missing"
region_not_found {
region_id: 12189882
}
...
Exception in thread "main" org.apache.spark.SparkException: Application application_1663728370843_4161693 finished with failed status
...

The error indicates a problem reading a region, specifically that a region is missing. This problem does not occur with the older version of TiSpark. Could you please advise on how to resolve it?

| username: shiyuhang0 | Original post link

Is this error intermittent or does it occur every time?

| username: shiyuhang0 | Original post link

This log appears to be from the driver side. Do you have the logs from the Spark executor?

The driver-side log only shows the reason for the last failure after retries were exhausted.

Caused by: com.pingcap.tikv.exception.GrpcException: retry is exhausted.
at com.pingcap.tikv.util.ConcreteBackOffer.doBackOffWithMaxSleep(ConcreteBackOffer.java:153)
at com.pingcap.tikv.util.ConcreteBackOffer.doBackOff(ConcreteBackOffer.java:124)
at com.pingcap.tikv.region.RegionStoreClient.handleCopResponse(RegionStoreClient.java:709)
at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:681)
at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:220)
… 7 more
Caused by: com.pingcap.tikv.exception.GrpcException: message: "region 12189882 is missing"
region_not_found {
region_id: 12189882
}

The details of the retry failures should be in the Spark executor logs.
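Since the job runs on YARN, one way to pull the aggregated executor logs after the application finishes is the `yarn logs` command. This is a sketch; the application ID below is taken from the driver output earlier in the thread, and the grep pattern is only a guess at what to look for:

```shell
# Fetch aggregated container logs for the finished application,
# then show context around TiSpark's region client messages.
yarn logs -applicationId application_1663728370843_4161693 \
  | grep -B 2 -A 30 "RegionStoreClient"
```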

| username: TiDBer_oJBjP6il | Original post link

It occurs every time; the error is consistently reproducible.

| username: TiDBer_oJBjP6il | Original post link

Hello, I found the following logs in the executor:

22/11/21 20:21:31 WARN RegionStoreClient: Re-splitting region task due to region error: EpochNotMatch current epoch of region 9438746 is conf_ver: 17 version: 3556, but you sent conf_ver: 17 version: 3397
22/11/21 20:21:31 ERROR Utils: Aborting task
com.pingcap.tikv.exception.TiClientInternalException: Error reading region:
	at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:190)
	at com.pingcap.tikv.operation.iterator.DAGIterator.readNextRegionChunks(DAGIterator.java:167)
	at com.pingcap.tikv.operation.iterator.DAGIterator.hasNext(DAGIterator.java:113)
	at org.apache.spark.sql.execution.ColumnarRegionTaskExec$$anon$2.proceedNextBatchTask$1(CoprocessorRDD.scala:359)
	at org.apache.spark.sql.execution.ColumnarRegionTaskExec$$anon$2.hasNext(CoprocessorRDD.scala:374)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:277)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at com.pingcap.tikv.operation.iterator.DAGIterator.doReadNextRegionChunks(DAGIterator.java:185)
	... 21 more
Caused by: com.pingcap.tikv.exception.RegionTaskException: Handle region task failed:
	at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:233)
	at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:91)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	... 3 more
Caused by: com.pingcap.tikv.exception.GrpcException: retry is exhausted.
	at com.pingcap.tikv.util.ConcreteBackOffer.doBackOffWithMaxSleep(ConcreteBackOffer.java:153)
	at com.pingcap.tikv.util.ConcreteBackOffer.doBackOff(ConcreteBackOffer.java:124)
	at com.pingcap.tikv.region.RegionStoreClient.handleCopResponse(RegionStoreClient.java:709)
	at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:681)
	at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:220)
	... 7 more
Caused by: com.pingcap.tikv.exception.GrpcException: message: "EpochNotMatch current epoch of region 9438746 is conf_ver: 17 version: 3556, but you sent conf_ver: 17 version: 3397"
epoch_not_match {
  current_regions {
    id: 9438746
    start_key: "t\200\000\000\000\000\000\000\377/_r\200\000\000\000\002\377\256\377\235\000\000\000\000\000\372"
    end_key: "t\200\000\000\000\000\000\000\377/_r\200\000\000\000\002\377\2755\a\000\000\000\000\000\372"
    region_epoch {
      conf_ver: 17
      version: 3556
    }
    peers {
      id: 9438748
      store_id: 2003
    }
    peers {
      id: 9438750
      store_id: 6947716
    }
    peers {
      id: 10583292
      store_id: 10530118
    }
    peers {
      id: 10614114
      store_id: 10613406
      role: Learner
    }
  }