EpochNotMatch: Current Epoch of Region

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: EpochNotMatch current epoch of region

| username: wfxxh

【TiDB Usage Environment】Production Environment
【TiDB Version】TiDB v5.1.1, TiSpark v2.5.0, Spark 3.0.1
【Encountered Problem】TiSpark reports an error when reading TiKV
【Problem Phenomenon and Impact】

TiSpark log:

22/07/06 15:12:14 ERROR DAGIterator: Process region tasks failed, remain 0 tasks not executed due to
com.pingcap.tikv.exception.GrpcException: retry is exhausted.
at com.pingcap.tikv.util.ConcreteBackOffer.doBackOffWithMaxSleep(ConcreteBackOffer.java:148)
at com.pingcap.tikv.util.ConcreteBackOffer.doBackOff(ConcreteBackOffer.java:119)
at com.pingcap.tikv.region.RegionStoreClient.handleCopResponse(RegionStoreClient.java:703)
at com.pingcap.tikv.region.RegionStoreClient.coprocess(RegionStoreClient.java:675)
at com.pingcap.tikv.operation.iterator.DAGIterator.process(DAGIterator.java:219)
at com.pingcap.tikv.operation.iterator.DAGIterator.lambda$submitTasks$1(DAGIterator.java:90)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.pingcap.tikv.exception.GrpcException: message: "EpochNotMatch current epoch of region 275292521 is conf_ver: 437 version: 2285, but you sent conf_ver: 437 version: 2282"
epoch_not_match {
current_regions {
id: 275292521
start_key: "t\200\000\000\000\000\000\000\377\027_r\200\000\000\000\000\377\025\271o\000\000\000\000\000\372"
end_key: "t\200\000\000\000\000\000\000\377\031_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\025!\003\200\000\000\377\000\000\000\000\000\003\200\000\377\000\000\000\000\000\006\003\200\377\000\000\000\000\000\000N\000\376"
region_epoch {
conf_ver: 437
version: 2285
}
peers {
id: 275292522
store_id: 274433474
}
peers {
id: 275292524
store_id: 16
}
peers {
id: 275293104
store_id: 1
}
}
current_regions {
id: 275294317
end_key: "t\200\000\000\000\000\000\000\377\027_r\200\000\000\000\000\377\025\271o\000\000\000\000\000\372"
region_epoch {
conf_ver: 437
version: 2285
}
peers {
id: 275294318
store_id: 274433474
}
peers {
id: 275294319
store_id: 16
}
peers {
id: 275294320
store_id: 1
}
}
}
at com.pingcap.tikv.region.RegionStoreClient.handleCopResponse(RegionStoreClient.java:704)
… 9 more

TiKV log:

[endpoint.rs:632] [error-response] [err="Region error (will back off and retry) message: \"EpochNotMatch current epoch of region 275292521 is conf_ver: 437 version: 2285, but you sent conf_ver: 437 version: 2282\" epoch_not_match { current_regions { id: 275292521 start_key: 7480000000000000FF175F728000000000FF15B96F0000000000FA end_key: 7480000000000000FF195F698000000000FF0000010380000000FF0000152103800000FF0000000000038000FF0000000000060380FF0000000000004E00FE region_epoch { conf_ver: 437 version: 2285 } peers { id: 275292522 store_id: 274433474 } peers { id: 275292524 store_id: 16 } peers { id: 275293104 store_id: 1 } } current_regions { id: 275294317 end_key: 7480000000000000FF175F728000000000FF15B96F0000000000FA region_epoch { conf_ver: 437 version: 2285 } peers { id: 275294318 store_id: 274433474 } peers { id: 275294319 store_id: 16 } peers { id: 275294320 store_id: 1 } } }"]

| username: Meditator | Original post link

Check the heatmap in the PD Dashboard to see whether there are any hotspots. A hotspot can cause the Raft log apply to lag behind.

| username: wfxxh | Original post link

There aren't any. Querying the TIDB_HOT_REGIONS table also shows no hot Regions.
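
For reference, a minimal sketch of that check from the same Spark session, assuming the MySQL JDBC driver is on the classpath and an existing SparkSession named `spark` (host and credentials are placeholders):

```scala
// Pull INFORMATION_SCHEMA.TIDB_HOT_REGIONS into Spark over JDBC.
val hotRegions = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://tidb-host:4000/information_schema")
  .option("dbtable", "TIDB_HOT_REGIONS")
  .option("user", "root")
  .option("password", "")
  .load()

// Inspect the flow-related columns to judge whether any Region is really hot.
hotRegions.show(truncate = false)
```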

| username: Meditator | Original post link

Take a look at this issue: "Java客户端连接tispark报错: tco count should be positive" (Java client reports an error when connecting to TiSpark: tco count should be positive) · Issue #558 · pingcap/tispark · GitHub. It seems to be a similar problem.

| username: wfxxh | Original post link

It’s different.

| username: 小王同学Plus | Original post link

This error is occasional, right? These errors are expected, and the client is expected to retry them actively. If the application has its own retry mechanism, the impact can be ignored. In a future release, TiSpark will retry automatically when reading data instead of surfacing the error.
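
Since the current TiSpark version surfaces the error to the caller, a retry has to live in the application for now. Below is a rough sketch of such a wrapper, assuming an existing SparkSession named `spark`; the exception filter, attempt count, and backoff values are illustrative choices, not TiSpark API:

```scala
// Re-runs `body` when the failure (or any of its causes) mentions EpochNotMatch,
// backing off exponentially between attempts.
object RetryOnRegionError {
  @annotation.tailrec
  private def mentionsEpochNotMatch(t: Throwable): Boolean =
    if (t == null) false
    else if (Option(t.getMessage).exists(_.contains("EpochNotMatch"))) true
    else if (t.getCause eq t) false
    else mentionsEpochNotMatch(t.getCause)

  def withRetry[T](attempts: Int, initialDelayMs: Long)(body: => T): T = {
    var remaining = attempts
    var delayMs = initialDelayMs
    var result: Option[T] = None
    while (result.isEmpty) {
      try {
        result = Some(body)
      } catch {
        case e: Exception if remaining > 1 && mentionsEpochNotMatch(e) =>
          remaining -= 1
          Thread.sleep(delayMs) // give the cluster time to settle after the Region change
          delayMs *= 2
      }
    }
    result.get
  }
}

// Example usage (database and table names are placeholders):
// RetryOnRegionError.withRetry(attempts = 3, initialDelayMs = 1000) {
//   spark.sql("SELECT COUNT(*) FROM mydb.mytable").show()
// }
```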

| username: wfxxh | Original post link

Hello, I found the corresponding table based on the Region ID in the error.
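
The thread does not show how that lookup was done; one way, sketched here under the assumption of an existing SparkSession `spark` with the MySQL JDBC driver on the classpath (host and credentials are placeholders), is to ask INFORMATION_SCHEMA.TIKV_REGION_STATUS which table owns the Region ID from the error:

```scala
// Map the Region ID reported in the error back to a database and table.
val regionOwner = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://tidb-host:4000/information_schema")
  .option("query",
    "SELECT DB_NAME, TABLE_NAME FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS " +
      "WHERE REGION_ID = 275292521") // Region ID from the EpochNotMatch message
  .option("user", "root")
  .option("password", "")
  .load()

regionOwner.show(truncate = false)
```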

| username: Gin | Original post link

When a query request reaches TiKV just as the Region splits, the stale Region metadata carried by the request can no longer be used to access the data, which causes this error. The error is rarely seen when accessing TiKV through TiDB, because TiDB implements a backoff mechanism: after changes such as Region leader scheduling, Region splits, and Region merges, TiDB fetches the latest metadata from PD and then re-accesses TiKV using the original startTS. To a large extent this shields the client from errors; the client only sees increased latency.
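
To make that backoff behaviour concrete, here is a purely schematic sketch; `Region`, `Response`, `fetchRegionFromPD`, and `sendCopRequest` are stand-ins invented for illustration, not the real TiDB or client-java APIs:

```scala
// Schematic refresh-and-retry loop: on a Region error, back off, refetch the
// latest Region metadata from PD, and resend the request with the same startTS.
object BackoffSketch {
  final case class Region(id: Long, confVer: Long, version: Long)
  final case class Response(regionError: Option[String])

  // Stubs standing in for "ask PD for the latest metadata" and
  // "send the coprocessor request to TiKV".
  def fetchRegionFromPD(key: Array[Byte]): Region = Region(275292521L, 437L, 2285L)
  def sendCopRequest(region: Region, key: Array[Byte], startTs: Long): Response =
    Response(regionError = None)

  def readWithBackoff(key: Array[Byte], startTs: Long, maxBackoffMs: Long): Response = {
    var sleptMs = 0L
    var delayMs = 50L
    while (true) {
      val region = fetchRegionFromPD(key)               // always use the latest metadata
      val resp   = sendCopRequest(region, key, startTs) // keep the original startTS
      resp.regionError match {
        case None => return resp
        case Some(_) if sleptMs < maxBackoffMs =>       // e.g. EpochNotMatch after a split
          Thread.sleep(delayMs); sleptMs += delayMs; delayMs *= 2
        case Some(err) => throw new RuntimeException(s"backoff exhausted: $err")
      }
    }
    throw new IllegalStateException("unreachable")
  }
}
```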

| username: wfxxh | Original post link

Hello, I used TiSpark 3 and specified spark.tispark.stale_read for reading, but this error still occurs.
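
For context, this is roughly how that would be configured (a sketch assuming spark-shell with TiSpark 3 and the default `tidb_catalog` catalog name; the timestamp unit and the database/table names are assumptions for illustration):

```scala
// Ask TiSpark to read a slightly stale snapshot instead of the latest data.
val staleTsMs = System.currentTimeMillis() - 5 * 60 * 1000L // ~5 minutes ago
spark.conf.set("spark.tispark.stale_read", staleTsMs.toString)

spark.sql("SELECT COUNT(*) FROM tidb_catalog.mydb.mytable").show()
```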

| username: 数据小黑 | Original post link

To supplement the original poster’s question, here is some monitoring information:
wf-resource-PD_2022-07-27T08_48_20.905Z.json (4.0 MB)
wf-resource-TiKV-Details_2022-07-27T08_44_26.176Z.json (17.5 MB)

| username: wfxxh | Original post link

The issue was resolved after upgrading the TiDB version to v5.4.2.