Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: client-go in txn mode returns the error "Key xxx is out of [region YYY]" when accessing TiKV
【TiKV Usage Environment】Production Environment
【TiKV Version】5.3.2
【Reproduction Path】
Occurs occasionally; cannot be reproduced on demand
【Encountered Problem: Problem Phenomenon and Impact】
Using the client-go txn client to open a transaction and then execute a get, the call immediately returns an error with the following message:
tikv aborts txn: Error(Txn(Error(Mvcc(Error(Kv(Error(Other("[components/tikv_kv/src/raftstore_impls.rs:35]: \"[components/raftstore/src/store/region_snapshot.rs:216]: Key XXXX is out of [region YYY]
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
According to the code location indicated by the error (components/raftstore/src/store/region_snapshot.rs in tikv/tikv at commit 8e6e348505e7f1f7b5e023c00b30f90e8d1b4084):
check_key_in_range(
    key,
    self.region.get_id(),
    self.region.get_start_key(),
    self.region.get_end_key(),
)
.map_err(|e| EngineError::Other(box_err!(e)))?;
Here, the KeyOutOfRegion error is mapped to an Other error, so its original type is lost.
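To make the mapping concrete, here is a minimal, self-contained sketch of a range check whose typed error is flattened into a catch-all variant. The types `RangeError` and `EngineError` and the helper `get_checked` are simplified stand-ins written for illustration, not TiKV's real definitions; only the shape of the `check_key_in_range(...).map_err(...)` call follows the snippet above.

```rust
// Simplified sketch; these types are hypothetical stand-ins, not TiKV's real ones.
#[derive(Debug)]
enum RangeError {
    // Typed error: still carries the key and the region bounds.
    NotInRange { key: Vec<u8>, region_id: u64, start: Vec<u8>, end: Vec<u8> },
}

#[derive(Debug)]
enum EngineError {
    // Catch-all variant: only a message survives.
    Other(String),
}

// A key belongs to the region if start <= key < end.
// An empty end key is treated as "unbounded" (an assumption of this sketch).
fn check_key_in_range(
    key: &[u8],
    region_id: u64,
    start: &[u8],
    end: &[u8],
) -> Result<(), RangeError> {
    if key >= start && (end.is_empty() || key < end) {
        Ok(())
    } else {
        Err(RangeError::NotInRange {
            key: key.to_vec(),
            region_id,
            start: start.to_vec(),
            end: end.to_vec(),
        })
    }
}

// Mirrors the map_err in the snippet: the typed error becomes a plain message,
// so a caller can no longer match on "key not in region" as a distinct case.
fn get_checked(key: &[u8], region_id: u64, start: &[u8], end: &[u8]) -> Result<(), EngineError> {
    check_key_in_range(key, region_id, start, end).map_err(
        |RangeError::NotInRange { key, region_id, start, end }| {
            EngineError::Other(format!(
                "Key {:?} is out of [region {}] [{:?}, {:?})",
                key, region_id, start, end
            ))
        },
    )
}
```

Once the error is wrapped this way, the only trace of its cause is the message text, which is what shows up in the client-side error above.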
Questions:
- This error is thrown when the key range is checked after the region snapshot has been obtained. Is this expected in raftstore?
The key range was already checked, and passed, in the earlier code, yet the check fails again at the final step when reading through the snapshot. If this is expected, what causes it? Is it a region split? If so, shouldn't this error be converted to a RegionErr rather than an Other error? My understanding is that a RegionErr would let client-go retry.
Hello, when accessing TiKV in txn mode, region handling works as follows:
1. The txn client retrieves the region information for the key from PD.
2. The txn client sends the request to the corresponding TiKV together with the region metadata (such as version information).
3. After receiving the request, TiKV checks whether the region information matches. This check mainly ensures that the region on TiKV has not changed through a split or merge.
4. TiKV starts processing the request. When it gets the snapshot and handles the key, it finds that the key is no longer in this region.
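The interleaving behind this list can be replayed with a toy model. The sketch below is only an illustration under stated assumptions: `Region`, `epoch_matches`, `read_from_snapshot`, and `simulate_race` are hypothetical names, the region is reduced to a key range plus a version counter, and the split is assumed to leave the original region ID on the left half. It is not how TiKV implements these steps.

```rust
// Hypothetical, simplified model of a region; not TiKV's real types.
#[derive(Clone)]
struct Region {
    id: u64,
    start: Vec<u8>,
    end: Vec<u8>, // empty means unbounded in this sketch
    version: u64, // bumped on every split
}

impl Region {
    fn contains(&self, key: &[u8]) -> bool {
        key >= self.start.as_slice() && (self.end.is_empty() || key < self.end.as_slice())
    }

    // Split at `split_key`: this region keeps [start, split_key),
    // the returned region takes [split_key, end). Both get a new version.
    fn split(&mut self, split_key: &[u8], new_id: u64) -> Region {
        let right = Region {
            id: new_id,
            start: split_key.to_vec(),
            end: self.end.clone(),
            version: self.version + 1,
        };
        self.end = split_key.to_vec();
        self.version += 1;
        right
    }
}

// Region-information check: compare the version cached by the client
// with the store's current version.
fn epoch_matches(client_version: u64, region: &Region) -> bool {
    client_version == region.version
}

// Snapshot read: re-check the key against the region captured by the snapshot.
fn read_from_snapshot(snapshot_region: &Region, key: &[u8]) -> Result<(), String> {
    if snapshot_region.contains(key) {
        Ok(())
    } else {
        Err(format!("Key {:?} is out of [region {}]", key, snapshot_region.id))
    }
}

// Replays the interleaving: the check passes, a split lands, the snapshot read fails.
fn simulate_race() -> (bool, Result<(), String>) {
    let mut region = Region { id: 1, start: b"a".to_vec(), end: b"z".to_vec(), version: 5 };
    let client_version = 5; // what the client cached from PD
    let key = b"m";

    let epoch_ok = epoch_matches(client_version, &region); // passes
    let _right = region.split(b"k", 2); // split lands between the two checks
    let snapshot_region = region.clone(); // snapshot sees the post-split range
    (epoch_ok, read_from_snapshot(&snapshot_region, key))
}
```

In this model `simulate_race()` reports that the version check succeeded while the snapshot read failed, because the key `m` moved to the new right-hand region between the two checks.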
As you mentioned, the region may have changed after step 3, for example through a split, which would lead to this error. You can confirm this by searching the TiKV logs for entries related to this region.
Why is it an Other Error rather than a Region Error? I guess the reasons might be as follows:
- First, region errors are generally surfaced in step 3, but here processing has already reached step 4, which is deep in the core read path. Handling region errors at this stage was probably not considered originally.
- Second, and most importantly, returning a region error directly here is not appropriate, because the plain fact is that the key is not in this region. The design intent may be to draw attention to the error, so that someone investigates whether it was caused by a change in region information rather than by an internal logic bug.
- Third, a region split is itself nearly instantaneous, and so is getting a snapshot, so the chance of hitting this path exists but is not high. That is why, as you mentioned, the scenario is hard to reproduce. I guess this check was not there initially and was added later after someone ran into the problem. Given the second reason, not reporting a Region Error may be the safest choice for users, since the system might genuinely have a problem.
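To illustrate why the error class matters to the caller, here is a hedged sketch of a retry loop that retries only region errors and surfaces everything else. `KvError` and `get_with_retry` are hypothetical names for this illustration and do not correspond to client-go's actual types or retry policy.

```rust
// Hypothetical error classes as a client might see them; not client-go's real types.
#[derive(Debug, Clone, PartialEq)]
enum KvError {
    Region(String), // region info is stale: safe to refresh and retry
    Other(String),  // anything else: surfaced to the caller
}

// Retries `op` only on region errors, up to `max_attempts` attempts in total.
// Returns the final result and the number of attempts made.
fn get_with_retry<F>(mut op: F, max_attempts: u32) -> (Result<Vec<u8>, KvError>, u32)
where
    F: FnMut() -> Result<Vec<u8>, KvError>,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        match op() {
            Err(KvError::Region(_)) if attempts < max_attempts => {
                // A real client would refresh its cached region info here.
                continue;
            }
            // Success, an Other error, or an exhausted retry budget.
            other => return (other, attempts),
        }
    }
}
```

Under this policy, an out-of-range failure wrapped as `Other` is returned after a single attempt, whereas the same failure classified as `Region` would be retried transparently. That is the trade-off discussed above: retrying hides the event, while surfacing it prompts investigation.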