Will writes to the target Region hit an epochNotMatch error during a TiKV Region merge?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv merge 过程中,target region 的写入是否会有 epochNotMatch 错误

| username: ylldty

This document points out that, during the split process, some already-proposed Raft logs are skipped, which results in an epochNotMatch error.

The merge process is similar, but the merge document describes the Target Region as having no service downtime.

When the Target Region applies CommitMerge, its epoch version is also incremented. Why does merge not run into the same issue as split?
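To make the concern concrete, here is a minimal sketch of the behavior being asked about. It is a toy model, not TiKV's code: the `Region` class, the tuple-shaped log entries, and the rule "bump `version` by exactly 1" are all assumptions made for illustration. It only shows why a write proposed under the old epoch could be skipped once CommitMerge (or a split) has been applied ahead of it in the log.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegionEpoch:
    conf_ver: int  # changes with Region membership
    version: int   # changes with split and merge


@dataclass
class Region:
    """Hypothetical model of a Region applying committed log entries."""
    epoch: RegionEpoch
    applied: list = field(default_factory=list)
    skipped: list = field(default_factory=list)

    def apply(self, entry):
        kind, proposed_epoch, payload = entry
        # Assumption: an entry is rejected if the version it was proposed
        # under no longer matches the Region's current version.
        if proposed_epoch.version != self.epoch.version:
            self.skipped.append(payload)
            return "epochNotMatch"
        if kind in ("split", "commit_merge"):
            # Assumption: both operations bump the version by 1 here.
            self.epoch = RegionEpoch(self.epoch.conf_ver, self.epoch.version + 1)
        else:
            self.applied.append(payload)
        return "ok"


e5 = RegionEpoch(conf_ver=1, version=5)
region = Region(epoch=e5)
log = [
    ("write", e5, "put k1"),
    ("commit_merge", e5, None),
    ("write", e5, "put k2"),  # proposed before the merge was applied
]
results = [region.apply(entry) for entry in log]
```

In this model the second write is skipped because it was proposed under version 5 while the Region has already moved to version 6.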

| username: Billmay表妹

In TiDB, a Region merge operation indeed involves no downtime. This is because, during the merge process, TiKV uses a mechanism called “Atomic Region Merge” to ensure data consistency and availability, thereby avoiding service unavailability.

In the Atomic Region Merge mechanism, TiKV divides the merge operation into multiple stages and performs a series of operations at each stage to ensure data correctness. Specifically, changes to the Region epoch occur in the first stage of the merge process.

When performing a merge operation, TiKV increases the epoch of the target Region and sets it to a special value. This special epoch value indicates that the Region is undergoing a merge operation and will not be affected by other operations. At the same time, TiKV sets the epoch of the source Region to a larger value to ensure that the source Region is not affected by other operations during the merge process.

In this way, TiKV maintains the availability of the target Region during the merge process and avoids the occurrence of epoch not match errors. During the merge process, TiKV determines whether to allow operations on the Region based on changes in the epoch, thereby ensuring data consistency.

It should be noted that although there is no downtime during the merge process, there may be a period of service unavailability during the split process. This is because during the split process, a Region needs to be divided into multiple sub-Regions, which involves data redistribution and adjustment, potentially leading to a period of service unavailability.

| username: ylldty

“TiKV will determine whether to allow operations on a Region based on changes in the epoch.”

This needs more detail:

  1. During the merge process, the Region epoch version also increases. How does TiKV compare the current Region epoch version with the epoch version carried by a proposal? How does this check differ from the one used for split, and why does it avoid the epochNotMatch error?
  2. If TiKV decides not to allow the operation but does not report an epochNotMatch error, what happens next?

“When performing a merge operation, TiKV will increase the epoch of the target Region and set it to a special value. This special epoch value indicates that the Region is undergoing a merge operation.”

This logic does not seem to be reflected in the documentation.

| username: neilshen

“When the target Region applies CommitMerge, the epoch version also increases. Why does this avoid issues similar to split?”

Successfully applying CommitMerge does increase the epoch, so merge has the same issue as split. However, I don’t consider this an availability problem. The client can simply retry after refreshing its epoch information, which usually takes 10 to 100 milliseconds.
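The retry described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeStore`, `put_with_retry`, the retry count, and the backoff value are hypothetical, and a real client would refresh its Region cache rather than read a version number out of the exception.

```python
import time


class EpochNotMatch(Exception):
    """Raised by the fake store when the caller's epoch version is stale."""

    def __init__(self, current_version):
        super().__init__(f"epoch not match, store is at version {current_version}")
        self.current_version = current_version


class FakeStore:
    """Hypothetical stand-in for a store holding a single Region."""

    def __init__(self, version):
        self.version = version
        self.data = {}

    def put(self, epoch_version, key, value):
        if epoch_version != self.version:
            raise EpochNotMatch(self.version)
        self.data[key] = value


def put_with_retry(store, cached_version, key, value, max_retries=3, backoff_s=0.01):
    """Write once; on a stale epoch, refresh the cached version and retry.

    Returns the epoch version the write finally succeeded with.
    """
    version = cached_version
    for attempt in range(max_retries + 1):
        try:
            store.put(version, key, value)
            return version
        except EpochNotMatch as err:
            if attempt == max_retries:
                raise
            version = err.current_version  # refresh the stale epoch
            time.sleep(backoff_s)          # short pause before retrying


store = FakeStore(version=5)
store.version += 1  # simulate CommitMerge being applied on the target Region
final_version = put_with_retry(store, cached_version=5, key="k", value="v")
```

The first attempt fails with the stale version 5, the client picks up version 6 from the error, and the second attempt succeeds, so the caller sees added latency rather than a failed write.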

| username: Aionn

Learned.

| username: zhaokede

I haven’t really looked at TiDB’s source code, so I learned something new.

| username: xfworld

Whether it is merge or split, all versions are related to TSO, which ensures that

TiKV uses a mechanism called "Atomic Region Merge" to ensure data consistency and availability.

Additionally, the scheduling is initiated by PD, which has information on all regions and keeps this information continuously updated through heartbeats.
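As a rough illustration of how heartbeats can keep a central view of Regions up to date, here is a small sketch. It is a hypothetical model, not PD's implementation: `RegionCache` and `on_heartbeat` are made-up names, and the rule of ignoring heartbeats whose epoch is older than the cached one is an assumption used to show why epochs matter for this bookkeeping.

```python
class RegionCache:
    """Hypothetical model of a scheduler's view of Regions, keyed by Region id."""

    def __init__(self):
        self.regions = {}  # region_id -> (conf_ver, version)

    def on_heartbeat(self, region_id, conf_ver, version):
        """Record the reported epoch; return False if the report is stale."""
        known = self.regions.get(region_id)
        if known is not None:
            known_conf_ver, known_version = known
            # Assumption: a report with an older epoch is ignored.
            if version < known_version or conf_ver < known_conf_ver:
                return False
        self.regions[region_id] = (conf_ver, version)
        return True


cache = RegionCache()
first = cache.on_heartbeat(region_id=1, conf_ver=1, version=5)
after_merge = cache.on_heartbeat(region_id=1, conf_ver=1, version=6)
stale = cache.on_heartbeat(region_id=1, conf_ver=1, version=5)
```

Here the late heartbeat carrying version 5 is ignored, so the cached view stays at version 6.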

The events between TiKV and PD are asynchronous and are triggered only when certain conditions are met. If you ask me what those conditions are, I can’t answer either; it’s too complicated… :rofl: :rofl: :rofl: :rofl:

For example:
TiKV compaction is also an asynchronous event, and the conditions for triggering and taking effect are extremely complex… :rofl: :rofl: