Why does TiKV scaling always trigger a bunch of merge-regions?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 为啥tikv扩容总是触发一堆merge-region?

| username: anteguitado

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.5
[Reproduction Path] Added several TiKV nodes through tiup cluster scale-out
[Encountered Problem: Symptoms and Impact] Found a bunch of merge-region tasks through operator show in pd-ctl, while balance-region tasks are very few. The scale-out was performed during off-peak hours at night, and operator show was basically empty before the scale-out.
[Resource Configuration] Scaled out from 8 TiKV nodes to 12
[Attachments: Screenshots/Logs/Monitoring]





| username: dba远航 | Original post link

Region merging is governed by the Region size parameters.

| username: anteguitado | Original post link

There is very little business traffic at night, so the merge-region operators are clearly caused by the scale-out. Why does scaling out require a bunch of merge-regions?

| username: Kongdom | Original post link

:thinking: Is it possible that the scale-out reduced the number of regions on each node, and that this then triggered the region merges?

| username: 江湖故人 | Original post link

With region merge enabled, a region is only merged when both its size is below max-merge-region-size and its number of keys is below max-merge-region-keys. Could it be that you just happened to catch it at that moment? How many times did you check before and after the scale-out?
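To make the condition above concrete, here is a minimal sketch of the size/keys check. It is not PD's actual code: the `RegionStats` type, the function name, and the inclusive boundary are assumptions made for illustration, and the threshold values 20 (MiB) and 200000 are placeholders that should be replaced with whatever `config show` reports for your cluster. PD also applies other checks (for example on the adjacent region) that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Hypothetical container for the two statistics the check uses."""
    approximate_size_mib: int
    approximate_keys: int


def is_merge_candidate(region, max_merge_region_size=20, max_merge_region_keys=200_000):
    """Return True only if BOTH the size and the key count are within the limits.

    The thresholds are illustrative placeholders, not values read from a cluster.
    Treating the boundary as inclusive is an assumption of this sketch.
    """
    small_enough = region.approximate_size_mib <= max_merge_region_size
    few_enough_keys = region.approximate_keys <= max_merge_region_keys
    return small_enough and few_enough_keys


# A small region qualifies; a region that is small but has too many keys does not.
tiny = RegionStats(approximate_size_mib=3, approximate_keys=10_000)
key_heavy = RegionStats(approximate_size_mib=3, approximate_keys=500_000)
print(is_merge_candidate(tiny), is_merge_candidate(key_heavy))
```

The point of the sketch is that either statistic alone being small is not enough; a region that exceeds one of the two limits is left alone.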

| username: 江湖故人 | Original post link

This graph cannot illustrate the trend of region merge, as the schedule also includes migration operations.

| username: anteguitado | Original post link

I actually didn’t understand this chart either :sweat_smile:, but I noticed that both skip and no-target-store are very high. Does this indicate some kind of anomaly?

| username: 江湖故人 | Original post link

I don’t understand what “skip” is either.
@Kongdom, can you help clarify?

| username: anteguitado | Original post link

It wasn’t a one-off observation. After the scale-out started I ran operator show repeatedly, and this is the overall pattern I saw; I just picked one instance to post here.
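For anyone who wants to quantify this rather than eyeball it, repeated samples can be tallied with a short script. This is only a sketch: it assumes each sample has already been captured as a list of operator description strings and that every description begins with the operator name (such as merge-region or balance-leader). That input format and the function name `tally_operator_kinds` are assumptions for illustration, so adjust the parsing to whatever your pd-ctl version actually prints.

```python
from collections import Counter


def tally_operator_kinds(samples):
    """Count operators by their leading name across several samples.

    `samples` is a list of samples; each sample is a list of strings,
    one per operator. The first whitespace-separated token of each string
    is assumed (hypothetically) to be the operator name.
    """
    counts = Counter()
    for sample in samples:
        for description in sample:
            description = description.strip()
            if not description:
                continue  # ignore blank lines in the captured output
            counts[description.split()[0]] += 1
    return counts


# Made-up sample data in the assumed format, not real cluster output.
samples = [
    ["merge-region (example)", "balance-leader (example)"],
    ["merge-region (example)", "merge-region (example)"],
]
print(tally_operator_kinds(samples).most_common())
```

Note that a long-running operator appears in every sample taken while it is in flight, so slow operators are over-represented in a tally like this.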

| username: anteguitado | Original post link

I have added two images. It seems that there are still many balance-leader and balance-region operations occurring during scaling, but there are indeed a bunch of merge-region operations accompanying balance-leader.

Maybe the reason we always see a bunch of merge-region operations in the operator show is that merge-region is the heaviest and slowest of these operations. But why does scaling trigger merge-region?

| username: Kongdom | Original post link

:thinking: It should mean literally what it says: the scheduling was skipped.

| username: 江湖故人 | Original post link

Indeed, the trend in the second graph is very clear. In theory, merging during a scale-out could reduce the number of IO operations, but I haven’t found any documentation that confirms this.

| username: 烂番薯0 | Original post link

…Uh, I can’t understand the monitoring.

| username: dba远航 | Original post link

It looks like merge-region checks were run, but almost all of them were skipped or resulted in no operation, so nothing actually changed.

| username: anteguitado | Original post link

Judging by operator show and Schedule Operator Create, that’s not the case: many merge-region operators are being executed. I didn’t post the Schedule Operator Finish data, but it is basically consistent with the Schedule Operator Create data, which indicates that the operators really were executed.

| username: tidb菜鸟一只 | Original post link

My guess is that after the migration, a merge check was performed on the regions moved to the new nodes, but no actual merge took place.