[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5.5
[Reproduction Path] Added several TiKV nodes through tiup cluster scale-out
[Encountered Problem: Symptoms and Impact] Running operator show in pd-ctl after the scale-out showed many merge-region operators but very few balance-region operators. The scale-out was performed at night during off-peak hours, and operator show was basically empty before it started.
[Resource Configuration] Scaled out from 8 TiKV nodes to 12
[Attachments: Screenshots/Logs/Monitoring]
There isn’t much business traffic at night, so the merge-region operators are clearly caused by the scale-out. Why does scaling out produce so many merge-region operators?
With region merge enabled, PD only merges a region when its size is no larger than max-merge-region-size and its key count is no larger than max-merge-region-keys; both conditions must hold. Could it be that you just happened to catch it at that moment? How many times did you check before and after the scale-out?
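To make that size-and-keys condition concrete, here is a minimal Python sketch. It is not PD's actual code: the RegionStats class and the is_merge_candidate function are made up for illustration, and the default values of 20 (MiB) and 200000 (keys) are assumptions that should be checked against config show in pd-ctl on your own cluster. Real PD also applies other checks (for example, how recently the region was split) that are left out here.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Hypothetical container for the two numbers the merge check looks at."""
    approximate_size_mib: int
    approximate_keys: int


def is_merge_candidate(
    region: RegionStats,
    max_merge_region_size_mib: int = 20,    # assumed default, verify with `config show`
    max_merge_region_keys: int = 200_000,   # assumed default, verify with `config show`
) -> bool:
    """Return True only if the region is small in BOTH size and key count."""
    small_enough = region.approximate_size_mib <= max_merge_region_size_mib
    few_enough_keys = region.approximate_keys <= max_merge_region_keys
    return small_enough and few_enough_keys


if __name__ == "__main__":
    # A small region in both dimensions qualifies; failing either limit does not.
    print(is_merge_candidate(RegionStats(5, 10_000)))
    print(is_merge_candidate(RegionStats(5, 500_000)))
    print(is_merge_candidate(RegionStats(96, 10_000)))
```

Under these assumptions, a region that is small in size but holds many keys (or the other way around) is not a merge candidate.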
I didn’t really understand this chart either, but I noticed that both skip and no-target-store are very high. Does this indicate some kind of anomaly?
It wasn’t just a single observation. I ran operator show many times after the scale-out started and this is the overall pattern; I picked one of those outputs to post here.
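For anyone who wants to summarize repeated operator show outputs instead of eyeballing them, here is a rough Python sketch. It only counts how often each operator name appears in the saved text of several runs, so it does not depend on the exact output format, which is not reproduced here. The tally_operators function and the sample snapshot strings are hypothetical, not real pd-ctl output.

```python
from collections import Counter
from typing import Iterable

# Operator names mentioned in this thread.
OPERATOR_NAMES = ("merge-region", "balance-region", "balance-leader")


def tally_operators(snapshots: Iterable[str]) -> Counter:
    """Count mentions of each operator name across saved `operator show` outputs.

    This is an approximation: it counts occurrences of the name in the text,
    not parsed operator objects.
    """
    totals: Counter = Counter()
    for text in snapshots:
        for name in OPERATOR_NAMES:
            totals[name] += text.count(name)
    return totals


if __name__ == "__main__":
    # Hypothetical, simplified snapshots standing in for saved command output.
    saved_outputs = [
        "merge-region ...\nmerge-region ...\nbalance-leader ...",
        "balance-region ...\nmerge-region ...",
    ]
    for name, count in tally_operators(saved_outputs).most_common():
        print(name, count)
```

Keep in mind that a tally like this over-represents long-running operators, because a slow operator shows up in several consecutive snapshots.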
I have added two images. It seems that there are still many balance-leader and balance-region operations occurring during scaling, but there are indeed a bunch of merge-region operations accompanying balance-leader.
Maybe the reason we always see a bunch of merge-region operations in the operator show is that merge-region is the heaviest and slowest of these operations. But why does scaling trigger merge-region?
Indeed, the trend in the second graph is already very obvious. Theoretically, performing a merge during expansion can reduce the number of IO operations, but I haven’t found any related documentation.
Looking at operator show and Schedule Operator Create, that’s not the case. There are many merge-regions being executed. Additionally, the Schedule Operator Finish data is not posted, but it is basically consistent with the Schedule Operator Create data, indicating that they were indeed executed.