Questions about the Region Split Mechanism

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于 region split 机制的疑问 (Questions about the region split mechanism)

| username: Qiuchi

[TiDB Usage Environment] Testing
[TiDB Version] v6.5.0
[Encountered Issue: Problem Phenomenon and Impact]
Based on my current understanding of region split and merge, the main related parameters are as follows:

-- Controls the maximum size of a Region that PD will merge
show config where name like 'schedule.max-merge-region-size'; -- 20MiB
show config where name like 'schedule.max-merge-region-keys'; -- 200000

-- Controls the maximum size a Region can reach before TiKV splits it
show config where name like 'coprocessor.region-max-size'; -- 144MiB
show config where name like 'coprocessor.region-max-keys'; -- 1440000

-- Controls the size of the new Regions produced when an existing Region is split
show config where name like 'coprocessor.region-split-size'; -- 96MiB
show config where name like 'coprocessor.region-split-keys'; -- 960000
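
For reference, these settings can also be changed online with SET CONFIG if the version supports it; the statements below are only an illustration (the values are simply the defaults shown above, not a recommendation):

-- Illustrative only: change the merge/split thresholds online (values are just the defaults above).
SET CONFIG pd `schedule.max-merge-region-size` = 20;
SET CONFIG pd `schedule.max-merge-region-keys` = 200000;
SET CONFIG tikv `coprocessor.region-max-size` = '144MiB';
SET CONFIG tikv `coprocessor.region-split-size` = '96MiB';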

First, schedule.max-merge-region-size and schedule.max-merge-region-keys seem a bit ambiguous to me. I think it is worth clarifying whether these settings are meant to limit region merging or to trigger it.


My main question is about the third group, i.e., coprocessor.region-split-size / keys. Suppose there is a region of 150MiB which, according to the first two groups of configurations, needs to be split. How are the number and sizes of the resulting regions decided?

Can I assume that since coprocessor.region-split-size is 96MiB, we will get two regions of 96MiB and 54MiB respectively? If coprocessor.region-split-size is set to 40MiB, then we will get four regions of 40MiB, 40MiB, 40MiB, and 30MiB from left to right.

Another question: when using auto_increment, if the region is split according to the scheme above, will all new data still be concentrated in the rightmost region after the split? In that case, shouldn't almost all of our regions end up at about 96MiB, while during continuous insertion only the rightmost region varies between 54MiB and 144MiB and bears most of the write pressure?

| username: TiDBer_jYQINSnf | Original post link

Splitting is done by halving, and there are two algorithms. One is precise splitting, which scans all the keys within the region's range to find the key at the halfway point. The other is approximate splitting, which estimates the key that roughly halves the region and splits on that key. max-merge-region-size/keys can be seen as both a limit and a trigger: a region that exceeds this limit will not be merged, while one below it becomes eligible for merging.
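
A rough way to check the halving behavior yourself (the database and table names here are placeholders for whatever you are testing with) is to compare the approximate sizes of the regions covering the table before and after a split:

-- Placeholder names: database 'test', table 't'. Compare region sizes around a split.
SELECT REGION_ID, START_KEY, END_KEY, APPROXIMATE_SIZE, APPROXIMATE_KEYS
FROM information_schema.TIKV_REGION_STATUS
WHERE DB_NAME = 'test' AND TABLE_NAME = 't'
ORDER BY START_KEY;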

| username: Qiuchi | Original post link

I was confused because Figure 2 says "will not merge when exceeding this limit," but it doesn't explain whether a region below the limit will definitely merge, or what controls that. If it were controlled by this single parameter, I'd expect the wording to be different.
(I had just confused myself as well.)

| username: TiDBer_jYQINSnf | Original post link

Misinterpreted the question…
These are the merge parameters.
If it exceeds 20 MB, it will not merge.
If it exceeds 144 MB, it will split. Suppose it is about 72 MB after the split; then, as data is deleted, the region gradually shrinks, and once it drops below 20 MB it starts to be merged.
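
Whether an eligible region actually gets merged promptly also depends on other PD scheduling settings. As a reference, they can be checked as follows (the meanings in the comments reflect my understanding, so verify against the docs for your version):

-- Other PD settings that affect when a small region is actually merged (verify on your version).
show config where name like 'schedule.merge-schedule-limit';     -- number of concurrent merge scheduling tasks
show config where name like 'schedule.split-merge-interval';     -- a newly split region is not merged within this interval
show config where name like 'schedule.enable-cross-table-merge'; -- whether regions spanning different tables may merge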

| username: Qiuchi | Original post link

If it always splits in half, how does it determine the number of regions to split into? Also, what impact does coprocessor.region-split-size have on this process?

| username: TiDBer_jYQINSnf | Original post link

It splits into two. I'm not sure about the coprocessor side; I usually look at this from PD. If splitting into two still doesn't satisfy the conditions, a split is triggered again on the new region and it is halved once more.

| username: tidb菜鸟一只 | Original post link

My understanding:
schedule.max-merge-region-size is a parameter in the scheduler used to limit the maximum size (MB) for Region merging. When a Region’s size exceeds this limit, it will no longer merge with other Regions.
coprocessor.region-max-size is a parameter for the Coprocessor. It limits the maximum size (MB) of a single Region for the Coprocessor. When a Region’s size exceeds this limit, it starts to split.
coprocessor.region-split-size is also a parameter for the Coprocessor. It is similar to coprocessor.region-max-size, and when the threshold is exceeded, splitting begins. However, this is a new parameter introduced in version 6.1.0.

| username: Qiuchi | Original post link

I see that this is related to the size of the regions after splitting, but its purpose isn't explained in detail. I previously read this thread: Region Sharding - TiDB Q&A Community (asktug.com), where people mentioned that "the default size of the regions allocated after splitting is 96MiB," but it still feels unclear, and I'm not sure whether it leads to regions of inconsistent sizes after a split.

Additionally, the phrase "split into multiple" in this parameter's description makes me unsure whether a split necessarily produces exactly two regions.

| username: tidb菜鸟一只 | Original post link

Taking the example you gave: a region of 150MiB that, according to the first two groups of configurations, needs to be split. How do we decide the number and sizes of the regions after the split?

When coprocessor.region-split-size is set to 96MiB, we will get two regions of 75MiB each after the split, because 75 < 96, so no further splitting will occur. If coprocessor.region-split-size is set to 40MiB, we will first get two regions of 75MiB each, and since 75 > 40, they will be further split into four regions of 37.5MiB each.

Another issue is when using auto_increment. If regions are split as described above, then after the split all new data will still be concentrated in the rightmost region. At this point the size of all the regions should be almost 144MiB, and during continuous inserts only the rightmost region changes in size, from 0MiB to 144MiB; once it reaches 144MiB, a new region is created. So when using auto_increment there is indeed a write hotspot problem. To address this, TiDB provides the Split Region syntax, which is specifically aimed at short-term bulk-write scenarios.
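
For reference, a minimal sketch of that pre-split approach (the table name and value range here are made up for illustration, and the statement assumes an integer primary key):

-- Hypothetical example: pre-split table t's row data into 16 regions over the expected id range,
-- so that initial writes are not all directed at a single region.
SPLIT TABLE t BETWEEN (0) AND (1000000) REGIONS 16;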

| username: Qiuchi | Original post link

> When coprocessor.region-split-size is set to 96MiB, we will get two regions of 75MiB each after the split, because 75 < 96, so no further splitting will occur. If coprocessor.region-split-size is set to 40MiB, we will first get two regions of 75MiB each, and since 75 > 40, they will be further split into four regions of 37.5MiB each.

But in this case, before a region reaches the limit of coprocessor.region-max-size = 144MiB, it will first hit the threshold of coprocessor.region-split-size = 40 or 75 and split. Doesn’t this mean that the purposes of the two parameters conflict?

In the second question, I mainly want to focus on which parameter controls the average size of the region. Because if only the rightmost region is being split, and the previously split regions do not have new data written to them (due to auto_increment), will this result in most regions actually being half the size of coprocessor.region-max-size?

| username: tidb菜鸟一只 | Original post link

The coprocessor.region-split-size value is the target size of the regions produced by a split. It is not a threshold that gets hit during natural growth; it only comes into play once region-max-size is reached and a split is triggered. Otherwise, if region-split-size (being smaller than region-max-size) triggered splits on its own, the region-max-size parameter would be useless.

Regarding the second question, theoretically, if you use auto_increment, it will indeed result in most regions being about half the size of coprocessor.region-max-size.
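
If you want to verify that, a rough check (database and table names are placeholders) is to look at the average approximate size of the regions covering the table:

-- Placeholder names: database 'test', table 't'. Average region size for the table's data.
SELECT COUNT(*) AS region_count, AVG(APPROXIMATE_SIZE) AS avg_size_mb
FROM information_schema.TIKV_REGION_STATUS
WHERE DB_NAME = 'test' AND TABLE_NAME = 't';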

| username: Qiuchi | Original post link

I just tested it with the parameters set as in the question. When the region first splits, the left and right halves are indeed exactly the same size. But although I kept inserting with auto-increment IDs, the split does not always happen on the region at the end of the range; sometimes the region at the beginning of the range gets adjusted instead, even though no new data was being inserted there at the time. (See the change between the 340,000-row and 440,000-row listings.)

A bit strange.
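
The listings below were taken with a region status query, presumably something along these lines, where the table name stands in for whatever the test table was called:

-- Hypothetical table name 't': shows region boundaries, sizes, and write/read bytes for the table.
SHOW TABLE t REGIONS;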

When there is no data

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1152529 | t_25197_ |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 248 | 138042 | 29 | 41397 |  |  |

After inserting 160,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315381 | t_25197_ | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 39 | 0 | 75 | 321510 |  |  |
| 1152529 | t_25197_r_108643 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 95544232 | 119890 | 75 | 321510 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315381 | t_25197_ | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 97245772 | 95 | 537284 |  |  |
| 1152529 | t_25197_r_108643 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 0 | 293310 | 131 | 103120 |  |  |

After inserting 340,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315381 | t_25197_ | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 47125038 | 117705772 | 123 | 817284 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 0 | 81 | 232868 |  |  |
| 1152529 | t_25197_r_270175 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 64541976 | 119092468 | 81 | 232868 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315381 | t_25197_ | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 31416720 | 63240000 | 131 | 897284 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 0 | 81 | 232868 |  |  |
| 1152529 | t_25197_r_270175 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 266 | 67055 | 119 | 106304 |  |  |

After inserting 440,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 39 | 0 | 75 | 548642 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 78541260 | 40920000 | 75 | 548642 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 39 | 0 | 110 | 75297 |  |  |
| 1152529 | t_25197_r_270175 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 0 | 293310 | 119 | 106304 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 0 | 0 | 107 | 902733 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 278 | 0 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 39 | 0 | 110 | 75297 |  |  |
| 1152529 | t_25197_r_270175 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 0 | 293310 | 138 | 228509 |  |  |

After inserting 540,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 78540552 | 60210511 | 126 | 1214773 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 57495261 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 39 | 0 | 110 | 75297 |  |  |
| 1315393 | t_25197_r_270175 | t_25197_r_431708 | 1315395 | 7 | 1315394, 1315395, 1315396 | 0 | 39 | 0 | 92 | 271336 |  |  |
| 1152529 | t_25197_r_431708 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 80676142 | 86015 | 92 | 271336 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 0 | 50220000 | 126 | 1214773 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 0 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 100339986 | 110 | 75297 |  |  |
| 1315393 | t_25197_r_270175 | t_25197_r_431708 | 1315395 | 7 | 1315394, 1315395, 1315396 | 0 | 0 | 100340050 | 95 | 157031 |  |  |
| 1152529 | t_25197_r_431708 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 0 | 289060 | 109 | 154677 |  |  |

After inserting 660,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 94250670 | 107456465 | 141 | 1294794 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 0 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 0 | 110 | 75297 |  |  |
| 1315393 | t_25197_r_270175 | t_25197_r_431708 | 1315395 | 7 | 1315394, 1315395, 1315396 | 0 | 0 | 0 | 95 | 157031 |  |  |
| 1315397 | t_25197_r_431708 | t_25197_r_593240 | 1315399 | 7 | 1315398, 1315399, 1315400 | 0 | 39 | 0 | 80 | 229803 |  |  |
| 1152529 | t_25197_r_593240 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 96808754 | 71305 | 80 | 229803 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 94250670 | 107456465 | 141 | 1294794 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 0 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 0 | 110 | 75297 |  |  |
| 1315393 | t_25197_r_270175 | t_25197_r_431708 | 1315395 | 7 | 1315394, 1315395, 1315396 | 0 | 0 | 0 | 95 | 157031 |  |  |
| 1315397 | t_25197_r_431708 | t_25197_r_593240 | 1315399 | 7 | 1315398, 1315399, 1315400 | 0 | 39 | 0 | 80 | 229803 |  |  |
| 1152529 | t_25197_r_593240 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 266 | 222005 | 77 | 121525 |  |  |

| username: tidb菜鸟一只 | Original post link

Region splitting in TiDB is not triggered only when a region exceeds its maximum size. A split request is also issued when hotspot data puts excessive load pressure on a region; in that case the scheduler in the TiKV cluster issues a Region split request as well.
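
The load-based case corresponds to TiKV's Load Base Split. Its thresholds can be inspected the same way as the other settings above (the comments describe my understanding of the parameters; double-check the docs for your version):

-- Load Base Split thresholds in TiKV (verify names/semantics for your version).
show config where name like 'split.qps-threshold';  -- QPS above which a hot region becomes a split candidate
show config where name like 'split.byte-threshold'; -- read traffic (bytes/s) above which a hot region becomes a split candidate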

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.