Questions about the Region Split Mechanism

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于 region split 机制的疑问 (Questions about the region split mechanism)

| username: Qiuchi

[TiDB Usage Environment] Testing
[TiDB Version] v6.5.0
[Encountered Issue: Problem Phenomenon and Impact]
Based on my current understanding of region split and merge, the main related parameters are as follows:

-- Controls the maximum size of a Region that PD will merge
show config where name like 'schedule.max-merge-region-size'; -- 20MiB
show config where name like 'schedule.max-merge-region-keys'; -- 200000

-- Controls the maximum size a Region can reach before TiKV splits it
show config where name like 'coprocessor.region-max-size'; -- 144MiB
show config where name like 'coprocessor.region-max-keys'; -- 1440000

-- Controls the size of the new Regions produced when an existing Region is split
show config where name like 'coprocessor.region-split-size'; -- 96MiB
show config where name like 'coprocessor.region-split-keys'; -- 960000
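
For reference, these settings can also be changed online with SET CONFIG if the version supports it; the statements below are only an illustration (the values are simply the defaults shown above, not a recommendation):

-- Illustrative only: change the merge/split thresholds online (values are just the defaults above).
SET CONFIG pd `schedule.max-merge-region-size` = 20;
SET CONFIG pd `schedule.max-merge-region-keys` = 200000;
SET CONFIG tikv `coprocessor.region-max-size` = '144MiB';
SET CONFIG tikv `coprocessor.region-split-size` = '96MiB';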

First, schedule.max-merge-region-size and schedule.max-merge-region-keys seem a bit ambiguous to me. I think it is worth clarifying whether these settings are meant to limit region merging or to trigger it.


My main question is about the third group, i.e., coprocessor.region-split-size / keys. Suppose there is a region of 150MiB which, according to the first two groups of configurations, needs to be split. How are the number and sizes of the resulting regions decided?

Can I assume that since coprocessor.region-split-size is 96MiB, we will get two regions of 96MiB and 54MiB respectively? If coprocessor.region-split-size is set to 40MiB, then we will get four regions of 40MiB, 40MiB, 40MiB, and 30MiB from left to right.

Another question: when using auto_increment, if the region is split according to the scheme above, will all new data still be concentrated in the rightmost region after the split? In that case, shouldn't almost all of our regions end up at about 96MiB, while during continuous insertion only the rightmost region varies between 54MiB and 144MiB and bears most of the write pressure?

| username: TiDBer_jYQINSnf | Original post link

Splitting is done by halving, and there are two algorithms. One is precise splitting, which scans all the keys within the region's range to find the key at the halfway point. The other is approximate splitting, which estimates the key that roughly halves the region and splits on that key. max-merge-region-size/keys can be seen as both a limit and a trigger: a region that exceeds this limit will not be merged, while one below it becomes eligible for merging.
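
A rough way to check the halving behavior yourself (the database and table names here are placeholders for whatever you are testing with) is to compare the approximate sizes of the regions covering the table before and after a split:

-- Placeholder names: database 'test', table 't'. Compare region sizes around a split.
SELECT REGION_ID, START_KEY, END_KEY, APPROXIMATE_SIZE, APPROXIMATE_KEYS
FROM information_schema.TIKV_REGION_STATUS
WHERE DB_NAME = 'test' AND TABLE_NAME = 't'
ORDER BY START_KEY;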

| username: Qiuchi | Original post link

I was confused because Figure 2 says "will not merge when exceeding this limit," but it doesn't explain whether a region below the limit will definitely merge, or what controls that. If it were controlled by this single parameter, I'd expect the wording to be different.
(I had just confused myself as well.)

| username: TiDBer_jYQINSnf | Original post link

Misinterpreted the question…
These are the merge parameters.
If it exceeds 20 MB, it will not merge.
If it exceeds 144 MB, it will split. Suppose it is about 72 MB after the split; then, as data is deleted, the region gradually shrinks, and once it drops below 20 MB it starts to be merged.
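
Whether an eligible region actually gets merged promptly also depends on other PD scheduling settings. As a reference, they can be checked as follows (the meanings in the comments reflect my understanding, so verify against the docs for your version):

-- Other PD settings that affect when a small region is actually merged (verify on your version).
show config where name like 'schedule.merge-schedule-limit';     -- number of concurrent merge scheduling tasks
show config where name like 'schedule.split-merge-interval';     -- a newly split region is not merged within this interval
show config where name like 'schedule.enable-cross-table-merge'; -- whether regions spanning different tables may merge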

| username: Qiuchi | Original post link

If it always splits in half, how does it determine the number of regions to split into? Also, what impact does coprocessor.region-split-size have on this process?

| username: TiDBer_jYQINSnf | Original post link

It splits into two. I'm not sure about the coprocessor side; I usually look at this from PD. If splitting into two still doesn't satisfy the conditions, a split is triggered again on the new region and it is halved once more.

| username: tidb菜鸟一只 | Original post link

My understanding:
schedule.max-merge-region-size is a parameter in the scheduler used to limit the maximum size (MB) for Region merging. When a Region’s size exceeds this limit, it will no longer merge with other Regions.
coprocessor.region-max-size is a parameter for the Coprocessor. It limits the maximum size (MB) of a single Region for the Coprocessor. When a Region’s size exceeds this limit, it starts to split.
coprocessor.region-split-size is also a parameter for the Coprocessor. It is similar to coprocessor.region-max-size, and when the threshold is exceeded, splitting begins. However, this is a new parameter introduced in version 6.1.0.

| username: Qiuchi | Original post link

I see that this is related to the size of the regions after splitting, but its purpose isn't explained in detail. I previously read this thread: Region Sharding - TiDB Q&A Community (asktug.com), where people mentioned that "the default size of the regions allocated after splitting is 96MiB," but it still feels unclear, and I'm not sure whether it leads to regions of inconsistent sizes after a split.

Additionally, the phrase "split into multiple" in this parameter's description makes me unsure whether a split necessarily produces exactly two regions.

| username: tidb菜鸟一只 | Original post link

Taking the example you gave: a region of 150MiB that, according to the first two groups of configurations, needs to be split. How do we decide the number and sizes of the regions after the split?

When coprocessor.region-split-size is set to 96MiB, we will get two regions of 75MiB each after the split, because 75 < 96, so no further splitting will occur. If coprocessor.region-split-size is set to 40MiB, we will first get two regions of 75MiB each, and since 75 > 40, they will be further split into four regions of 37.5MiB each.

Another issue is when using auto_increment. If regions are split as described above, then after the split all new data will still be concentrated in the rightmost region. At this point the size of all the regions should be almost 144MiB, and during continuous inserts only the rightmost region changes in size, from 0MiB to 144MiB; once it reaches 144MiB, a new region is created. So when using auto_increment there is indeed a write hotspot problem. To address this, TiDB provides the Split Region syntax, which is specifically aimed at short-term bulk-write scenarios.
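
For reference, a minimal sketch of that pre-split approach (the table name and value range here are made up for illustration, and the statement assumes an integer primary key):

-- Hypothetical example: pre-split table t's row data into 16 regions over the expected id range,
-- so that initial writes are not all directed at a single region.
SPLIT TABLE t BETWEEN (0) AND (1000000) REGIONS 16;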

| username: Qiuchi | Original post link

> When coprocessor.region-split-size is set to 96MiB, we will get two regions of 75MiB each after the split, because 75 < 96, so no further splitting will occur. If coprocessor.region-split-size is set to 40MiB, we will first get two regions of 75MiB each, and since 75 > 40, they will be further split into four regions of 37.5MiB each.

But in this case, before a region reaches the limit of coprocessor.region-max-size = 144MiB, it will first hit the threshold of coprocessor.region-split-size = 40 or 75 and split. Doesn’t this mean that the purposes of the two parameters conflict?

In the second question, I mainly want to focus on which parameter controls the average size of the region. Because if only the rightmost region is being split, and the previously split regions do not have new data written to them (due to auto_increment), will this result in most regions actually being half the size of coprocessor.region-max-size?

| username: tidb菜鸟一只 | Original post link

The coprocessor.region-split-size value is the target size of the regions produced by a split. It is not a threshold that gets hit during natural growth; it only comes into play once region-max-size is reached and a split is triggered. Otherwise, if region-split-size (being smaller than region-max-size) triggered splits on its own, the region-max-size parameter would be useless.

Regarding the second question, theoretically, if you use auto_increment, it will indeed result in most regions being about half the size of coprocessor.region-max-size.
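
If you want to verify that, a rough check (database and table names are placeholders) is to look at the average approximate size of the regions covering the table:

-- Placeholder names: database 'test', table 't'. Average region size for the table's data.
SELECT COUNT(*) AS region_count, AVG(APPROXIMATE_SIZE) AS avg_size_mb
FROM information_schema.TIKV_REGION_STATUS
WHERE DB_NAME = 'test' AND TABLE_NAME = 't';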

| username: Qiuchi | Original post link

I just tested it with the parameters set as in the question. When the region first splits, the left and right halves are indeed exactly the same size. But although I kept inserting with auto-increment IDs, the split does not always happen on the region at the end of the range; sometimes the region at the beginning of the range gets adjusted instead, even though no new data was being inserted there at the time. (See the change between the 340,000-row and 440,000-row listings.)

A bit strange.
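
The listings below were taken with a region status query, presumably something along these lines, where the table name stands in for whatever the test table was called:

-- Hypothetical table name 't': shows region boundaries, sizes, and write/read bytes for the table.
SHOW TABLE t REGIONS;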

When there is no data

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1152529 | t_25197_ |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 248 | 138042 | 29 | 41397 |  |  |

After inserting 160,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315381 | t_25197_ | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 39 | 0 | 75 | 321510 |  |  |
| 1152529 | t_25197_r_108643 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 95544232 | 119890 | 75 | 321510 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315381 | t_25197_ | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 97245772 | 95 | 537284 |  |  |
| 1152529 | t_25197_r_108643 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 0 | 293310 | 131 | 103120 |  |  |

After inserting 340,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315381 | t_25197_ | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 47125038 | 117705772 | 123 | 817284 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 0 | 81 | 232868 |  |  |
| 1152529 | t_25197_r_270175 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 64541976 | 119092468 | 81 | 232868 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315381 | t_25197_ | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 31416720 | 63240000 | 131 | 897284 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 0 | 81 | 232868 |  |  |
| 1152529 | t_25197_r_270175 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 266 | 67055 | 119 | 106304 |  |  |

After inserting 440,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 39 | 0 | 75 | 548642 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 78541260 | 40920000 | 75 | 548642 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 39 | 0 | 110 | 75297 |  |  |
| 1152529 | t_25197_r_270175 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 0 | 293310 | 119 | 106304 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 0 | 0 | 107 | 902733 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 278 | 0 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 39 | 0 | 110 | 75297 |  |  |
| 1152529 | t_25197_r_270175 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 0 | 293310 | 138 | 228509 |  |  |

After inserting 540,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 78540552 | 60210511 | 126 | 1214773 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 57495261 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 39 | 0 | 110 | 75297 |  |  |
| 1315393 | t_25197_r_270175 | t_25197_r_431708 | 1315395 | 7 | 1315394, 1315395, 1315396 | 0 | 39 | 0 | 92 | 271336 |  |  |
| 1152529 | t_25197_r_431708 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 80676142 | 86015 | 92 | 271336 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 0 | 50220000 | 126 | 1214773 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 0 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 100339986 | 110 | 75297 |  |  |
| 1315393 | t_25197_r_270175 | t_25197_r_431708 | 1315395 | 7 | 1315394, 1315395, 1315396 | 0 | 0 | 100340050 | 95 | 157031 |  |  |
| 1152529 | t_25197_r_431708 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 0 | 289060 | 109 | 154677 |  |  |

After inserting 660,000 rows of data

  • Just after split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 94250670 | 107456465 | 141 | 1294794 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 0 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 0 | 110 | 75297 |  |  |
| 1315393 | t_25197_r_270175 | t_25197_r_431708 | 1315395 | 7 | 1315394, 1315395, 1315396 | 0 | 0 | 0 | 95 | 157031 |  |  |
| 1315397 | t_25197_r_431708 | t_25197_r_593240 | 1315399 | 7 | 1315398, 1315399, 1315400 | 0 | 39 | 0 | 80 | 229803 |  |  |
| 1152529 | t_25197_r_593240 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 96808754 | 71305 | 80 | 229803 |  |  |

  • After some time post-split

| REGION_ID | START_KEY | END_KEY | LEADER_ID | LEADER_STORE_ID | PEERS | SCATTERING | WRITTEN_BYTES | READ_BYTES | APPROXIMATE_SIZE(MB) | APPROXIMATE_KEYS | SCHEDULING_CONSTRAINTS | SCHEDULING_STATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1315389 | t_25197_ | t_25197_r_16084 | 1315391 | 7 | 1315390, 1315391, 1315392 | 0 | 94250670 | 107456465 | 141 | 1294794 |  |  |
| 1315381 | t_25197_r_16084 | t_25197_r_108643 | 1315383 | 7 | 1315382, 1315383, 1315384 | 0 | 0 | 0 | 52 | 81920 |  |  |
| 1315385 | t_25197_r_108643 | t_25197_r_270175 | 1315387 | 7 | 1315386, 1315387, 1315388 | 0 | 0 | 0 | 110 | 75297 |  |  |
| 1315393 | t_25197_r_270175 | t_25197_r_431708 | 1315395 | 7 | 1315394, 1315395, 1315396 | 0 | 0 | 0 | 95 | 157031 |  |  |
| 1315397 | t_25197_r_431708 | t_25197_r_593240 | 1315399 | 7 | 1315398, 1315399, 1315400 | 0 | 39 | 0 | 80 | 229803 |  |  |
| 1152529 | t_25197_r_593240 |  | 1152531 | 7 | 1152530, 1152531, 1152532 | 0 | 266 | 222005 | 77 | 121525 |  |  |

| username: tidb菜鸟一只 | Original post link

Region splitting in TiDB is not triggered only when a region exceeds its maximum size. A split request is also issued when hotspot data puts excessive load pressure on a region; in that case the scheduler in the TiKV cluster issues a Region split request as well.
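
The load-based case corresponds to TiKV's Load Base Split. Its thresholds can be inspected the same way as the other settings above (the comments describe my understanding of the parameters; double-check the docs for your version):

-- Load Base Split thresholds in TiKV (verify names/semantics for your version).
show config where name like 'split.qps-threshold';  -- QPS above which a hot region becomes a split candidate
show config where name like 'split.byte-threshold'; -- read traffic (bytes/s) above which a hot region becomes a split candidate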

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.