What is the appropriate number of regions on TiKV?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv上region数量多少合适

| username: 路在何chu

[TiDB Usage Environment] Production Environment / Testing / PoC
4.0.13
[Reproduction Path] What operations were performed when the issue occurred
What is the appropriate number of regions, and when should expansion be considered?

| username: 大飞哥online | Original post link

There is no specific number for this; it depends on the business. A Region is 96 MiB by default. The more Regions there are, the more heartbeats are exchanged with PD, which consumes more resources. If disk space is insufficient, you will need to scale out.

| username: Fly-bird | Original post link

It seems there is no specific figure, so it depends on the resources you have available.

| username: Kongdom | Original post link

The number of Regions is not really the deciding factor; what matters is the used space on each TiKV node, which should not exceed 2 TB.

| username: tidb菜鸟一只 | Original post link

It is generally not recommended to have more than 20,000 regions on a single TiKV node, as exceeding this number may lead to performance degradation.

| username: 像风一样的男子 | Original post link

It depends on your budget. If you have the money, you can add as many TiKV nodes as you want.

| username: 大飞哥online | Original post link

This works, hahaha, money power.

| username: 路在何chu | Original post link

Currently, each TiKV has approximately 25,000 regions.

| username: 路在何chu | Original post link

It’s because there’s no money and they’re unwilling to scale up, hahaha.

| username: 像风一样的男子 | Original post link

You can talk to the business team and archive some unused data before deleting it to save costs.

| username: chenhanneu | Original post link

How much disk space do the 25k Regions occupy?

| username: 路在何chu | Original post link

Approximately 1.2T.
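
As a rough cross-check of the two figures reported in this thread, the sketch below divides the disk usage by the Region count to get an average footprint per Region. It treats the reported "1.2T" as 1.2 TiB, which is an assumption, and the function name is made up for illustration. Disk usage is not the same measure TiKV uses when deciding to split a Region, so the result is only indicative.

```python
MIB_PER_TIB = 1024 * 1024

def avg_region_mib(used_tib: float, region_count: int) -> float:
    """Average on-disk footprint per Region, in MiB."""
    if region_count <= 0:
        raise ValueError("region_count must be positive")
    return used_tib * MIB_PER_TIB / region_count

# Figures reported in this thread: about 1.2 T and about 25,000 Regions per TiKV.
avg = avg_region_mib(1.2, 25_000)
print(f"{avg:.1f} MiB per Region on average")
```

An average of about 50 MiB is well below the 96 MiB default mentioned earlier, which may suggest that a fair share of the Regions are small, though this estimate alone cannot confirm it.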

| username: 路在何chu | Original post link

Currently scaling down, will check after it’s done tomorrow.

| username: chenhanneu | Original post link

As the number of regions increases, does the memory usage of TiKV also gradually increase to maintain so many regions? How do you handle this memory alarm in the end? Have you encountered this situation before?

| username: 路在何chu | Original post link

We have 128GB of memory, with a usage rate of about 50%, and we haven’t encountered this issue. If it really doesn’t work, consider adding more memory.

| username: 路在何chu | Original post link

There is a table with over 500 GB of data that we are preparing to clean up.

| username: TiDBer_小阿飞 | Original post link

You can view the relevant monitoring metrics under the TiKV panel in Grafana. Check the Raft store CPU under Thread-CPU to see if it has reached a bottleneck. If it exceeds 85%, it is recommended to first adjust using the following strategies before considering expanding TiKV!

  1. If I/O resources and CPU resources are relatively sufficient, you can deploy multiple TiKV instances on a single machine to reduce the number of Regions on a single TiKV instance.
  2. Reduce the number of messages each Region sends per unit of time, for example by increasing `raft-base-tick-interval`, to lower the pressure on the Raftstore.
  3. Increase the concurrency of the Raftstore (`raftstore.store-pool-size`).
  4. Enable the Hibernate Region feature.
  5. Enabling Region Merge can also reduce the number of Regions. Contrary to Region Split, Region Merge is the process of merging adjacent small Regions through scheduling. After deleting data in the cluster or executing Drop Table/Truncate Table statements, small or even empty Regions can be merged to reduce resource consumption.
  6. The default size of a Region is about 96 MiB; increasing it can also reduce the number of Regions.
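
To illustrate item 6, here is a minimal sketch of how the Region size bounds the Region count for a fixed amount of data. It assumes every Region is completely full, so the numbers are a lower bound rather than a prediction. The 1.2 TiB data size is taken loosely from this thread, and the 192 and 256 MiB sizes are hypothetical values used only for comparison, not recommendations.

```python
import math

def estimated_regions(data_mib: float, region_size_mib: float) -> int:
    """Lower bound on the Region count if every Region were exactly full."""
    if region_size_mib <= 0:
        raise ValueError("region_size_mib must be positive")
    return math.ceil(data_mib / region_size_mib)

data_mib = 1.2 * 1024 * 1024  # about 1.2 TiB, expressed in MiB

# 96 MiB is the default mentioned above; the larger sizes are hypothetical.
for size_mib in (96, 192, 256):
    print(f"{size_mib} MiB per Region -> at least {estimated_regions(data_mib, size_mib)} Regions")
```

At the default size the lower bound is roughly 13,100 Regions, so a store that actually holds 25,000 Regions for a similar amount of data probably contains many partly filled ones. In that case Region Merge (item 5) may help before the Region size needs to be changed at all.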

| username: 路在何chu | Original post link

Increasing the Region size feels too risky, so I don’t dare to do it. CPU resources are definitely sufficient, but insert latency is unstable: some statements take tens of milliseconds, some take hundreds, and some even take up to a minute. The application cannot tolerate many requests slower than 300 ms.

| username: 路在何chu | Original post link

However, after adding a TiKV node yesterday, the number of SQL statements taking more than 300 ms has decreased significantly. Having fewer Regions per node definitely has its benefits.
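
The effect of adding a node on the per-store Region count can be sketched as below, assuming PD rebalances Regions evenly across all stores. The number of stores in this cluster is not given in the thread, so the value 5 is purely hypothetical, as is the function name.

```python
def regions_per_store_after_scale_out(
    regions_per_store: int, stores_before: int, stores_added: int = 1
) -> int:
    """Per-store Region count after an even rebalance across all stores."""
    if stores_before <= 0 or stores_added < 0:
        raise ValueError("stores_before must be positive and stores_added non-negative")
    total_regions = regions_per_store * stores_before
    return total_regions // (stores_before + stores_added)

# 25,000 Regions per store is from this thread; 5 stores is a hypothetical cluster size.
print(regions_per_store_after_scale_out(25_000, stores_before=5))
```

The smaller the cluster, the larger the relative drop from a single extra node, which is one reason the same scale-out step can matter more on a small cluster than on a large one.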

| username: 大飞哥online | Original post link

The default is 96 MiB, which suits most workloads.