The distribution of regions and leaders in TiKV is unbalanced, and PD is not scheduling

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv的region、leader分布不均衡,pd不调度

| username: 像风一样的男子

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 5.4.2
[Reproduction Path] The issue was discovered after scaling in TiKV with tiup cluster scale-in.
[Encountered Issue: Symptoms and Impact]
Leaders are unevenly distributed across the TiKV nodes, with one node's leader count having dropped to 0, and the regions are also distributed very unevenly.

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
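
For anyone reproducing this check, a minimal sketch of how to quantify the symptom described above: per-store leader and region counts pulled from PD. The PD address and the `:v5.4.2` version tag are placeholders for the actual cluster, and the filter assumes `jq` is installed locally.

```shell
# Per-store leader/region counts from PD (placeholders: PD address, version tag).
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store \
  | jq '.stores[] | {address: .store.address, leaders: .status.leader_count, regions: .status.region_count}'
```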

| username: xfworld | Original post link

Refer to the SOP:

Follow the steps in the SOP to troubleshoot first, summarize the issue, and if it still can’t be resolved, add some more information and take another look.

| username: 像风一样的男子 | Original post link

I followed the official documentation, and it’s been 2 days, but the TiKV region distribution still hasn’t automatically balanced.

| username: 像风一样的男子 | Original post link

This page 404s.

| username: xfworld | Original post link

No

| username: 像风一样的男子 | Original post link

It displays: "Oops! This page does not exist or is a private page."

| username: xfworld | Original post link

Try accessing this: [SOP Series 19] An incomplete guide to troubleshooting and resolving uneven region distribution | TiDB Community (专栏 - 【SOP 系列 19】region 分布不均问题排查及解决不完全指南 | TiDB 社区)

| username: TiDBer_jYQINSnf | Original post link

Your cluster is quite impressive, with over 200k regions. That’s really awesome, and it must require a large disk. The imbalance is most likely due to the labels. For example, with 3 replicas, the 3 replicas of each region must be placed on 3 different machines. If you have 3 machines A, B, and C, and machine A runs 2 TiKV instances, those 2 instances together hold only one replica of each region. This means the region count of TiKV1 + TiKV2 on A combined will roughly equal the region count on each of the other machines.
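
A quick way to verify this label theory, as a sketch (the PD address is a placeholder): compare PD's `location-labels` with the labels actually attached to each store. If several stores share the same host-level label value, PD treats them as one failure domain and gives them at most one replica of any region.

```shell
# Replication settings, including max-replicas and location-labels (placeholder PD address).
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 config show replication

# Labels actually attached to each store (assumes jq is installed).
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store \
  | jq '.stores[] | {address: .store.address, labels: .store.labels}'
```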

| username: Kongdom | Original post link

First, let’s take a look at each store's score. PD balances by score rather than by raw region count, so the distribution is only considered balanced when the scores are close.
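
For reference, a sketch of how to read those scores out of PD (placeholder PD address; assumes `jq` is installed). The scores already factor in store weight and available space, so two stores with different raw region counts can still be balanced if their scores match.

```shell
# Per-store balance scores used by PD's balance schedulers.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store \
  | jq '.stores[] | {address: .store.address, leader_score: .status.leader_score, region_score: .status.region_score}'
```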

| username: Jiawei | Original post link

It should still be related to the scheduling parameters. Follow the SOP above for the troubleshooting process and confirm whether the specific scheduling parameters are reasonable.
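
The scheduling parameters the SOP has you confirm can be read straight from pd-ctl; a sketch below with a placeholder PD address. A `*-schedule-limit` of 0 disables that kind of scheduling entirely.

```shell
# Pull just the scheduling-related limits out of the PD config (placeholder address).
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 config show \
  | grep -E 'leader-schedule-limit|region-schedule-limit|replica-schedule-limit|max-snapshot-count|max-pending-peer-count'

# If a limit was turned down at some point, it can be raised again, for example:
# tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 config set region-schedule-limit 2048
```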

| username: 像风一样的男子 | Original post link

I added a new node, but found that there is no scheduling at all, and no scoring either.

| username: 像风一样的男子 | Original post link

In my case, there is one TiKV per server; no server runs two TiKV instances. After scaling in one TiKV, the region distribution across the remaining TiKV nodes became unbalanced.

| username: TiDBer_jYQINSnf | Original post link

Show the results of the following commands:
pd-ctl scheduler show
pd-ctl config show
pd-ctl store
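
For the first of those, what matters is whether the balance schedulers are still present; a sketch with a placeholder PD address:

```shell
# balance-leader-scheduler and balance-region-scheduler should both appear here.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 scheduler show

# If one of them was removed earlier, it can be added back:
# tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 scheduler add balance-leader-scheduler
# tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 scheduler add balance-region-scheduler
```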

| username: 像风一样的男子 | Original post link

This file contains several PD parameters.

| username: 考试没答案 | Original post link

Post the output of the store command.

| username: 考试没答案 | Original post link

The store limit settings also control the scheduling rate. Please check them as well.
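
A sketch of checking and loosening those limits (placeholder PD address; the example value is illustrative, not a recommendation):

```shell
# Current per-store scheduling rate limits.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store limit

# Raise the limit on all stores if scheduling is progressing but too slowly, e.g.:
# tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store limit all 30
```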

| username: 像风一样的男子 | Original post link

The results are in this file.

| username: TiDBer_jYQINSnf | Original post link

Version 5.4.2 is not recommended.
There doesn’t seem to be any issue based on what you posted.
If you still want to investigate further, check the PD panel in Grafana and look at the operators to see if there are any create or cancel actions.
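
Besides the Grafana PD → Operator panels, the operators PD is currently running can be listed directly; a sketch with a placeholder PD address. An empty list while the stores are clearly unbalanced usually points back at the limits, labels, or near-equal scores discussed above.

```shell
# Operators currently being executed by PD.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 operator show
```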

| username: 像风一样的男子 | Original post link

I haven’t had time to upgrade the version yet. The picture shows the PD monitoring.

| username: 考试没答案 | Original post link

I think you can wait a bit. You added 2 TiKV nodes at once.