The distribution of regions and leaders in TiKV is unbalanced, and PD is not scheduling

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv的region、leader分布不均衡,pd不调度

| username: 像风一样的男子

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 5.4.2
[Reproduction Path] The issue was discovered after scaling in TiKV with tiup cluster scale-in.
[Encountered Issue: Symptoms and Impact]
Leaders are unevenly distributed across the TiKV nodes, with one node's leader count having dropped to 0, and the regions are also distributed very unevenly.

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
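
For anyone reproducing this check, a minimal sketch of how to quantify the symptom described above: per-store leader and region counts pulled from PD. The PD address and the `:v5.4.2` version tag are placeholders for the actual cluster, and the filter assumes `jq` is installed locally.

```shell
# Per-store leader/region counts from PD (placeholders: PD address, version tag).
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store \
  | jq '.stores[] | {address: .store.address, leaders: .status.leader_count, regions: .status.region_count}'
```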

| username: xfworld | Original post link

Refer to the SOP:

Follow the steps in the SOP to troubleshoot first, summarize the issue, and if it still can’t be resolved, add some more information and take another look.

| username: 像风一样的男子 | Original post link

I followed the official documentation, and it’s been 2 days, but the TiKV region distribution still hasn’t automatically balanced.

| username: 像风一样的男子 | Original post link

This page 404s.

| username: xfworld | Original post link

No

| username: 像风一样的男子 | Original post link

It displays: "Oops! This page does not exist or is a private page."

| username: xfworld | Original post link

Try accessing this: [SOP Series 19] An incomplete guide to troubleshooting and resolving uneven region distribution | TiDB Community (专栏 - 【SOP 系列 19】region 分布不均问题排查及解决不完全指南 | TiDB 社区)

| username: TiDBer_jYQINSnf | Original post link

Your cluster is quite impressive, with over 200k regions. That’s really awesome, and it must require a large disk. The imbalance is most likely due to the labels. For example, with 3 replicas, the 3 replicas of each region must be placed on 3 different machines. If you have 3 machines A, B, and C, and machine A runs 2 TiKV instances, those 2 instances together hold only one replica of each region. This means the region count of TiKV1 + TiKV2 on A combined will roughly equal the region count on each of the other machines.
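
A quick way to verify this label theory, as a sketch (the PD address is a placeholder): compare PD's `location-labels` with the labels actually attached to each store. If several stores share the same host-level label value, PD treats them as one failure domain and gives them at most one replica of any region.

```shell
# Replication settings, including max-replicas and location-labels (placeholder PD address).
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 config show replication

# Labels actually attached to each store (assumes jq is installed).
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store \
  | jq '.stores[] | {address: .store.address, labels: .store.labels}'
```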

| username: Kongdom | Original post link

First, let’s take a look at each store's score. PD balances by score rather than by raw region count, so the distribution is only considered balanced when the scores are close.
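
For reference, a sketch of how to read those scores out of PD (placeholder PD address; assumes `jq` is installed). The scores already factor in store weight and available space, so two stores with different raw region counts can still be balanced if their scores match.

```shell
# Per-store balance scores used by PD's balance schedulers.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store \
  | jq '.stores[] | {address: .store.address, leader_score: .status.leader_score, region_score: .status.region_score}'
```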

| username: Jiawei | Original post link

It should still be related to the scheduling parameters. Follow the SOP above for the troubleshooting process and confirm whether the specific scheduling parameters are reasonable.
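
The scheduling parameters the SOP has you confirm can be read straight from pd-ctl; a sketch below with a placeholder PD address. A `*-schedule-limit` of 0 disables that kind of scheduling entirely.

```shell
# Pull just the scheduling-related limits out of the PD config (placeholder address).
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 config show \
  | grep -E 'leader-schedule-limit|region-schedule-limit|replica-schedule-limit|max-snapshot-count|max-pending-peer-count'

# If a limit was turned down at some point, it can be raised again, for example:
# tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 config set region-schedule-limit 2048
```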

| username: 像风一样的男子 | Original post link

I added a new node, but found that there is no scheduling at all, and no scoring either.

| username: 像风一样的男子 | Original post link

In my case, there is one TiKV per server; no server runs two TiKV instances. After scaling in one TiKV, the region distribution across the remaining TiKV nodes became unbalanced.

| username: TiDBer_jYQINSnf | Original post link

Show the results of the following commands:
pd-ctl scheduler show
pd-ctl config show
pd-ctl store
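
For the first of those, what matters is whether the balance schedulers are still present; a sketch with a placeholder PD address:

```shell
# balance-leader-scheduler and balance-region-scheduler should both appear here.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 scheduler show

# If one of them was removed earlier, it can be added back:
# tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 scheduler add balance-leader-scheduler
# tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 scheduler add balance-region-scheduler
```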

| username: 像风一样的男子 | Original post link

This file contains several PD parameters.

| username: 考试没答案 | Original post link

Post the output of the store command.

| username: 考试没答案 | Original post link

The store limit settings also control the scheduling rate. Please check them as well.
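
A sketch of checking and loosening those limits (placeholder PD address; the example value is illustrative, not a recommendation):

```shell
# Current per-store scheduling rate limits.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store limit

# Raise the limit on all stores if scheduling is progressing but too slowly, e.g.:
# tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store limit all 30
```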

| username: 像风一样的男子 | Original post link

The results are in this file.

| username: TiDBer_jYQINSnf | Original post link

Version 5.4.2 is not recommended.
There doesn’t seem to be any issue based on what you posted.
If you still want to investigate further, check the PD panel in Grafana and look at the operators to see if there are any create or cancel actions.
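
Besides the Grafana PD → Operator panels, the operators PD is currently running can be listed directly; a sketch with a placeholder PD address. An empty list while the stores are clearly unbalanced usually points back at the limits, labels, or near-equal scores discussed above.

```shell
# Operators currently being executed by PD.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 operator show
```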

| username: 像风一样的男子 | Original post link

I haven’t had time to upgrade the version yet. The picture shows the PD monitoring.

| username: 考试没答案 | Original post link

I think you can wait a bit. You added 2 TiKV nodes at once.