No Automatic Balancing After Scaling Out TiKV

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 扩容tikv后为自动均衡

| username: TiDBer_ZfFjmcZo

[TiDB Usage Environment] Testing
[TiDB Version] V6.4
[Reproduction Path] Used scale-out to expand TiKV from 1 instance to 3 instances per node; the scale-out was done twice, adding 3 instances each time (a topology sketch follows this list).
[Encountered Issue: Phenomenon and Impact] After the scale-out, the TiKV status looks normal, but regions are not being balanced automatically. It is unclear whether the new TiKV instances ever became properly active.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
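
For reference, a minimal sketch of what such a scale-out could look like (the hostname, ports, and paths below are illustrative placeholders, not taken from this cluster):

```shell
# scale-out.yaml: add three TiKV instances on one host; each instance needs
# its own port, status_port, and data_dir (all values here are placeholders)
cat > scale-out.yaml <<'EOF'
tikv_servers:
  - host: 192.168.1.145
    port: 20161
    status_port: 20181
    data_dir: /data1/tikv-20161
  - host: 192.168.1.145
    port: 20162
    status_port: 20182
    data_dir: /data2/tikv-20162
  - host: 192.168.1.145
    port: 20163
    status_port: 20183
    data_dir: /data3/tikv-20163
EOF
tiup cluster scale-out <cluster-name> scale-out.yaml
```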



| username: xfworld | Original post link

Why use 6.4 rather than 6.5.x?

If the service is running normally, you just need to wait for the automatic balancing; generating and executing the scheduling operators takes time.
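
While waiting, one way to watch the progress is to poll the per-store region counts with pd-ctl (a sketch; the PD address and version tag are placeholders):

```shell
# Poll PD every 30 s and print each store's address and region count;
# the counts should gradually converge if balancing is making progress
watch -n 30 'tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 store \
  | grep -E "\"address\"|\"region_count\""'
```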

| username: buddyyuan | Original post link

Check the scheduling status and the PD config to see whether the scheduling parameters are set correctly.
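
For example, a quick way to dump both with pd-ctl (a sketch; the PD address and version tag are placeholders):

```shell
# Scheduling config: look at region-schedule-limit / replica-schedule-limit
# and make sure none of the *-schedule-limit values has been set to 0
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 config show

# Schedulers: balance-region-scheduler and balance-leader-scheduler
# should both appear in the list
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 scheduler show
```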

| username: TiDBer_ZfFjmcZo | Original post link

I installed it a long time ago, and back then the latest version was 6.4. It’s been a whole night already, and with just over 100GB of data, it shouldn’t take this long, right?

| username: buddyyuan | Original post link

You can check the Grafana panel cluster-pd → Operator → Schedule operator create to see whether any scheduling operators are being generated.
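
The pd-ctl equivalent, if the Grafana panel is inconvenient (a sketch; the PD address and version tag are placeholders):

```shell
# List the operators PD is currently running; balance-region operators
# here mean scheduling is being generated and executed
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 operator show
```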

| username: h5n1 | Original post link

It looks like your monitoring has had no data for a long time; try restarting Prometheus. Also check leader_count and region_count in information_schema.tikv_store_status to see whether they are balanced.
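
A sketch of that query (the connection parameters are placeholders):

```shell
# Compare leader/region counts across stores; large gaps between stores
# of similar capacity suggest balancing has not completed
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT STORE_ID, ADDRESS, STORE_STATE_NAME, LEADER_COUNT, REGION_COUNT
  FROM   information_schema.TIKV_STORE_STATUS
  ORDER  BY STORE_ID;"
```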

| username: TiDBer_ZfFjmcZo | Original post link

There is scheduling.

| username: TiDBer_ZfFjmcZo | Original post link

The parameters after the scale-out are as shown in the screenshot, so there should be no problem, right? And can multiple TiKV instances be deployed on a single physical machine? I remember that as long as the ports are different it should work, right?

| username: tidb菜鸟一只 | Original post link

Check PD → Statistics - Balance to view each store's region score and the number of regions on each node.
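
The same scores can also be pulled via SQL if the Grafana panel is unavailable (a sketch; the connection parameters are placeholders):

```shell
# PD balances on scores rather than raw counts; compare REGION_SCORE per store
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT STORE_ID, ADDRESS, REGION_SCORE, LEADER_SCORE, REGION_COUNT
  FROM   information_schema.TIKV_STORE_STATUS
  ORDER  BY REGION_SCORE DESC;"
```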

| username: TiDBer_ZfFjmcZo | Original post link

There is monitoring data, and the results in the screenshot match what I get from querying the system table.

| username: TiDBer_ZfFjmcZo | Original post link

Indeed, the scores are unbalanced. Each node runs 3 TiKV instances, yet 4 stores show up. It is like node 145 in the screenshot: one score is very high and the other two are very low.

| username: h5n1 | Original post link

Are any labels set? Were labels set on the newly added TiKV instances?
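
For context: with several TiKV instances on one physical machine, PD is usually told which stores share a host, otherwise replicas of the same region can land on the same machine. A sketch of setting labels with pd-ctl (the store IDs, label values, PD address, and version tag are placeholder assumptions):

```shell
# Declare "host" as the isolation level PD should respect
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 config set location-labels host

# Tag each store with the physical machine it runs on (IDs/values are examples)
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 store label 1 host host-145
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 store label 4 host host-145

# Verify: the labels field of each store should now be populated
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 store
```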

| username: Kongdom | Original post link

It looks like a PD leader switch occurred in your graph. Try switching the instance in the top-left corner and check again.

| username: dba-kit | Original post link

It’s actually quite normal for 100 GB to still be catching up after one night. Try raising the add-peer speed with pd-ctl store limit; by default, TiKV scheduling is indeed slow. I recall the default value is 150; you can raise it to 2000 and then revert it once the data has caught up.
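
A sketch of that adjustment (the PD address and version tag are placeholders; the exact default rate depends on the version, so check it first):

```shell
# Show the current per-store scheduling limits
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 store limit

# Temporarily raise the add-peer rate for all stores to speed up the catch-up
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 store limit all 200 add-peer

# Revert afterwards (substitute whatever "store limit" showed originally)
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 store limit all 15 add-peer
```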

| username: dba-kit | Original post link

For the specific operations, you can refer to: PD Control User Guide | PingCAP Documentation Center

| username: TiDBer_ZfFjmcZo | Original post link

No labels are configured, and none were configured before either. I checked meta.yaml and there don’t seem to be any issues.

| username: TiDBer_ZfFjmcZo | Original post link

It seems like there’s no difference.

| username: TiDBer_ZfFjmcZo | Original post link

It seems the new TiKV instances never really came up. I reloaded the data, and the regions on the scaled-out TiKV instances still did not increase.
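
One way to verify whether the new instances actually registered as Up (a sketch; the PD address and version tag are placeholders):

```shell
# Each scaled-out instance should appear with "state_name": "Up";
# Offline/Down/Disconnected here would explain the missing regions
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 store \
  | grep -E '"address"|"state_name"'
```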

| username: 裤衩儿飞上天 | Original post link

Check if there are many empty regions.
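
A sketch of counting them with pd-ctl (the PD address and version tag are placeholders):

```shell
# List the regions PD classifies as empty; a huge list here would mean
# the balancer is mostly shuffling regions that hold no data
tiup ctl:v6.4.0 pd -u http://<pd-host>:2379 region check empty-region
```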

| username: TiDBer_ZfFjmcZo | Original post link

There aren’t many empty regions, but the regions that do contain data take up a lot of space.