How to Quickly Replace a TiKV Machine?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何快速替换一台TiKV机器?

| username: dba-kit

Background: Due to machine adjustments, a new TiKV machine D needs to be added to replace another TiKV node A in the current cluster.
Question: How to replace the machine with minimal data migration?
Expectation: Region migration should only involve nodes A and D, and regions on other nodes (B, C) should remain unchanged.

PS: A straightforward scale-out + scale-in certainly works, but after the scale-out, regions from nodes A, B, and C will all be rebalanced onto the new node D. During the scale-in, the regions remaining on node A will then be migrated to nodes B, C, and D, resulting in many unnecessary region migrations.
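
For reference, the standard route with TiUP looks roughly like this (a sketch only; the cluster name mycluster, the topology file, and node A's address are placeholders):

    # Scale out: add new TiKV node D as described in scale-out.yaml
    tiup cluster scale-out mycluster scale-out.yaml
    # Scale in: decommission TiKV node A once D is online
    tiup cluster scale-in mycluster --node 10.0.0.1:20160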

| username: dba-kit | Original post link

Currently, pd-ctl does not have a scheduler that meets my needs. The closest is shuffle-region-scheduler, but it schedules randomly and does not support migrating regions between two specific stores.
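
That said, pd-ctl can move a single region's peer between two specific stores as a one-off operator; it is not an ongoing scheduler, but scripted over every region on store A it would approximate an A-to-D migration. A sketch with made-up region and store IDs:

    # Inside pd-ctl: move the peer of region 1234 from store 1 (node A) to store 4 (node D)
    operator add transfer-peer 1234 1 4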

| username: tidb菜鸟一只 | Original post link

Normally, if you scale out and then immediately scale in, the regions on node A will only move to node D; they won’t move to nodes B or C…
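
If you go this route, you can verify where the regions actually land by checking the per-store region counts in pd-ctl (the PD address is a placeholder):

    # Show region and leader counts for every store
    pd-ctl -u http://<pd-host>:2379 store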

| username: zhanggame1 | Original post link

It seems that besides scaling out and in, there aren’t many good options.

| username: redgame | Original post link

Scale out and scale in; nothing else.

| username: Jellybean | Original post link

Directly moving the disks to the new machine and changing its IP to the old machine’s IP might be a feasible approach.

We have done similar things with MySQL dual-master architecture, but not with TiDB.

With TiDB, we usually scale out first and then immediately scale in. The overall cluster remains relatively stable, with minimal impact.

| username: 天蓝色的小九 | Original post link

Scale out, then scale in.

| username: Fly-bird | Original post link

First scale out, then scale in.

| username: Kongdom | Original post link

First scale out, then scale in.

There is another method I haven’t tried in practice: first scale out, then evict the leaders from the old node, and then scale in. It feels about the same.
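
A minimal sketch of the evict-leader variant, assuming node A is store 1 (the store ID is a placeholder):

    # Inside pd-ctl: move all Raft leaders off store 1 (node A) before scaling in
    scheduler add evict-leader-scheduler 1
    # After node A has been removed, drop the scheduler
    scheduler remove evict-leader-scheduler-1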

| username: 像风一样的男子 | Original post link

The official documentation only provides the scale-out/scale-in method. Other unorthodox techniques, such as swapping the server and copying the disk data over while keeping the IP unchanged, have not been tested. To be safe, it’s better to follow the official guidelines.

| username: cassblanca | Original post link

The official docs don’t provide any other tricks either. Scaling out first and then scaling in seems more reliable.

| username: 大飞哥online | Original post link

There may be clever tricks to explore, but scaling out and in is safer.

| username: anteguitado | Original post link

You can try this method (a command sketch follows the list):

  1. Set labels for A/B/C as host=A, host=B, host=C respectively.
  2. Set the isolation level to host:
     pd-ctl config set location-labels host
     pd-ctl config set isolation-level host
  3. Add node D and set its label to host=A.
  4. Decommission node A.
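
A minimal sketch of the full sequence, assuming stores 1/2/3 belong to nodes A/B/C, the cluster is named mycluster, and scale-out.yaml gives node D the TiKV config server.labels: { host: A } — the store IDs, cluster name, and addresses are all placeholders:

    # Inside pd-ctl: label the existing stores by host
    store label 1 host A
    store label 2 host B
    store label 3 host C
    # Make host a placement label and a hard isolation requirement
    config set location-labels host
    config set isolation-level host
    # From the shell: scale out node D (labeled host=A), then decommission node A
    tiup cluster scale-out mycluster scale-out.yaml
    tiup cluster scale-in mycluster --node <node-A-ip>:20160

The intent is that with isolation-level set to host, PD will not place two replicas of a region on stores sharing the same host label, so the replicas on B and C stay put and the replicas on A have nowhere to go but D.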

| username: dba-kit | Original post link

:thinking: That’s indeed a good idea. I’ll experiment with it when I have time. However, the production environment has already been handled using the standard scale-out/scale-in method.

| username: dba-kit | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.