Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 如何快速替换一台TiKV机器?
Background: Due to machine adjustments, a new TiKV machine D needs to be added to replace another TiKV node A in the current cluster.
Question: How can I replace the machine with minimal data migration?
Expectation: Region migration should only involve nodes A and D, and regions on other nodes (B, C) should remain unchanged.
PS: A direct scale-out followed by a scale-in certainly works, but after the scale-out, regions from nodes A, B, and C will all be balanced onto the new node D. Then, during the scale-in, the regions remaining on node A will be migrated to nodes B, C, and D, resulting in many unnecessary region migrations.
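To put a rough number on that concern, here is a back-of-the-envelope sketch. It is an idealized count, not PD's actual scheduling logic. It assumes 3 replicas on 3 stores (so every region has a peer on A, B, and C), balancing purely by peer count, and that balancing onto D runs to completion before the scale-in starts. The function names are made up for illustration.

```python
def moves_scale_out_then_in(regions: int) -> int:
    """Peer moves when D is added, fully balanced, and A is then removed.

    Assumes 3 replicas and that balancing finishes before scale-in.
    """
    total_peers = 3 * regions
    # With 4 balanced stores, each store (including D) holds a quarter of all peers.
    peers_per_store = total_peers // 4
    moved_onto_d = peers_per_store      # peers D receives from A, B, and C
    moved_off_a = peers_per_store       # peers still on A when it is scaled in
    return moved_onto_d + moved_off_a


def moves_direct_replacement(regions: int) -> int:
    """Peer moves if every peer on A goes straight to D and nothing else moves."""
    return regions  # A holds exactly one peer of every region


if __name__ == "__main__":
    n = 1200  # hypothetical region count
    print(moves_scale_out_then_in(n), moves_direct_replacement(n))
```

Under these assumptions the two-step approach moves about 1.5 times as many peers as a direct A-to-D replacement. If the scale-in starts before balancing has progressed far, the gap is smaller.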
             
Currently, pd-ctl does not have a scheduler that meets my needs. The closest is shuffle-region-scheduler, but it schedules randomly and does not support migrating regions between two specific stores.
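Although there is no scheduler for this, pd-ctl does have a per-region operator, `operator add transfer-peer <region_id> <from_store_id> <to_store_id>`, that moves one peer between two specific stores. A minimal sketch of generating those commands follows. The store IDs and region IDs are hypothetical (the real region list would come from something like `region store <store_id>`), the helper name is made up, and this approach has not been tested in this thread. PD's balance schedulers may still move regions afterwards.

```python
from typing import Iterable, List


def transfer_peer_commands(region_ids: Iterable[int], from_store: int, to_store: int) -> List[str]:
    """Build one pd-ctl `operator add transfer-peer` line per region."""
    return [
        f"operator add transfer-peer {region_id} {from_store} {to_store}"
        for region_id in region_ids
    ]


# Hypothetical IDs: store 1 is node A, store 4 is node D.
commands = transfer_peer_commands([101, 102, 103], from_store=1, to_store=4)

if __name__ == "__main__":
    # Each line would be run inside pd-ctl (or passed to it as arguments).
    print("\n".join(commands))
```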
             
Normally, if you scale out and then immediately scale in, the regions on node A will only move to node D; they won’t move to nodes B or C…
             
It seems that besides scaling out and then scaling in, there aren’t many good solutions.
             
Scale out and then scale in; there is no other way.
             
Directly moving the disk to a new machine and changing the new machine’s IP to the old IP might be a possible approach.
We have done similar things with MySQL dual-master architecture, but not with TiDB.
With TiDB, we usually scale out first and then immediately scale in. The overall cluster remains relatively stable, with minimal impact.
             
First scale out, then scale in.
             
First scale out, then scale in.
There is another method that I haven’t tried in practice: first scale out, then evict the leaders from the old node, and then scale in. It feels about the same.
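A rough sketch of that three-step order as a command list. The cluster name, version, PD address, topology file, node address, and store ID below are all hypothetical placeholders, and the helper name is made up. The commands themselves (`tiup cluster scale-out`, pd-ctl's `scheduler add evict-leader-scheduler`, `tiup cluster scale-in --node`) are the standard ones, but the sequence as a whole is untested here.

```python
from typing import List


def replacement_steps(cluster: str, version: str, pd_addr: str,
                      topology_file: str, old_node: str, old_store_id: int) -> List[str]:
    """Return the shell commands for: scale out, evict leaders, scale in."""
    pd_ctl = f"tiup ctl:{version} pd -u {pd_addr}"
    return [
        # 1. Add the new TiKV node (D) described in the topology file.
        f"tiup cluster scale-out {cluster} {topology_file}",
        # 2. Move all region leaders off the old store (A).
        f"{pd_ctl} scheduler add evict-leader-scheduler {old_store_id}",
        # 3. Take the old TiKV node out of the cluster.
        f"tiup cluster scale-in {cluster} --node {old_node}",
    ]


# Hypothetical values for illustration only.
steps = replacement_steps(
    cluster="my-cluster",
    version="v7.5.0",
    pd_addr="http://10.0.1.1:2379",
    topology_file="scale-out.yaml",
    old_node="10.0.1.5:20160",
    old_store_id=1,
)

if __name__ == "__main__":
    print("\n".join(steps))
```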
             
The official documentation only provides methods for scaling out and scaling in. Other advanced techniques, such as swapping servers or copying the disk data while keeping the server IP unchanged, have not been tested. To be safe, it’s better to follow the official guidelines.
             
The official documentation doesn’t provide any other advanced techniques either. Sticking with scale-out and scale-in seems more reliable.
             
There are newer techniques to explore, but it’s safer to scale out and then scale in.
             
 That’s indeed a good idea. I’ll experiment with it when I have time. However, the online environment has already been set up using the standard scaling method.
             
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.