Is it safe to restart servers for maintenance in a TiKV cluster?

If I need to perform system maintenance on the underlying servers running a multi-node TiKV cluster with a lot of data on each node, is it safe to simply shut down and restart the operating system on each node (one at a time) and let the TiKV cluster recover by itself? Or is it necessary to fully drain and remove each node from the cluster one at a time and add it back afterwards, which could take many days and cause a lot of disk I/O and network load?

I have not tried rolling reboots of cluster nodes for OS maintenance, for fear of data corruption.

Restarting servers for maintenance in a TiKV cluster can be safe if done properly. When performing system maintenance on the underlying servers running a multi-node TiKV cluster, you can follow certain steps to ensure the safety and stability of the cluster:

  1. Before restarting a node, make sure the remaining TiKV nodes have enough spare capacity to carry the load while that node's data is temporarily unavailable. Check the free disk space on the other nodes and confirm that the cluster is healthy before you begin.

  2. If the Placement Driver (PD) nodes also need to be restarted, restart them before the TiKV nodes so that TiKV can reconnect to PD. You can use the following command to restart a single TiKV node, where <host>:<port> is the address of that node as shown by tiup cluster display <cluster-name>:

    tiup cluster restart <cluster-name> --node <host>:<port>

    Restarting in this order helps maintain connectivity and synchronization within the cluster.

  3. Monitoring the cluster during and after the maintenance is crucial. You can use the cluster's Grafana dashboards to watch metrics such as the creation and completion of data migration tasks. This helps you confirm that the cluster is functioning properly after the restart.

  4. If you encounter issues during or after the restart, such as slow write performance, you can troubleshoot by checking the system logs, adding an evict-leader scheduler to move leaders off the affected node, or temporarily stopping the TiKV process on the problematic node. This can help you identify and resolve the underlying cause of the performance degradation.
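
Applied to an OS reboot rather than a plain process restart, the steps above could be sketched roughly as the loop below: stop TiKV on one node, do the maintenance, start it again, and check the cluster before moving on. This is only an illustration under assumptions: the cluster is managed with tiup cluster, my-cluster and the node addresses are placeholders, and the DRY_RUN switch and function names are made up for this example. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a one-node-at-a-time maintenance loop.
# Assumptions: CLUSTER and the node addresses are placeholders;
# DRY_RUN and the function names are illustrative, not part of tiup.
CLUSTER="${CLUSTER:-my-cluster}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, do not run them

# Print a command, and execute it only when DRY_RUN is not 1.
run() {
  echo "$*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Stop TiKV on one node, pause for the OS maintenance, then start it again.
maintain_node() {
  node="$1"   # host:port of the TiKV instance, as listed by `tiup cluster display`
  run tiup cluster stop "$CLUSTER" --node "$node"
  if [ "$DRY_RUN" != "1" ]; then
    printf 'Reboot or patch %s now, then press Enter to continue: ' "$node"
    read -r answer
  fi
  run tiup cluster start "$CLUSTER" --node "$node"
  # Show instance status so you can confirm everything is Up
  # before touching the next node.
  run tiup cluster display "$CLUSTER"
}

# Process the nodes strictly one at a time (placeholder addresses).
for n in 10.0.1.1:20160 10.0.1.2:20160 10.0.1.3:20160; do
  maintain_node "$n"
done
```

Run it once as is to review the printed plan, then set DRY_RUN=0 to execute it. Keeping each node's downtime short matters, because PD may start re-creating that node's replicas elsewhere if a store stays down for longer than its configured down-time limit.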

By following these steps and monitoring the cluster closely, you can safely restart servers for maintenance in a TiKV cluster without risking data corruption. However, it is always recommended to have backups in place before performing any maintenance activities to mitigate any unforeseen issues.
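
For the leader eviction and store checks mentioned above, a minimal sketch using PD Control through tiup might look like the following. The PD address, the component version, and the store ID are placeholders, and DRY_RUN and the helper names are invented for this example. Check the exact scheduler names against the PD Control documentation for your version before relying on them.

```shell
#!/bin/sh
# Sketch: pd-ctl commands for inspecting stores and moving leaders off one store.
# Assumptions: PD_ADDR, CTL_VERSION and the store ID are placeholders;
# DRY_RUN and the function names are illustrative only.
PD_ADDR="${PD_ADDR:-http://10.0.1.10:2379}"
CTL_VERSION="${CTL_VERSION:-v7.5.0}"   # should match your cluster version
DRY_RUN="${DRY_RUN:-1}"                # 1 = only print the commands

# Print the pd-ctl invocation, and run it only when DRY_RUN is not 1.
pdctl() {
  echo "tiup ctl:$CTL_VERSION pd -u $PD_ADDR $*"
  if [ "$DRY_RUN" != "1" ]; then
    tiup "ctl:$CTL_VERSION" pd -u "$PD_ADDR" "$@"
  fi
}

# List all stores with their IDs, state, and leader/region counts.
list_stores() { pdctl store; }

# Ask PD to move Region leaders away from the given store ID.
evict_leaders() { pdctl scheduler add evict-leader-scheduler "$1"; }

# Remove the eviction again once the node is back and healthy.
restore_leaders() { pdctl scheduler remove "evict-leader-scheduler-$1"; }

list_stores
evict_leaders 4      # 4 is a placeholder store ID
restore_leaders 4
```

Evicting leaders before stopping a node keeps client requests from hitting a store that is about to go away. Remember to remove the scheduler afterwards, otherwise that store will not receive leaders again.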