Do I need to remove a host from the TiDB cluster if it goes down?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb集群中1台主机停机需要将其剔除集群吗?

| username: xiaohetao

If I stop all the TiKV instances on one host while the business keeps running, does that host need to be removed from the cluster?

| username: Jolyne | Original post link

As I understand it, you are only changing hardware, not scaling in. If the number of TiKV instances still meets the minimum requirement after the shutdown, there will be no impact, and there is no need to remove the instances from the cluster.

| username: zhimadi | Original post link

tiup cluster stop <cluster-name> --node <host:port>

| username: hey-hoho | Original post link

When you say changing hardware, do you mean replacing the TiKV disks?

| username: 像风一样的男子 | Original post link

What hardware needs to be replaced? If it is the disk, do these three TiKV instances need to be scaled in?

| username: xiaohetao | Original post link

Change the CPU

| username: xiaohetao | Original post link

Yes, just changing the hardware without scaling in.
The cluster topology: apart from the control node, there are 6 hosts in total. 3 hosts each run 2 TiDB instances and 1 PD instance, and the other 3 hosts each run 3 TiKV instances. Shutting down this one host will stop 3 TiKV instances.

| username: xiaohetao | Original post link

Replace CPU

| username: hey-hoho | Original post link

Then just stop it. Increase the PD parameter max-store-down-time (default: 30 minutes) to cover the time needed for your hardware replacement. Once the machine is fixed, just start TiKV again.
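
For illustration, the stop and start steps could look like the sketch below. It only prints the tiup commands so they can be reviewed before anyone runs them. The cluster name, host IP, and ports are hypothetical placeholders, not values from this thread; the real node IDs come from `tiup cluster display <cluster-name>`.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the tiup commands instead of executing them.
# CLUSTER and NODES are hypothetical; replace them with your own values.
CLUSTER="my-cluster"
NODES="10.0.1.4:20160,10.0.1.4:20161,10.0.1.4:20162"

# Print the tiup command that stops or starts the TiKV instances on the host.
node_cmd() {
  # $1 is the tiup cluster subcommand: "stop" or "start"
  echo "tiup cluster $1 $CLUSTER --node $NODES"
}

node_cmd stop                          # before powering the host off
node_cmd start                         # after the host is back online
echo "tiup cluster display $CLUSTER"   # then confirm the stores are Up again
```

The `--node` flag takes a comma-separated list of node IDs, so the three instances on the host can be handled in one command.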

| username: zhanggame1 | Original post link

Replacing hardware on the host requires downtime. There’s no need to remove it from the cluster. Just shut down the services on this machine, and it will automatically rejoin the cluster after rebooting. There’s no need to adjust PD’s max-store-down-time parameter either. In our test cluster, TiKV often goes down, and I’m used to it.

| username: Kongdom | Original post link

It should not need to be removed. I recommend making a backup first. If possible, it is best to stop the entire cluster and then replace the hardware.

| username: xiaohetao | Original post link

Can the max-store-down-time parameter be modified online without any impact on the cluster?

| username: hey-hoho | Original post link

Sure, you can modify it using pd-ctl.
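
A minimal sketch of what that might look like when pd-ctl is run through tiup. It prints the commands rather than running them. The version, PD address, and the 2h window are assumptions for illustration only; use your cluster's version, a real PD endpoint, and a duration that fits your maintenance window.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints pd-ctl commands for review instead of executing them.
# VERSION, PD_ADDR and the 2h value are hypothetical; substitute your own.
VERSION="v7.5.0"
PD_ADDR="http://10.0.1.1:2379"

# Build a pd-ctl invocation (via tiup) for the given pd-ctl arguments.
pd_ctl_cmd() {
  echo "tiup ctl:$VERSION pd -u $PD_ADDR $*"
}

pd_ctl_cmd config show                          # check the current value first
pd_ctl_cmd config set max-store-down-time 2h    # raise it for the maintenance
pd_ctl_cmd config set max-store-down-time 30m   # restore the default afterwards
```

The change is made online through PD, so no component restart is involved.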

| username: xiaohetao | Original post link

Okay, I’ll find an environment to try it out.

| username: dba远航 | Original post link

Just stop the processes, replace the CPU, and start them again once the replacement is done.

| username: 普罗米修斯 | Original post link

If a single host runs 3 TiKV instances and the labels are set at the host level, taking that host down will not affect the cluster. You can simply increase max-store-down-time.
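
The reasoning can be checked with a little arithmetic. The sketch below assumes the default of 3 replicas per Region and host-level location labels, so each Region has at most one replica on any one host. Neither setting is confirmed in this thread.

```shell
#!/usr/bin/env bash
# Assumptions (not stated in the thread): max-replicas is the default 3 and
# labels are at host level, so a Region keeps at most one replica per host.
REPLICAS=3
REPLICAS_LOST=1   # one host down => at most one replica of any Region is lost

MAJORITY=$(( REPLICAS / 2 + 1 ))          # Raft needs more than half
REMAINING=$(( REPLICAS - REPLICAS_LOST ))

if [ "$REMAINING" -ge "$MAJORITY" ]; then
  echo "quorum kept: $REMAINING of $REPLICAS replicas (need $MAJORITY)"
else
  echo "quorum lost: $REMAINING of $REPLICAS replicas (need $MAJORITY)"
fi
```

Without host-level labels, two replicas of the same Region could sit on the stopped host. Setting REPLICAS_LOST=2 shows that case losing quorum.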
