Issues with Maintenance and Shutdown of Faulty TiDB Nodes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb故障节点停机维护问题

| username: LBX流鼻血

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1
Dear experts,
One of the nodes has a memory failure, and the data center says the machine needs to be shut down to replace the memory. I looked up some information: some say it can be stopped directly, while others say only the TiDB node can be stopped directly, and the other components need to be scaled in and taken offline first, then scaled out again afterwards. But I have multiple nodes; can I just stop the machine directly, replace the memory, and then bring it back up?

| username: TiDBer_jYQINSnf | Original post link

No, you can’t do that blindly. If you have many nodes without labels to group them, the replicas are intermingled across machines: if any two TiKV nodes go down at the same time, some regions will lose the majority of their replicas and become unavailable.
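A quick way to check whether the stores are grouped at all is to look at PD's replication settings and the labels on each store. A minimal sketch, assuming a v6.1.0 cluster; the PD endpoint is a placeholder:

```shell
# Show max-replicas and location-labels; empty location-labels means no grouping is configured.
tiup ctl:v6.1.0 pd -u http://10.0.1.1:2379 config show replication

# Each store's "labels" field shows how (or whether) it is grouped.
tiup ctl:v6.1.0 pd -u http://10.0.1.1:2379 store
```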

| username: TiDBer_QYr0vohO | Original post link

How did you deploy it? Is it a hybrid deployment?

| username: Jellybean | Original post link

Judging from the screenshot the original poster shared, the failed machine hosts PD, TiDB, and TiKV nodes.

For the TiDB layer behind the frontend load balancer, you can remove this node from the load balancer and go ahead with maintenance directly. If the PD leader is on another healthy machine, you can also maintain the failed PD node directly, but restore it and bring it back online promptly so that an odd number of PD nodes (3) stays alive.
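If it helps, a small sketch for confirming where the PD leader is before touching the failed PD node; the endpoint is a placeholder for any healthy PD:

```shell
# List PD members and the current leader.
tiup ctl:v6.1.0 pd -u http://10.0.1.1:2379 member

# Check the health state of every PD member.
tiup ctl:v6.1.0 pd -u http://10.0.1.1:2379 health
```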

If three TiKV nodes fail at once, it is most likely because they are deployed together on the same machine. Apart from the momentary interruption to requests hitting those TiKV instances, other accesses should be normal: the leaders on the failed machine will quickly switch to surviving region replicas on other nodes, so subsequent cluster services are not disrupted. Since this cluster still has plenty of TiKV nodes, after more than 30 minutes the cluster will automatically replenish the data back to three replicas on other machines. You can proceed with maintenance directly, and the node will automatically rejoin the cluster once restored.
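That 30-minute window is controlled by PD's max-store-down-time (default 30m). A hedged sketch for checking it, and for temporarily extending it if the repair will take longer and you do not want replicas rebuilt elsewhere in the meantime; the endpoint and the 2h value are examples only:

```shell
# Check the current setting (config show prints JSON).
tiup ctl:v6.1.0 pd -u http://10.0.1.1:2379 config show | grep max-store-down-time

# Temporarily extend it for the maintenance window; remember to set it back to 30m afterwards.
tiup ctl:v6.1.0 pd -u http://10.0.1.1:2379 config set max-store-down-time 2h
```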

After the machine is restored, theoretically, the nodes will automatically start and join the cluster. Just ensure thorough inspection.

| username: Kongdom | Original post link

I think the failed nodes are already unreachable, and tiup probably cannot manage them anyway. You can go ahead with shutdown maintenance directly.
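To verify that, a quick status check; the cluster name is a placeholder:

```shell
# Nodes on the failed machine should show up as Down / Disconnected / N/A.
tiup cluster display mycluster
```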

| username: zhanggame1 | Original post link

If there’s a memory failure, just replace the memory and it should be fine. Generally, it will automatically recover, and you don’t need to do anything.

| username: YuchongXU | Original post link

It is recommended to deploy separately.

| username: 像风一样的男子 | Original post link

With a memory failure, this server definitely cannot be used. The nodes on it are all disconnected anyway, so there is no need to scale them in; you can shut the machine down for maintenance directly.

| username: 呢莫不爱吃鱼 | Original post link

Just do it, no big deal.

| username: jiayou64 | Original post link

I think there are two situations: 1. the memory failure caused the machine to crash; 2. it was only a memory fault and the system restarted on its own.
For the former, you can replace the physical memory directly. For the latter, you should use tiup to bring the node back to a normal state before proceeding.
When a TiKV node cannot function properly, its status changes to Disconnected, and after 30 minutes it changes to Down.

  1. TiKV node maintenance:
    First set the leader weight of the TiKV node to be serviced to 0, and add a scheduling task that evicts its leaders to other nodes. At that point you can safely stop the TiKV service; after maintenance, start it again, restore the weight, and remove the scheduling task (see the command sketch below).
  2. PD node maintenance:
    Transfer the member leader to another node, delete the PD node to be serviced, and after recovery clear its cache and rejoin it to the cluster (see the command sketch below).
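A rough sketch of the commands behind the two procedures above. The store ID (4), PD member name (pd-2), addresses, cluster name, and topology file are all placeholders to replace with your own values:

```shell
PD=http://10.0.1.1:2379                                            # any healthy PD endpoint

# --- 1. TiKV node maintenance ---
tiup ctl:v6.1.0 pd -u $PD store                                    # find the store ID of the TiKV to service
tiup ctl:v6.1.0 pd -u $PD store weight 4 0 1                       # set its leader weight to 0, keep region weight
tiup ctl:v6.1.0 pd -u $PD scheduler add evict-leader-scheduler 4   # evict its leaders to other nodes
tiup cluster stop mycluster -N 10.0.1.4:20160                      # stop just this TiKV instance
# ... hardware maintenance ...
tiup cluster start mycluster -N 10.0.1.4:20160
tiup ctl:v6.1.0 pd -u $PD store weight 4 1 1                       # restore the leader weight
tiup ctl:v6.1.0 pd -u $PD scheduler remove evict-leader-scheduler-4

# --- 2. PD node maintenance ---
tiup ctl:v6.1.0 pd -u $PD member leader transfer pd-2              # only needed if the failed PD is the leader
tiup cluster scale-in mycluster -N 10.0.1.4:2379                   # remove the PD instance on the failed machine
# ... hardware maintenance, then rejoin it with a scale-out topology file ...
tiup cluster scale-out mycluster scale-out-pd.yaml
```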

| username: TiDBer_HUfcQIJx | Original post link

You can stop directly.

| username: lemonade010 | Original post link

Just do it, there’s no problem with TiDB, but it looks like your TiKV instances are mixed on the same machine, which isn’t very good.

| username: lemonade010 | Original post link

If I’m not mistaken, you have PD, TiDB, and several TiKV instances on one machine. This is quite risky: with three replicas, two of them could end up on the machine you are replacing.

| username: tidb菜鸟一只 | Original post link

The key issue is that this machine seems to have already crashed: one PD instance, one TiDB instance, and three TiKV instances are down. If you have set up label isolation, or you are lucky, the cluster may still be working normally; in that case you can simply replace the memory on this machine and restart it. If the cluster is no longer working, wait for this machine to restart and see whether it can recover.

| username: WinterLiu | Original post link

I think directly maintaining and restarting is sufficient, no need to take it offline and then online again.

| username: yytest | Original post link

TiDB Node:
TiDB nodes are stateless and can be directly stopped and have their memory replaced. After replacement, simply restart the node. Other TiDB nodes in the cluster can continue to handle requests.
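For example, a sketch of stopping and starting only the TiDB instance on the failed host with tiup; the cluster name and address are placeholders:

```shell
tiup cluster stop mycluster -N 10.0.1.4:4000     # stop only this TiDB instance
# ... replace the memory ...
tiup cluster start mycluster -N 10.0.1.4:4000
tiup cluster display mycluster                   # confirm the instance is Up again
```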

TiKV Node:
TiKV nodes are stateful, and directly stopping them can reduce data availability or cluster performance. It is recommended to first use TiUP or another cluster management tool to scale in the TiKV node, wait for its data to migrate to other nodes, and then replace the memory. After replacement, use the cluster management tool to scale out and rejoin the node to the cluster.
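A sketch of that scale-in / scale-out flow; the address, cluster name, and topology file are placeholders:

```shell
tiup cluster scale-in mycluster -N 10.0.1.4:20160   # mark the store offline; wait until it becomes Tombstone
tiup cluster display mycluster                      # watch region migration progress
tiup cluster prune mycluster                        # clean up the Tombstone store
# ... replace the memory, then rejoin the node ...
tiup cluster scale-out mycluster scale-out-tikv.yaml
```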

PD Node:
PD nodes are also critical components of the cluster, and directly stopping them may affect the cluster’s scheduling and management functions. It is recommended to first use the cluster management tool to scale down the PD node, wait for the leader to transfer to another node, and then proceed with memory replacement. After replacement, scale up again.
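The PD flow mirrors the TiKV one; a sketch with placeholder names and addresses, transferring the leader first only if the failed PD currently holds it:

```shell
tiup ctl:v6.1.0 pd -u http://10.0.1.1:2379 member leader transfer pd-2
tiup cluster scale-in mycluster -N 10.0.1.4:2379
# ... replace the memory, then add the PD back ...
tiup cluster scale-out mycluster scale-out-pd.yaml
```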

| username: TiDBer_RjzUpGDL | Original post link

It looks like it can be stopped directly.

| username: tony5413 | Original post link

The problem is not clearly described; it is unclear whether this is a mixed deployment.