Can I directly stop a TiKV node with disk issues, replace the disk, and then restart it?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点磁盘异常,可以直接停止该节点 ,然后替换磁盘 重启吗

| username: TiDBer_an


I have a TiKV node (2 instances) deployed on Alibaba Cloud ECS that currently has a disk alert. The disk needs replacement and maintenance, after which it will be initialized. Can I directly stop the two instances on that node, restart them once maintenance is done, and wait for the cluster to rebalance the data internally?

| username: WalterWj | Original post link

Use scale-in and scale-out.

| username: TiDBer_an | Original post link

Scaling in and out requires waiting for the data to migrate off the node before it goes offline, which takes some time. TiDB keeps multiple replicas, so is there any risk in the operation I described?

| username: WalterWj | Original post link

Does disk initialization mean the disk is completely formatted? In that case, the deployment directory and everything else on it would be wiped, and the instances would not be able to start after the server is restored. It's better to do a normal scale-in and scale-out.

| username: TiDBer_an | Original post link

Only the mounted data disk is initialized. I can back up the deployment directory to the system disk, which will not be formatted.

| username: WalterWj | Original post link

:thinking: That seems feasible. Remember to increase the max down time in PD (max-store-down-time) so that PD does not start replenishing replicas while the node is stopped. Also, make sure all processes are shut down before backing up the directory.

Not sure if there are any pitfalls, but you can give it a try.
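
To illustrate the max down time point: PD treats a store that has been disconnected for longer than max-store-down-time (30 minutes by default) as down and starts creating replacement replicas on other stores. Below is a minimal Python sketch of that timing rule. The function name and the example durations are hypothetical, and the code only models the decision, not PD's actual scheduler.

```python
from datetime import timedelta

# Default value of PD's max-store-down-time setting.
DEFAULT_MAX_STORE_DOWN_TIME = timedelta(minutes=30)


def pd_will_replenish(outage, max_store_down_time=DEFAULT_MAX_STORE_DOWN_TIME):
    """Return True if the outage exceeds the threshold after which PD
    starts adding replacement replicas on other stores."""
    return outage > max_store_down_time


if __name__ == "__main__":
    maintenance = timedelta(hours=2)  # hypothetical maintenance window
    # With the default threshold, a 2-hour outage triggers replenishment.
    print(pd_will_replenish(maintenance))
    # With the threshold raised to 3 hours, it does not.
    print(pd_will_replenish(maintenance, timedelta(hours=3)))
```

Note that if the data directory itself is wiped, the instance comes back empty, so the replicas it held still have to be rebuilt from the other stores regardless of this setting.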

| username: tidb菜鸟一只 | Original post link

It is recommended to perform normal scaling operations for the two instances.

| username: TiDBer_an | Original post link

This type of backup does not include the data directory. After the node is restarted, will TiDB automatically replenish the data for this node?

| username: wangccsy | Original post link

You can’t stop it directly. If it causes inconsistency in transactions, it will result in fatal errors.

| username: dba远航 | Original post link

The standard practice should be followed: scale in first, then scale out.

| username: 像风一样的男子 | Original post link

Replace one KV at a time, not two together.
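
The reasoning behind "one at a time" is Raft majority: a Region stays available only while more than half of its replicas are alive. Below is a rough Python sketch, assuming 3 replicas per Region and hypothetical store names. Whether two replicas of the same Region can actually end up on one host depends on the cluster's label configuration, which this thread does not state.

```python
def region_available(replica_stores, stopped_stores):
    """A Raft group needs a strict majority of its replicas alive."""
    alive = [s for s in replica_stores if s not in stopped_stores]
    return len(alive) > len(replica_stores) // 2


if __name__ == "__main__":
    # Hypothetical Region with two of its three replicas on the same host "a".
    region = ["tikv-a1", "tikv-a2", "tikv-b1"]
    # Stopping one instance leaves 2 of 3 replicas: still a majority.
    print(region_available(region, {"tikv-a1"}))
    # Stopping both instances on host "a" leaves 1 of 3: majority lost.
    print(region_available(region, {"tikv-a1", "tikv-a2"}))
```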

| username: TIDB-Learner | Original post link

Mount a new large-capacity disk on the cloud host and scale out new instances with their data path pointing to it. Then scale in the instances that have the problem.

| username: andone | Original post link

First, scale out a new TiKV, then take the problematic TiKV node offline.
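
The "scale out first, then scale in" order can be sketched as a list of TiUP commands. This is only an illustration, assuming the cluster is managed with TiUP; the cluster name, topology file name, and node addresses are hypothetical placeholders, and the script prints the commands instead of running them.

```python
def replacement_plan(cluster, scale_out_topology, old_nodes):
    """Build the tiup commands, in order, for replacing TiKV instances."""
    # 1. Add the new TiKV instance(s) described in the topology file.
    plan = [f"tiup cluster scale-out {cluster} {scale_out_topology}"]
    # 2. Take the old instances offline one at a time.
    for node in old_nodes:
        plan.append(f"tiup cluster scale-in {cluster} --node {node}")
        # Check the store status here; wait until it shows Tombstone
        # before moving on to the next instance.
        plan.append(f"tiup cluster display {cluster}")
    # 3. Clean up the Tombstone instances.
    plan.append(f"tiup cluster prune {cluster}")
    return plan


if __name__ == "__main__":
    # All names and addresses below are hypothetical.
    for cmd in replacement_plan("my-cluster", "scale-out.yaml",
                                ["10.0.1.5:20160", "10.0.1.5:20161"]):
        print(cmd)
```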

| username: FutureDB | Original post link

Generally, scale-out and scale-in are used, which also avoids impact on online business. Isn't that better?

| username: zhanggame1 | Original post link

Isn't doing it directly too risky? Scale them one at a time.

| username: 路在何chu | Original post link

One at a time: scale out one, replace it, then scale in one, and then scale out again.

| username: Inkjade | Original post link

Please describe your cluster deployment situation, including the number of TiKV nodes and the overall cluster topology.
