TiFlash Node Cannot Be Forcefully Scaled Down

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash 节点无法强制缩容

| username: Hacker_ojLJ8Ndr

I tried to scale down an abnormal node with tiup, but the operation could not complete because of the node’s abnormality. I then forced the scale-down; afterwards the node was no longer visible in the cluster topology shown by tiup cluster display, but it was still present in PD. Executing pd-ctl store delete succeeded, but when checked again the store was still in offline status. Currently, all TiFlash replicas have been cleaned up, but this node still cannot be removed.

| username: tidb狂热爱好者 | Original post link

It’s best to upload screenshots and logs of your node.

| username: Billmay表妹 | Original post link

Based on your description: you wanted to scale down an abnormal node, but the process could not complete because of the node’s abnormality; you then forced the scale-down, yet the node still appears in PD. pd-ctl store delete succeeded, but the store remains in an offline state, and although all TiFlash replicas have been cleaned up, the node still cannot be removed. I can offer the following suggestions:

  1. First, you can try using the pd-ctl store delete command, which removes the store from PD, as shown below:

    pd-ctl store delete <store_id>
    

    Here, <store_id> is the ID of the store to be deleted. After executing this command, you can use pd-ctl store to check the store’s status. If the status changes to offline, the deletion has been accepted and region migration is in progress; once migration finishes, the status becomes tombstone.

  2. If you have already tried the pd-ctl store delete command but the store still exists in PD, you can try using the TiUP command to delete the store. Specifically, you can use the following command:

    tiup cluster scale-in <cluster-name> -N <node-address>
    

    Here, <cluster-name> is the name of your TiDB cluster, and <node-address> is the address (host:port) of the node to be removed. After executing this command, TiUP will remove the node from the cluster and update the cluster topology. If the store’s status is still offline, you can try restarting the TiKV process to make it rejoin the cluster.

  3. If the above methods do not solve the problem, you can try manually deleting the store. Specifically, you can follow these steps:

    • Stop the TiKV process on the store.
    • Delete the data directory on the store.
    • Delete the store in PD using the pd-ctl store delete command.
    • Delete the store in TiUP using the tiup cluster scale-in command.
    • Start the TiKV process on the store.

    After completing the above steps, you can use pd-ctl store to check the store’s status. If the old store no longer appears in the output, it has been successfully removed. A condensed command sketch of these steps follows below.
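A condensed sketch of those manual steps, using hypothetical values (store ID 5, node address 192.168.0.10:20170, cluster name mycluster, PD at <pd_ip>:2379):

    # stop the process on the node to be removed
    tiup cluster stop mycluster -N 192.168.0.10:20170
    # manually delete the data directory on that node (path depends on your deployment)
    # delete the store in PD
    pd-ctl -u http://<pd_ip>:2379 store delete 5
    # remove the node from the TiUP topology
    tiup cluster scale-in mycluster -N 192.168.0.10:20170
    # verify: the old store should no longer appear in the output
    pd-ctl -u http://<pd_ip>:2379 store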

| username: Hacker_ojLJ8Ndr | Original post link

Configuration during scaling:
[screenshot]

After scaling, the node developed an issue (the disk was abnormal), so we scaled it down.
tiup cluster display no longer shows this node.

pd-ctl store delete was successful:

However, when querying again, this node is still offline:

| username: Hacker_ojLJ8Ndr | Original post link

According to the official documentation, all rules have been deleted, but the node still hasn’t been removed.
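Following the official documentation, the rules were checked and removed roughly like this (the PD address and rule ID are placeholders):

    # list the placement rules in the tiflash group
    curl http://<pd_ip>:2379/pd/api/v1/config/rules/group/tiflash
    # delete a rule that still references the removed store
    curl -v -X DELETE http://<pd_ip>:2379/pd/api/v1/config/rule/tiflash/<rule_id>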

| username: Hacker_ojLJ8Ndr | Original post link

After executing unsafe remove-failed-stores 3048128666, the node status changed to Tombstone, but an error occurred when executing store remove-tombstone:
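For reference, the two pd-ctl operations were roughly as follows (the PD address is a placeholder):

    # force-clean the failed store's replicas; the store then showed as Tombstone
    pd-ctl -u http://<pd_ip>:2379 unsafe remove-failed-stores 3048128666
    # this step then returned an error
    pd-ctl -u http://<pd_ip>:2379 store remove-tombstone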

| username: h5n1 | Original post link

Does tiup cluster display show as tombstone? Execute tiup cluster prune.
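For example (the cluster name is a placeholder):

    tiup cluster prune <cluster-name>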

| username: Hacker_ojLJ8Ndr | Original post link

The node is no longer visible in tiup cluster display.

| username: 像风一样的男子 | Original post link

Try cleaning up the remnants in PD with pd-ctl -u http://pd_ip:2379 store remove-tombstone.

| username: h5n1 | Original post link

Check the status of this store again with pd-ctl store.

| username: h5n1 | Original post link

The regions haven’t fully migrated yet. Use pd-ctl region store xxx to check the store’s regions, then use pd-ctl operator add remove xxx to remove those peers from the TiFlash store, and then remove the tombstone store, as sketched below.
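A sketch with placeholder region and store IDs (the full subcommand for evicting a peer is operator add remove-peer):

    # list the regions that still have a peer on the store
    pd-ctl -u http://<pd_ip>:2379 region store <store_id>
    # remove such a peer from the TiFlash store (repeat for each remaining region)
    pd-ctl -u http://<pd_ip>:2379 operator add remove-peer <region_id> <store_id>
    # once no regions remain, clean up the tombstone store
    pd-ctl -u http://<pd_ip>:2379 store remove-tombstone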

| username: Hacker_ojLJ8Ndr | Original post link

I just re-executed the store remove-tombstone command. Although this node is not displayed on the interface, it still appears in the output of unsafe remove-failed-stores show. Does this need to be addressed?

| username: h5n1 | Original post link

If you can see it in pd-ctl store, it means it is still there.

| username: Hacker_ojLJ8Ndr | Original post link

There is another issue. Yesterday, after performing scale-in … --force, the store status was offline (the installation directory was cleaned up after forced offline, but the data directory was not cleaned up due to a failure, and later the data directory was manually deleted). After executing unsafe remove-failed-stores, the status changed to tombstone. Then, executing store remove-tombstone did not succeed. Later, I re-added the TiFlash replica to the table without performing any other operations. Today, I see that the store’s region count is still 23, but I can now execute store remove-tombstone. What could be the reason that allows store remove-tombstone to succeed today?

| username: Hacker_ojLJ8Ndr | Original post link

pd-ctl store shows that it is no longer there, but unsafe remove-failed-stores show still shows it. Does this need to be addressed?

| username: h5n1 | Original post link

It’s just a record of handling failed replica operations, no need to worry about it.

| username: h5n1 | Original post link

The unsafe remove-failed-stores command forcibly cleans up the replicas on the store. The offline status occurs during scale-in, or after a store is deleted, while its regions are being migrated away; once all migrations are complete, the store becomes tombstone.
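Progress can be checked with pd-ctl (placeholder store ID); the store’s region count should drop to 0 before the state changes to tombstone:

    # shows the store's state and how many regions still have peers on it
    pd-ctl -u http://<pd_ip>:2379 store <store_id>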

| username: Hacker_ojLJ8Ndr | Original post link

There is an offline node, and because pd-ctl store remove-tombstone was executed first, tiup cluster display shows the status as N/A. Executing prune returns Error: no store matching address “192.168.x:20171” found. How should this be handled?

| username: h5n1 | Original post link

After using pd-ctl store remove-tombstone, if pd-ctl store cannot see it, just use scale-in --force.
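For example, using the address from the error above (the cluster name is a placeholder):

    tiup cluster scale-in <cluster-name> -N 192.168.x:20171 --force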

| username: Hacker_ojLJ8Ndr | Original post link

The central control machine can no longer connect to the store. Is this forced cleanup performed internally by PD? After executing unsafe remove-failed-stores, although the store’s status is tombstone, pd-ctl store still shows regions on this node, yet remove-tombstone can now be executed.