After scaling in a TiKV node, other nodes report Connection refused

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV缩容一个节点后,其他节点报 Connection refused

| username: magdb

[TiDB Usage Environment] Production Environment
[TiDB Version] 7.1.1
[Reproduction Path]

  1. Stop a TiKV node normally (tiup cluster stop cluster -N 192.168.1.XXX:20160)
  2. Scale in the node; the scale-in completes, but the node status shows as N/A
  3. Use the --force option to forcibly remove the node; the node disappears from the cluster
  4. Check the logs of the other TiKV nodes and find “Connection refused” errors from attempts to connect to the removed node

[Encountered Problem: Problem Phenomenon and Impact]
I would like to ask how to clear this error from the TiKV logs.
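For reference, the steps above correspond roughly to the following commands (a sketch only: the cluster name `cluster` and the node address are taken from step 1, and your environment may differ):

    # Step 1: stop the TiKV node
    tiup cluster stop cluster -N 192.168.1.XXX:20160

    # Step 2: scale in the node, then check its status
    tiup cluster scale-in cluster -N 192.168.1.XXX:20160
    tiup cluster display cluster

    # Step 3: forcibly remove the node with --force
    tiup cluster scale-in cluster -N 192.168.1.XXX:20160 --force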

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page

[Attachments: Screenshots/Logs/Monitoring]

| username: Jellybean | Original post link

The standard operation for scaling in is tiup cluster scale-in. Just to confirm, did you use stop?

| username: 小龙虾爱大龙虾 | Original post link

Why did you need to force the scale-in? And why stop the node before scaling in? If you did force the scale-in, just wait; once the Region replicas have been replenished, it will be fine.
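For example, you can check from pd-ctl whether any Regions are still short of replicas (a sketch; the PD address is a placeholder, and the ctl version should match the cluster version, here 7.1.1):

    # Regions that are missing a peer
    tiup ctl:v7.1.1 pd -u http://<pd-host>:2379 region check miss-peer
    # Regions that still have a peer on an offline store
    tiup ctl:v7.1.1 pd -u http://<pd-host>:2379 region check offline-peer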

| username: 像风一样的男子 | Original post link

The normal scale-in process doesn’t require stopping the node first; you can directly use tiup cluster scale-in. Don’t misuse the force scale-in option --force, as it can cause issues. Can this TiKV node still start now?
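As a sketch (cluster name and node address as in the original post), the direct scale-in and a test start of the node would look like:

    # Scale in directly, without stopping the node first
    tiup cluster scale-in cluster -N 192.168.1.XXX:20160
    # To answer the question above: try starting just this TiKV node
    tiup cluster start cluster -N 192.168.1.XXX:20160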

| username: xfworld | Original post link

Try this


Special Handling for Decommissioning

Because the decommissioning of the TiKV, TiFlash, and TiDB Binlog components is asynchronous (it must first be triggered through an API call) and takes a long time (you need to keep checking whether the node has actually finished decommissioning), these components receive special handling:

  • For operations on TiKV, TiFlash, and TiDB Binlog components:
    • tiup-cluster will exit immediately after decommissioning them via API without waiting for the decommissioning to complete.
    • Execute tiup cluster display to check the status of the decommissioned nodes and wait for their status to change to Tombstone.
    • Execute the tiup cluster prune command to clean up Tombstone nodes (see the command sketch after this list). This command will perform the following operations:
      • Stop the services of the decommissioned nodes.
      • Clean up the related data files of the decommissioned nodes.
      • Update the cluster topology to remove the decommissioned nodes.
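
A minimal sketch of that flow, assuming the cluster name `cluster` from the original post:

    # Watch the scaled-in node until its status changes to Tombstone
    tiup cluster display cluster

    # Then clean up the Tombstone node (stops it, removes its data, updates the topology)
    tiup cluster prune cluster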
| username: oceanzhang | Original post link

Stopping the node doesn’t take it offline directly; it only gets cleaned up after the scale-in.

| username: magdb | Original post link

I first stopped this node, and then performed the scale-in after all the Regions had been migrated to other nodes.

| username: magdb | Original post link

Because there was an issue with this node, we initially decided to stop it and observe. All Regions had already been migrated to other nodes. After some time, we decided to scale it in without restarting it. After executing the scale-in, we found that the node’s status became N/A, so we then used the --force option. Currently, the node is no longer in the cluster and is not visible in PD’s store list. However, the logs of the other nodes still show attempts to connect to this node. :joy:
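For reference, checking the stores registered in PD can be done with pd-ctl like this (a sketch; the PD address is a placeholder, and the ctl version matches the cluster version 7.1.1):

    # List all stores known to PD; the removed node no longer appears here
    tiup ctl:v7.1.1 pd -u http://<pd-host>:2379 store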

| username: tidb菜鸟一只 | Original post link

Do all TiKV nodes need to be reloaded?
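If so, a rolling reload of the TiKV component would look roughly like this (a sketch, assuming the cluster name `cluster` as in the original post):

    # Rolling reload of all TiKV instances in the cluster
    tiup cluster reload cluster -R tikv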

| username: h5n1 | Original post link

Check the overview → PD → abnormal stores monitoring.

| username: magdb | Original post link

It shows like this

| username: 像风一样的男子 | Original post link

Check out the expert’s three tricks for handling abnormal offline situations:

| username: Jellybean | Original post link

Using --force means ignoring the node’s internal offline state transitions, the registration and connections between nodes, and so on, and directly erasing the node’s directory and data. This operation should not be used lightly; it is meant only for extreme scenarios, otherwise all kinds of undefined anomalies are likely to occur. Use it with caution in the future.

The good news is that you confirmed the data had already been migrated before forcibly erasing it, so there is no data loss. The other nodes still hold the old node’s information, so first confirm whether this actually has any impact. If it doesn’t, there should be no major problem; you can then find a suitable maintenance window to restart the TiKV cluster, which should clear out this stale information. To be safer, first simulate and verify the procedure in a test environment.
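A minimal sketch of that restart, assuming the cluster name `cluster` (schedule it in a maintenance window):

    # Restart only the TiKV component of the cluster
    tiup cluster restart cluster -R tikv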

| username: h5n1 | Original post link

Use pd-ctl region and then grep to check whether any Region still references the store you took offline.
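Something like this (a sketch; the PD address and the old store ID are placeholders to substitute):

    # Dump Region info from PD and look for peers that still reference the removed store
    tiup ctl:v7.1.1 pd -u http://<pd-host>:2379 region | grep '"store_id": <old-store-id>'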

| username: magdb | Original post link

There is no record of the offline node in the store output. Is there any other way to find it? It’s no longer in PD. :sweat_smile:

| username: magdb | Original post link

Okay, I’ll find some free time to test it. Thanks, boss.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.