A machine in the TiDB cluster has failed: scale out first, then take the faulty machine offline

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb 集群某一台机器出现故障后,准备扩容后,再将故障的机器下线。

| username: withseid

Currently, the TiDB cluster has three machines. On machine 37, many commands (rm, dpkg, vim, etc.) can no longer be run because of a problem with the gcc package, and SSH connections to the machine fail. Strangely, though, when checking the status of each node with tiup cluster display cluster_name, every node is still shown as Up.
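A minimal sketch of the two checks described above (the cluster name rimedata and the tidb user are inferred from the log paths later in this post, so treat them as assumptions):

# Cluster status as reported by tiup -- every node, including those on 10.20.70.37, shows Up
tiup cluster display rimedata

# Direct SSH check against the faulty machine, which fails (connection reset)
ssh tidb@10.20.70.37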

The plan is to add a new machine to the cluster and, once the scale-out is complete, take the nodes deployed on the faulty machine offline. So for now we are scaling out two nodes, one PD and one TiKV, onto machine 27. When running the scale-out, the following error was reported:

Error: init config failed: 10.20.70.37:12379: transfer from /home/tidb/.tiup/storage/cluster/clusters/rimedata/config-cache/pd-10.20.70.37-12379.service to /tmp/pd_c72ae045-ebb9-46c4-bdff-53857cbad262.service failed: failed to scp /home/tidb/.tiup/storage/cluster/clusters/rimedata/config-cache/pd-10.20.70.37-12379.service to tidb@10.20.70.37:/tmp/pd_c72ae045-ebb9-46c4-bdff-53857cbad262.service: ssh: handshake failed: read tcp 10.20.70.39:35672->10.20.70.37:22: read: connection reset by peer

Verbose debug logs have been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2022-06-22-02-09-55.log.
Error: run `/home/tidb/.tiup/components/cluster/v1.7.0/tiup-cluster` (wd:/home/tidb/.tiup/data/T9RrUFS) failed: exit status 1

The cause of this error is that tiup cannot SSH into machine 37.

However, the error does not seem to have affected the scale-out itself: checking the cluster status again with tiup cluster display cluster_name shows that both the new TiKV and the new PD have been added successfully.
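For reference, the scale-out itself followed the standard tiup flow. The sketch below is illustrative only: the host 10.20.70.27 and the TiKV deploy directory come from the output later in this post, while the file name, PD client port, and data directory are assumptions.

# scale-out.yaml -- one new PD and one new TiKV on machine 27
cat > scale-out.yaml <<'EOF'
pd_servers:
  - host: 10.20.70.27
    client_port: 12379                   # assumed to match the non-default PD port used elsewhere in this cluster
tikv_servers:
  - host: 10.20.70.27
    deploy_dir: /ssd/tidb-deploy/tikv-20160
    data_dir: /ssd/tidb-data/tikv-20160  # assumed
EOF

# Run the scale-out; the init-config step also rewrites configs on the existing nodes,
# which is why the SSH failure against 10.20.70.37 shows up here
tiup cluster scale-out rimedata scale-out.yaml

Once the command finishes, tiup cluster display rimedata should list the new instances, which matches what was observed above.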

Additionally, using the ctl component (pd-ctl), we can see that the new TiKV store on machine 27 is taking part in region balancing and receiving regions from the other stores:

Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.3.0/ctl pd -u http://10.20.70.39:12379 store
{
  "count": 4,
  "stores": [
    {
      "store": {
        "id": 240720,
        "address": "10.20.70.27:20160",
        "labels": [
          {
            "key": "host",
            "value": "tikv4"
          }
        ],
        "version": "5.3.0",
        "status_address": "10.20.70.27:20180",
        "git_hash": "6c1424706f3d5885faa668233f34c9f178302f36",
        "start_timestamp": 1655891928,
        "deploy_path": "/ssd/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1655878605388129027,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.438TiB",
        "available": "3.16TiB",
        "used_size": "66.96GiB",
        "leader_count": 3214,
        "leader_weight": 1,
        "leader_score": 3214,
        "leader_size": 261006,
        "region_count": 3255,
        "region_weight": 1,
        "region_score": 287107.11842109315,
        "region_size": 264534,
        "slow_score": 1,
        "start_ts": "2022-06-22T09:58:48Z",
        "last_heartbeat_ts": "2022-06-22T06:16:45.388129027Z"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "10.20.70.39:20160",
        "labels": [
          {
            "key": "host",
            "value": "tikv3"
          }
        ],
        "version": "5.3.0",
        "status_address": "10.20.70.39:20180",
        "git_hash": "6c1424706f3d5885faa668233f34c9f178302f36",
        "start_timestamp": 1654872483,
        "deploy_path": "/ssd/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1655878598630384863,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.718TiB",
        "available": "960.3GiB",
        "used_size": "649.6GiB",
        "leader_count": 9194,
        "leader_weight": 1,
        "leader_score": 9194,
        "leader_size": 753548,
        "region_count": 29676,
        "region_weight": 1,
        "region_score": 2986624.451327592,
        "region_size": 2416447,
        "slow_score": 1,
        "start_ts": "2022-06-10T14:48:03Z",
        "last_heartbeat_ts": "2022-06-22T06:16:38.630384863Z",
        "uptime": "279h28m35.630384863s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.20.70.38:20160",
        "labels": [
          {
            "key": "host",
            "value": "tikv2"
          }
        ],
        "version": "5.3.0",
        "status_address": "10.20.70.38:20180",
        "git_hash": "6c1424706f3d5885faa668233f34c9f178302f36",
        "start_timestamp": 1654875638,
        "deploy_path": "/ssd/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1655878597208326791,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.718TiB",
        "available": "976GiB",
        "used_size": "647.7GiB",
        "leader_count": 9186,
        "leader_weight": 1,
        "leader_score": 9186,
        "leader_size": 744136,
        "region_count": 29694,
        "region_weight": 1,
        "region_score": 2986642.1116330107,
        "region_size": 2420897,
        "slow_score": 1,
        "sending_snap_count": 1,
        "start_ts": "2022-06-10T15:40:38Z",
        "last_heartbeat_ts": "2022-06-22T06:16:37.208326791Z",
        "uptime": "278h35m59.208326791s"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "10.20.70.37:20160",
        "labels": [
          {
            "key": "host",
            "value": "tikv1"
          }
        ],
        "version": "5.3.0",
        "status_address": "10.20.70.37:20180",
        "git_hash": "6c1424706f3d5885faa668233f34c9f178302f36",
        "start_timestamp": 1654853824,
        "deploy_path": "/ssd/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1655878601489828812,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.718TiB",
        "available": "979.4GiB",
        "used_size": "652.6GiB",
        "leader_count": 9192,
        "leader_weight": 1,
        "leader_score": 9192,
        "leader_size": 749305,
        "region_count": 29734,
        "region_weight": 1,
        "region_score": 2986728.814431181,
        "region_size": 2422184,
        "slow_score": 1,
        "start_ts": "2022-06-10T09:37:04Z",
        "last_heartbeat_ts": "2022-06-22T06:16:41.489828812Z",
        "uptime": "284h39m37.489828812s"
      }
    }
  ]
}
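To follow the rebalancing, the same store query can be repeated, or the new store can be inspected directly; its leader_count and region_count should keep growing. The store ID 240720 and the PD address are taken from the output above:

# Inspect only the newly added store (ID 240720) via pd-ctl
tiup ctl:v5.3.0 pd -u http://10.20.70.39:12379 store 240720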

However, when checking the number of instances in Grafana's rimedata-Overview dashboard, the newly added PD and TiKV instances do not appear. At this point there should be 4 PD instances and 4 TiKV instances.

The TiKV-Details page has the same problem: no data is visible for the newly added TiKV node.

Is it normal that, even though the scale-out succeeded, the newly added TiKV node's information does not show up in Grafana?
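As a side note, the follow-up plan for taking the nodes on machine 37 offline once the new store has caught up is roughly the sketch below. The node IDs are taken from the addresses above; whether --force is required for a host that cannot be reached over SSH should be confirmed against the tiup documentation, so treat that part as an assumption.

# Scale in the TiKV and PD instances on the faulty machine after rebalancing completes
tiup cluster scale-in rimedata --node 10.20.70.37:20160,10.20.70.37:12379

# If tiup cannot SSH into the host to clean it up, --force may be needed (this removes the
# nodes without contacting them, so use it with care):
# tiup cluster scale-in rimedata --node 10.20.70.37:20160,10.20.70.37:12379 --force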

| username: ddhe9527 | Original post link

Try changing the time range displayed in Grafana so that it covers the period after the scale-out.

| username: withseid | Original post link

The time in the upper right corner?

| username: HACK | Original post link

Yes, you can select the time range in the upper right corner.
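If the new instances still do not show up after widening the time range, a general way to confirm that Prometheus is scraping them at all is to query the up metric directly. This is not from the thread: the Prometheus host and the instance label value (TiKV's status address) are assumptions.

# Ask Prometheus whether the new TiKV's status endpoint is being scraped (1 = up)
curl -G 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=up{instance="10.20.70.27:20180"}'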