After scaling down, can the backup made before scaling down still be restored? Does the environment for restoration need to be identical to the original cluster environment?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 缩容后,缩容前做的备份还能恢复吗?做恢复的环境一定要和原集群环境一样吗?

| username: TiDBer_BwNZ5U9X

[Test Environment for TiDB] Testing
[TiDB Version] 6.5.0 linux
[Reproduction Path] Scaled down a TiKV node, then restored the full backup taken before scaling down, and hit a connection error during restoration
[Encountered Problem: Phenomenon and Impact]
BR Full Restore failed with the following output:
[2024/04/24 14:59:39.179 +08:00] [INFO] [collector.go:69] ["Full Restore failed summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0] [split-region=1.56078ms] [restore-ranges=73]
Error: connection error: desc = "transport: error while dialing: dial tcp 10.10.21.103:20161: connect: connection refused"
[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]


Cluster information after scaling down
Used the command tiup cluster scale-in tidb-test --node 10.10.21.103:20161 to scale down. After running the scale-in, the node remained in Pending Offline status.

Later, used the --force option to remove the node completely.
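
For reference, the sequence described above looks roughly like this; the scale-in command is the one quoted above, and the display step is only a status check added for illustration:

```shell
# Scale in the TiKV node (command from the post above)
tiup cluster scale-in tidb-test --node 10.10.21.103:20161

# Check node status; a scaled-in TiKV stays in Pending Offline while its Regions migrate away
tiup cluster display tidb-test

# Forcing removal skips the Region migration; only safe if the remaining
# TiKV nodes can still hold all replicas
tiup cluster scale-in tidb-test --node 10.10.21.103:20161 --force
```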

| username: TiDBer_jYQINSnf | Original post link

With only 3 TiKVs, it can’t be scaled down, right?
I don’t think there should be any issues with backup and recovery.

| username: zhaokede | Original post link

Restoring the data has nothing to do with that.

| username: TiDBer_BwNZ5U9X | Original post link

Do you mean it has nothing to do with the node topology?

| username: TiDBer_BwNZ5U9X | Original post link

So, in theory, a backup can be restored to a cluster with a different number of nodes?

| username: 舞动梦灵 | Original post link

Restoration is not an issue, but since the default is 3 replicas, it's best to keep more than 3 TiKV servers.

| username: 像风一样的男子 | Original post link

The number of TiKV nodes cannot be less than the number of replicas (default is 3); otherwise the scale-down will fail. Also, the log you posted above shows that the restore failed. It is recommended that you first scale out a TiKV node to get the cluster healthy again, as sketched below.
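
If you take that route, a minimal scale-out sketch could look like the following; the host 10.10.21.104, the ports, and the file name scale-out.yaml are placeholders, not details from this thread:

```shell
# scale-out.yaml (hypothetical) describes the TiKV node to add, e.g.:
#   tikv_servers:
#     - host: 10.10.21.104
#       port: 20160
#       status_port: 20180

# Add the new TiKV node to the cluster
tiup cluster scale-out tidb-test scale-out.yaml

# Verify the new node comes Up and Regions start to rebalance onto it
tiup cluster display tidb-test
```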

| username: DBAER | Original post link

At least three replicas are needed to keep the cluster running normally.

| username: TiDBer_BwNZ5U9X | Original post link

That’s indeed the case. I tried a four-node backup and restored it to three nodes successfully. Thank you, everyone.

| username: TiDBer_jYQINSnf | Original post link

Backup and recovery do not need to have the same number of nodes as before; regions are constantly changing.

| username: TiDBer_BwNZ5U9X | Original post link

Indeed, that’s the case. Thank you.

| username: TiDBer_BwNZ5U9X | Original post link

Just to clarify, what do you mean by “replica”? Is it a specific component?

| username: zhanggame1 | Original post link

A backup only reads data from the leader replica of each Region, so it is unrelated to the number of nodes.
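
As an illustration, a minimal BR backup/restore sketch, assuming a PD at 10.10.21.101:2379 and a backup path that every TiKV node can reach (both are placeholders):

```shell
# Full backup: BR backs up data from each Region's leader,
# so the backup does not depend on how many TiKV nodes there are
tiup br backup full --pd "10.10.21.101:2379" --storage "local:///data/backup/full"

# Full restore: the restored Regions are split and scheduled by PD
# across whatever TiKV nodes the target cluster has
tiup br restore full --pd "10.10.21.101:2379" --storage "local:///data/backup/full"
```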

| username: TiDBer_BwNZ5U9X | Original post link

Uh, what does this replica refer to, PD?

| username: zhanggame1 | Original post link

By default, TiDB stores data in three copies, distributed across different TiKV nodes, with one leader and two followers.
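
If you want to confirm the replica count on your own cluster, one way is pd-ctl; the PD address below is a placeholder:

```shell
# Show replication settings; max-replicas is 3 by default
tiup ctl:v6.5.0 pd -u http://10.10.21.101:2379 config show replication
```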

| username: TiDBer_BwNZ5U9X | Original post link

Okay, thank you.

| username: TiDBer_QYr0vohO | Original post link

The number of machines must be at least the number of replicas, and the restore target does not have to match the original cluster.

| username: 像风一样的男子 | Original post link

The default configuration for a TiDB cluster is 3 replicas. Each Region has 3 copies stored in the cluster, and they use the Raft protocol to elect a Leader and replicate data. Raft guarantees that the service stays available and no data is lost as long as the number of failed or isolated nodes is less than half the number of replicas (note: replicas, not nodes).
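
For example, with 3 replicas a Region still has a majority (2 of 3) after losing one copy, so it tolerates 1 failure; with 5 replicas it tolerates 2.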