How to Recover When TiUP's Persisted Data Is Completely Lost?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiUP持久化数据完全丢失,如何恢复?

| username: Kyle

【TiDB Usage Environment】Production
【TiDB Version】4.x
【Encountered Problem】The machine hosting TiUP has completely crashed, so the cluster can no longer be managed
【Reproduction Path】None
【Problem Phenomenon and Impact】

We have a 4.x environment with components such as PD, TiDB, TiKV, and TiFlash distributed across more than 10 machines. Normally, deployment and maintenance are done through TiUP on the central control machine. The problem is that the control machine has failed and is gone, and there is no backup of TiUP's persisted data, so we can no longer perform maintenance, migration, or other operations on this cluster. Is it possible to fall back to manual cluster maintenance in this situation (I have not seen any official documentation on manual maintenance), or is there a way to rebuild TiUP?

| username: xfworld | Original post link

Find a new control machine to rebuild the environment.

  1. Install the tiup toolkit.

  2. Write a topology.yaml containing the information of the original cluster nodes (a minimal sketch follows these steps).

  3. tiup cluster deploy tidb-xxx <version> ./topology.yaml (the version must match the one the cluster is running)

  4. tiup cluster display tidb-xxx

Check the status of the cluster…
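
For reference, here is a minimal sketch of steps 2 and 3. Every host, directory, the cluster name tidb-xxx, and the version string are placeholders; they must match the original deployment exactly (in particular deploy_dir and data_dir), otherwise the rebuilt metadata will not line up with what is on the nodes.

```shell
# Hypothetical example only: all IPs, directories, the cluster name, and the
# version below are placeholders; they must match the original deployment.
cat > topology.yaml <<'EOF'
global:
  user: "tidb"
  deploy_dir: "/tidb-deploy"
  data_dir: "/tidb-data"

pd_servers:
  - host: 10.0.1.1
  - host: 10.0.1.2
  - host: 10.0.1.3

tidb_servers:
  - host: 10.0.1.4
  - host: 10.0.1.5

tikv_servers:
  - host: 10.0.1.6
  - host: 10.0.1.7
  - host: 10.0.1.8

tiflash_servers:
  - host: 10.0.1.9
EOF

# Deploy against the reconstructed topology, then check what tiup sees.
tiup cluster deploy tidb-xxx v4.0.16 ./topology.yaml   # v4.0.16 = the exact running version
tiup cluster display tidb-xxx
```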

It is recommended to set up a scheduled task to regularly back up the tiup environment information.
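
For example, something as simple as the following crontab entry would cover it (the schedule, backup directory, and remote host are assumptions):

```shell
# Assumed crontab entry on the control machine: archive ~/.tiup nightly at
# 02:00 and copy it to another host ('%' must be escaped inside crontab).
0 2 * * * tar czf /backup/tiup-$(date +\%F).tar.gz -C $HOME .tiup && scp /backup/tiup-$(date +\%F).tar.gz backup-host:/data/tiup-backups/
```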

| username: Kyle | Original post link

Is tiup cluster deploy tidb-xxx <version> ./topology.yaml strictly idempotent? The cluster metadata can probably be reconstructed, but I cannot fully confirm it, and I am afraid the operation might overwrite the original resources if something goes wrong.

| username: Hacker_5IopoGHq | Original post link

You need to check the parameters of the PD, TiDB, and TiKV instances on each node one by one, to avoid overwriting parameters or ending up inconsistent with the previous settings. The operation itself is idempotent.
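
One way to do that check (a sketch with a placeholder host, port, and cluster name) is to compare the configuration held in the rebuilt tiup metadata against what the running instances report:

```shell
# What the rebuilt tiup metadata would push to the nodes
# (inspect only, quit the editor without saving):
tiup cluster edit-config tidb-xxx

# What the running instances actually use; TiDB 4.0+ exposes this over SQL.
mysql -h 10.0.1.4 -P 4000 -u root -p -e \
  "SHOW CONFIG WHERE type = 'pd'   AND name LIKE 'schedule.%'"
mysql -h 10.0.1.4 -P 4000 -u root -p -e \
  "SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'storage.%'"
```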

| username: Kyle | Original post link

Manually reconstructed the configuration file; the cluster has been restored.

| username: 啦啦啦啦啦 | Original post link

:call_me_hand::call_me_hand::call_me_hand: It’s better to regularly perform tiup backups to avoid similar situations in the future.

| username: Kyle | Original post link

Well, it would be great if the tiup tool were more seamlessly integrated with git, similar to Ansible, with all the data that needs to be persisted stored in git, following the so-called GitOps approach. :joy:

| username: forever | Original post link

Databases definitely should not be exposed to the external network… It would be great if it could automatically send a copy to all machines in the cluster, as the probability of all machines in the cluster failing is very low. :grimacing:

| username: alfred | Original post link

Key configuration files still need to be backed up.

| username: HACK | Original post link

Which cluster configuration changes require backing up the .tiup directory? Changes to the topology? To the cluster configuration file?

| username: Kyle | Original post link

A backup should be taken after any command that modifies the cluster. Tiup now has a backup command, but you still need to write your own backup scripts and such. I am planning to put the underlying tiup data on a cloud drive.
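
As a sketch of such a script (the repo path, rsync target, and git remote are assumptions), one option is a small wrapper that snapshots the tiup cluster metadata into a git repository after every modifying command, which also gets close to the GitOps idea mentioned above:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the requested tiup cluster command, then commit
# the cluster metadata under ~/.tiup to a local git repository, which can be
# pushed to an intranet GitLab or copied to a cloud drive.
set -euo pipefail

tiup cluster "$@"

REPO=/backup/tiup-meta            # assumed: a pre-initialized git repo
rsync -a --delete "$HOME/.tiup/storage/cluster/" "$REPO/cluster/"

cd "$REPO"
git add -A
git commit -m "tiup cluster $* ($(date -Iseconds))" || true   # no-op if nothing changed
git push 2>/dev/null || true                                   # only if a remote is configured
```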

| username: Kyle | Original post link

Are you referring to GitHub on the external network? I was talking about privately deploying a Git server such as GitLab on the corporate intranet.

| username: forever | Original post link

That works. My idea is that, if it could be done non-intrusively, TiDB could automatically distribute the backups to other nodes, similar to TiKV's multiple replicas. :grin:

| username: Kyle | Original post link

Similar to command-line tools like kubectl for Kubernetes and ceph, it could be installed on all nodes of the cluster, with the persistent data stored inside the cluster itself. If it also had import and export functions, that would be perfect.

| username: jansu-dev | Original post link

There is nothing secret in this data, so you can manually install tiup;
Then manually reconstruct the meta.yaml (fill it in from memory) and run a reload. The IPs, ports, configurations, and labels in it must be correct;
After the reload it should recover. The corresponding information can be collected from the surviving machines with pd-ctl (store info), and if you can still connect to the database, the parameters are available in cluster_config.