I have to create a DR setup for a 50TB+ TiDB cluster

Application environment:

Production

TiDB version: TiDB-v4.0.14

Problem:

I have to take a backup of a 50TB+ TiDB cluster and restore it on a new TiDB cluster as part of our DR strategy. To take a backup using BR, we have to make sure tidb_gc_enable=FALSE, or set the GC life time to a high value, so that garbage collection resumes only after TiCDC catches up. But if we disable GC, space usage and performance on production are compromised. Is there any other methodology we can adopt? Backing up and restoring such a huge dataset will take a long time, and the performance hit and increased storage usage don't look feasible.

Here is a detailed Disaster Recovery (DR) plan for a 50TB+ TiDB cluster. It covers the key components: backup strategies, incremental backup options, replication using TiCDC, and leveraging cloud storage solutions.

1. Backup Strategies

Full Backup

  • Backup & Restore (BR): For large datasets, BR is the preferred tool for full backups. It allows you to back up the entire cluster data at a specific point in time. BR can handle large volumes efficiently and supports both full and incremental backups. For more details, refer to the TiDB Backup & Restore Overview.
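
As a rough sketch (the PD address, bucket, and region are placeholders, and the exact flags available depend on your BR version), a full backup to S3 looks like this:

```shell
# Full cluster backup to S3 (placeholder endpoints and bucket).
# --ratelimit caps the upload speed per TiKV node (MiB/s) to limit
# the impact on production traffic.
br backup full \
    --pd "10.0.1.10:2379" \
    --storage "s3://dr-backups/full-backup" \
    --s3.region "us-east-1" \
    --ratelimit 128 \
    --log-file backup_full.log
```

Note that BR holds a GC safepoint for the backup snapshot while it runs; the longer-term GC concern in this scenario comes from TiCDC needing the post-backup changes to still be available (see sections 2 and 3).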

Alternative Backup Methods

  • Dumpling: For smaller datasets or specific tables, Dumpling can be used to export data. It is less efficient for very large datasets compared to BR, but it is useful for targeted backups (a sketch follows after this list).
  • Cloud Storage: Utilize cloud storage solutions like Amazon S3, Google Cloud Storage (GCS), or Azure Blob Storage for storing backup data. This ensures data durability and accessibility in case of a disaster. More information can be found in the TiDB Cloud Backup and Restore documentation.
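
For a targeted Dumpling export, a minimal sketch (host, credentials, and table names are placeholders):

```shell
# Export selected tables as SQL files (placeholder host/credentials).
# -t sets the number of export threads; -F splits output files at
# roughly 256 MiB; -T restricts the export to the listed tables.
dumpling \
    -h 10.0.1.20 -P 4000 -u backup_user -p "secret" \
    --filetype sql \
    -t 8 \
    -F 256MiB \
    -T mydb.orders,mydb.customers \
    -o /data/dumpling-export
```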

2. Incremental Backup Options

  • Incremental Backups with BR: Incremental backups capture only the changes made since the last backup, reducing storage requirements and backup time. Use the --lastbackupts option with BR to specify the last backup timestamp and capture only the changes since then. This method is efficient for maintaining up-to-date backups without the overhead of full backups. For more details, see the Incremental Backup Guide.
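
A sketch of the incremental flow, reusing the placeholder storage paths from above: first read the previous backup's end timestamp, then pass it via --lastbackupts:

```shell
# Read the end timestamp (TSO) of the previous full backup.
LAST_BACKUP_TS=$(br validate decode \
    --field="end-version" \
    --storage "s3://dr-backups/full-backup" \
    --s3.region "us-east-1" | tail -n1)

# Back up only the changes made since that timestamp.
br backup full \
    --pd "10.0.1.10:2379" \
    --storage "s3://dr-backups/incr-backup" \
    --s3.region "us-east-1" \
    --lastbackupts "${LAST_BACKUP_TS}" \
    --ratelimit 128
```

Between the two backups, GC must not have cleaned up the MVCC versions written after LAST_BACKUP_TS, which is where the GC life time setting comes in (see the performance section below).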

3. Replication with TiCDC

  • TiCDC for Real-Time Replication: TiCDC is a tool designed to replicate incremental data changes from TiDB to downstream systems like MySQL, Kafka, or other TiDB clusters. It supports scenarios such as high availability and disaster recovery by ensuring eventual consistency between primary and secondary clusters. TiCDC can handle large transactions and provides options for bidirectional replication. For more information, refer to the TiCDC Overview.
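
To replay post-backup changes into the restored DR cluster, a changefeed can be started from the backup's end timestamp. A sketch with placeholder addresses; note that --start-ts must still be within the primary cluster's GC window, which is why GC has to be held back until the changefeed is running:

```shell
# Create a changefeed that replicates changes since the backup
# snapshot into the restored DR cluster (placeholder addresses).
cdc cli changefeed create \
    --pd "http://10.0.1.10:2379" \
    --sink-uri "tidb://root:@10.0.2.10:4000/" \
    --changefeed-id "dr-replication" \
    --start-ts "${LAST_BACKUP_TS}"
```

Once `cdc cli changefeed query` shows the changefeed has caught up, normal GC settings can be restored on the primary.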

4. Cloud Storage Solutions

  • TiDB Cloud: Utilize TiDB Cloud for managed database services, which include built-in backup and restore capabilities. TiDB Cloud supports both snapshot and log backups, allowing for point-in-time recovery. This service ensures high availability and data security with features like multi-AZ deployments and automatic failover. For more details, see the TiDB Cloud Introduction.

Performance and Storage Impact Minimization

  • Resource Management: Adjust the number of backup threads and use rate limiting to minimize the impact on cluster performance during backup operations. In BR these are controlled with command-line flags such as --concurrency and --ratelimit (as used in the sketches above).
  • Scheduling: Schedule backups during off-peak hours to reduce the impact on transactional workloads.
  • Compression: Use data compression options available in BR to reduce storage space requirements for backups.
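
On the GC point raised in the question: instead of disabling GC outright, a common middle ground is to temporarily extend the GC life time and revert it once the restore and TiCDC catch-up are done. In v4.0 these settings live in the mysql.tidb table; a sketch with placeholder connection details:

```shell
# Inspect the current GC settings (v4.0 stores them in mysql.tidb).
mysql -h 10.0.1.10 -P 4000 -u root -e \
  "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb \
   WHERE VARIABLE_NAME IN ('tikv_gc_enable', 'tikv_gc_life_time');"

# Temporarily extend the GC life time (e.g. to 48h), then revert it
# to the default (10m0s) after restore and TiCDC catch-up complete.
mysql -h 10.0.1.10 -P 4000 -u root -e \
  "UPDATE mysql.tidb SET VARIABLE_VALUE = '48h' \
   WHERE VARIABLE_NAME = 'tikv_gc_life_time';"
```

This trades some extra MVCC storage on the primary for the ability to keep GC running on a delay rather than stopping it entirely.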

By implementing these strategies, you can ensure a robust disaster recovery plan for your TiDB cluster, minimizing downtime and data loss in the event of a disaster. For further details, you can explore the provided links to the official documentation.

@prateekrckstr Could you please let us know how long your current backup process takes? If the GC is stuck for a day, it generally doesn’t affect performance too much. However, if your cluster is already running at its limits, you might consider scaling it up. After scaling, you can try the backup again.

@Hazel We haven't taken a backup of this cluster yet, so we are unsure about the backup ETA. The main challenge is that even after the backup completes, I can re-enable GC only once the restore on the replica server is finished and TiCDC is in sync.