TiDB Replication Setup

Application environment: Production

TiDB version: TiDB-v4.0.14

Problem:

We need to back up over 50TB of data from a TiDB cluster and restore it to a new cluster as part of our disaster recovery strategy. When using BR for backups, we either need to set tidb_gc_enable=FALSE or increase the garbage collection life time significantly so that GC reclaims old versions only after TiCDC has fully caught up. However, disabling garbage collection impacts production performance and storage utilization, which isn’t ideal given the time and resources required for such a large-scale backup and restore process. Is there an alternative approach that avoids these challenges while ensuring efficient and reliable data transfer?
Could someone please suggest a strategy to set up this replication without impacting production performance or compromising storage utilization?

To achieve a reliable disaster recovery strategy for a TiDB-v4.0.14 cluster with over 50TB of data, follow these steps:

  1. Full Backup with BR:
    To perform a full backup of a TiDB-v4.0.14 cluster with over 50TB of data using BR (Backup & Restore), it is crucial to plan carefully to minimize the impact on production performance and manage storage utilization effectively. Below is a detailed plan:

1. Preparation

a. Assess the Environment

  • Cluster Size and Configuration: Understand the current cluster setup, including the number of nodes, storage configuration, and network bandwidth.
  • Data Growth Rate: Estimate the data growth rate to ensure the backup plan accommodates future data increases.

b. Backup Storage

  • Storage Capacity: Ensure that the backup storage has sufficient capacity to handle over 50TB of data. Consider a scalable backend that BR supports directly, such as Amazon S3 or Google Cloud Storage, or a shared filesystem mounted on every TiKV node.
  • Network Bandwidth: Verify that the network bandwidth between the TiDB cluster and the backup storage is adequate to handle the data transfer within the desired time frame.
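
As a rough, illustrative calculation of the transfer time (the node count is an assumption): with BR rate-limited to 100 MB/s per TiKV node and, say, 10 TiKV nodes, aggregate throughput is about 1,000 MB/s, so 50 TB ≈ 50,000,000 MB would take on the order of 50,000 seconds, i.e. roughly 14 hours, provided the network and backup storage can sustain that rate. Actual duration depends on hardware, data layout, and the rate limit you choose.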

2. Backup Strategy

a. Schedule Backups During Off-Peak Hours

  • Identify Off-Peak Hours: Analyze the cluster’s usage patterns to determine the least busy times, typically during late nights or weekends.
  • Automate Scheduling: Use cron jobs or a task scheduler to automate the backup process during these off-peak hours.
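
For instance, a minimal crontab entry might look like the sketch below; the script path, log path, and schedule are placeholders to adapt to your environment:

    # Run a BR full-backup wrapper script every Saturday at 01:00 and append output to a log.
    # Paths and schedule are illustrative only.
    0 1 * * 6 /opt/scripts/br_full_backup.sh >> /var/log/br_full_backup.log 2>&1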

b. Use Rate Limiting

  • Rate Limiting with BR: Use the --ratelimit option to cap the backup speed, for example at 100 MiB/s per TiKV node. This helps reduce the load on the network and storage systems, minimizing the impact on production performance.

3. Execution

a. Initiate the Backup

  • Command: Use the following command to start the backup process:
    br backup full --pd "${PD_ADDRESS}" --storage "${BACKUP_STORAGE_PATH}" --ratelimit 100
    
    Replace ${PD_ADDRESS} with the address of the PD server and ${BACKUP_STORAGE_PATH} with the path to the backup storage. The --ratelimit value is interpreted as MiB/s per TiKV node.

b. Monitor the Backup Process

  • Logging and Alerts: Enable detailed logging and set up alerts to monitor the backup process. This helps in quickly identifying and resolving any issues that arise.
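
One simple way to do this (a sketch, assuming BR is invoked with its --log-file option; the paths are placeholders) is to write BR’s output to a dedicated log file and watch it for progress and error messages:

    # Same backup command as above, but with a dedicated log file.
    br backup full --pd "${PD_ADDRESS}" --storage "${BACKUP_STORAGE_PATH}" \
        --ratelimit 100 --log-file /var/log/br_backup.log

    # Follow the log for progress and errors (exact log wording varies by BR version).
    tail -f /var/log/br_backup.log | grep -iE "progress|error"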

4. Post-Backup Considerations

a. Verify Backup Integrity

  • Checksum Verification: After the backup is complete, verify the integrity of the backup data using checksums to ensure data consistency.
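
BR computes table checksums as part of the backup when its --checksum option is enabled (it defaults to on in the versions I have used). As an additional manual spot check, you can run ADMIN CHECKSUM TABLE on selected tables through any MySQL-compatible client; the host, port, and table names below are placeholders, and note that this statement scans the table and can be slow on very large tables:

    # Manual checksum spot check on one table via the MySQL-compatible client (illustrative names).
    mysql -h tidb-host -P 4000 -u root -p \
        -e "ADMIN CHECKSUM TABLE db_name.table_name;"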

b. Storage Management

  • Retention Policy: Implement a retention policy to manage backup storage utilization. Regularly delete old backups that are no longer needed (see the cleanup example after this list).
  • Compression: Consider compressing the backup data to save storage space, if supported by the backup storage solution.
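
If backups land on a locally mounted filesystem, a simple cleanup job can enforce the retention policy mentioned above (the path and retention period are placeholders); for object stores such as S3 or GCS, a bucket lifecycle rule is usually a better fit:

    # Remove local backup directories older than 30 days (illustrative path and age).
    find /backups/tidb-full -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +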

5. Performance and Impact Analysis

a. Performance Testing

  • Impact Assessment: Conduct performance testing during the backup process to assess its impact on the production environment. Adjust the rate limit or backup schedule as necessary based on the results.

b. Continuous Improvement

  • Feedback Loop: Gather feedback from the operations team and continuously refine the backup process to improve efficiency and reduce impact on production.

By following this detailed plan, you can effectively perform a full backup of a TiDB-v4.0.14 cluster with over 50TB of data while minimizing the impact on production performance and managing storage utilization efficiently.

  2. Incremental Replication with TiCDC:
    To replicate incremental changes from a TiDB v4.0.14 cluster to a new cluster using TiCDC, you need to carefully plan and configure the replication process. Here is a detailed plan that includes considerations for minimizing network and storage impact:

Step 1: Environment Preparation

  1. Ensure Compatibility: Verify that the target cluster is compatible with TiDB v4.0.14. Check for any version-specific features or configurations that might affect replication.

  2. Network Configuration: Ensure that the network between the source and target clusters is reliable and has sufficient bandwidth to handle the data replication load. Consider using a dedicated network link if possible to minimize latency and packet loss.

  3. Security Setup: Configure secure connections between the clusters. Use TLS/SSL to encrypt data in transit to protect sensitive information.

Step 2: TiCDC Setup

  1. Install TiCDC: Deploy TiCDC on a separate node or nodes to avoid resource contention with the TiDB cluster. Ensure that the TiCDC version is compatible with TiDB v4.0.14.

  2. Configure Sink URI: Use the following format to configure the sink URI for the target (downstream) TiDB cluster, which TiCDC writes to through its MySQL-compatible sink:

    --sink-uri="mysql://user:password@target-cluster:4000/"
    

    Replace user, password, and target-cluster with the appropriate credentials and address for your target cluster (the default TiDB SQL port is 4000).

  3. Resource Allocation: Allocate sufficient CPU and memory resources to the TiCDC nodes to handle the expected load. Monitor resource usage and adjust as necessary.

Step 3: Replication Configuration

  1. Changefeed Creation: Create a changefeed to specify which tables or databases to replicate. Use filters to exclude unnecessary data and reduce network and storage impact (see the command sketch after this list).

  2. Initial Snapshot: TiCDC does not copy an initial snapshot of the data; it only replicates incremental changes from a chosen start-ts. The initial data comes from the BR full backup, so set the changefeed’s start-ts to the backup timestamp to avoid missing changes made between the backup and the start of replication.

  3. Batch Size and Rate Limiting: TiCDC’s MySQL sink accepts parameters such as worker-count and max-txn-row (appended to the sink URI as query parameters) to control write concurrency and batch size. Smaller batches can reduce peak network usage but may increase overall replication time.
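
As an illustration of the changefeed creation and filtering described above (a sketch, assuming the cdc binary is available on a TiCDC node; the PD endpoint, credentials, database names, changefeed ID, and file path are placeholders, and the configuration keys should be checked against the TiCDC documentation for your exact version):

    # Write a minimal filter configuration (hypothetical database names).
    cat > changefeed.toml <<'EOF'
    [filter]
    rules = ['app_db1.*', 'app_db2.*']
    EOF

    # Create the changefeed from the source cluster's PD to the downstream sink.
    cdc cli changefeed create \
        --pd="http://${PD_ADDRESS}" \
        --sink-uri="mysql://user:password@target-cluster:4000/" \
        --changefeed-id="dr-replication" \
        --config=changefeed.toml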

Step 4: Monitoring and Optimization

  1. Monitor Replication Lag: Use TiCDC monitoring tools to track replication lag and ensure that changes are being applied in a timely manner (see the example commands after this list).

  2. Optimize Network Usage: If cross-site bandwidth is a concern, reduce the amount of replicated data with changefeed filters, and consider link-level compression (for example over a VPN tunnel between sites) for the traffic between the clusters.

  3. Storage Management: Monitor storage usage on the target cluster. Implement data retention policies to manage disk space effectively.
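
A quick way to check replication status from the command line (a sketch; the PD endpoint and changefeed ID are placeholders, and the TiCDC Grafana dashboards, if deployed, show the same information graphically):

    # List all changefeeds and their state.
    cdc cli changefeed list --pd="http://${PD_ADDRESS}"

    # Inspect one changefeed, including its checkpoint, to estimate replication lag.
    cdc cli changefeed query --pd="http://${PD_ADDRESS}" --changefeed-id="dr-replication"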

Step 5: Testing and Validation

  1. Test Replication: Before going live, test the replication setup with a subset of data to ensure that everything is working as expected.

  2. Data Consistency Checks: Perform consistency checks to verify that the data in the target cluster matches the source cluster (see the spot-check example after this list).

  3. Failover and Recovery: Plan for failover and recovery scenarios. Ensure that you have backups and a strategy to handle any replication failures.
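
As a quick, illustrative spot check for the consistency step above (host names, port, credentials, and table names are placeholders), you can compare row counts for a few tables on both clusters; for thorough verification, a dedicated comparison tool such as sync-diff-inspector is preferable:

    # Compare row counts on the source and target clusters (COUNT(*) can be slow on very large tables).
    mysql -h source-tidb -P 4000 -u root -p -e "SELECT COUNT(*) FROM db_name.table_name;"
    mysql -h target-tidb -P 4000 -u root -p -e "SELECT COUNT(*) FROM db_name.table_name;"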

By following this plan, you can effectively replicate incremental changes from a TiDB v4.0.14 cluster to a new cluster using TiCDC, while minimizing network and storage impact.

  3. Restoration and Application of Changes:
    Restoring a full backup to a new TiDB cluster and applying incremental changes using TiCDC involves several steps to ensure data integrity and availability. Here’s a detailed process:

Step 1: Prepare the New TiDB Cluster

  1. Cluster Setup: Set up a new TiDB cluster with a configuration that matches or exceeds the original cluster’s specifications. Ensure that the new cluster is compatible with the TiDB version used in the backup.

  2. Network and Security Configuration: Ensure that the new cluster is properly networked and secured. Configure firewalls, TLS/SSL, and other security measures to protect data during the restoration and replication processes.

Step 2: Restore the Full Backup

  1. Verify Backup Integrity: Before starting the restoration, verify the integrity of the backup files using checksums or other validation methods to ensure data consistency.

  2. Initiate Restoration: Use the BR tool to restore the full backup to the new TiDB cluster. The command might look like this:

    br restore full --pd "${NEW_PD_ADDRESS}" --storage "${BACKUP_STORAGE_PATH}"
    

    Replace ${NEW_PD_ADDRESS} with the address of the new PD server and ${BACKUP_STORAGE_PATH} with the path to the backup storage.

  3. Monitor the Restoration Process: Enable detailed logging and set up alerts to monitor the restoration process. This helps in quickly identifying and resolving any issues that arise.

Step 3: Set Up TiCDC for Incremental Changes

  1. Install TiCDC: Deploy TiCDC on a separate node or nodes to avoid resource contention with the new TiDB cluster. Ensure that the TiCDC version is compatible with the TiDB version.

  2. Configure Changefeed: Create a changefeed to capture incremental changes from the original cluster. Use filters to specify which tables or databases to replicate (see the sketch after this list for pinning the start point to the backup timestamp).

  3. Configure Sink URI: Set up the sink URI to point to the new TiDB cluster:

    --sink-uri="mysql://user:password@new-cluster:4000/"
    

    Replace user, password, and new-cluster with the appropriate credentials and address for your new cluster (the default TiDB SQL port is 4000).
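
To make the changefeed pick up exactly where the backup left off, one approach (a sketch, assuming your BR and TiCDC versions support these options; addresses and paths are placeholders) is to read the backup’s end timestamp from the backup metadata and pass it as the changefeed’s --start-ts. This only works if the source cluster’s GC safe point has not advanced past that timestamp, which is why GC handling on the source matters:

    # Read the backup timestamp (TSO) recorded in the backup metadata.
    BACKUP_TS=$(br validate decode --field="end-version" -s "${BACKUP_STORAGE_PATH}" | tail -n1)

    # Create the changefeed on the original (source) cluster, starting from the backup point.
    cdc cli changefeed create \
        --pd="http://${PD_ADDRESS}" \
        --sink-uri="mysql://user:password@new-cluster:4000/" \
        --changefeed-id="dr-replication" \
        --start-ts="${BACKUP_TS}"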

Step 4: Apply Incremental Changes

  1. Start Replication: Begin the replication process using TiCDC. Ensure that the changefeed is correctly configured to capture all necessary changes from the original cluster.

  2. Monitor Replication Lag: Use TiCDC monitoring tools to track replication lag and ensure that changes are being applied in a timely manner.

  3. Optimize Performance: Adjust batch sizes and rate limits as necessary to optimize network usage and minimize impact on both the source and target clusters.

Step 5: Validate Data Integrity and Availability

  1. Data Consistency Checks: Perform consistency checks to verify that the data in the new cluster matches the source cluster. This can involve comparing row counts, checksums, or using application-level validation.

  2. Test Application Connectivity: Ensure that applications can connect to the new cluster and that data access patterns are functioning as expected.

  3. Plan for Failover and Recovery: Establish a failover and recovery plan to handle any potential issues during the transition. Ensure that you have backups and a strategy to address any replication failures.

By following these steps, you can restore a full backup to a new TiDB cluster and apply incremental changes using TiCDC, maintaining data integrity and availability throughout the process.

This strategy ensures minimal impact on production performance and efficient storage utilization.

  1. If you have a support contract, I would suggest opening a support case. If not, I would consider getting one for your production deployment.
  2. v4.0.14 is 4 years old. I would consider updating to a more recent LTS version; see the TiDB Release Support Policy. Newer versions might also have features that help with the issue you mentioned.

This has been our production data for a long time, and changing the version would also affect the applications’ connectivity, so upgrading TiDB is not possible in this case.
Please suggest an alternative approach to set up this replication.

Hi @Pranit_Rahangdale Could you please let us know how long your current backup process takes? If GC is held back for a day, it generally doesn’t affect performance too much. However, if your cluster is already running at its limits, you might consider scaling it up. After scaling, you can try the backup again.

The current backup process takes around a month.
We are attempting to set up replication with TiCDC but are encountering difficulties. Has anyone successfully set up TiCDC replication? If so, could you share the steps you followed? Additionally, are there any workarounds available to configure replication for TiDB version v4.0.14?
Upgrading the version is not an option, and we must continue using version v4.0.14.

Hi, TiCDC was relatively unstable before version 6.5 and may have some bugs. Since your current TiDB version is quite old and you don’t plan to upgrade, I recommend using binlog for replication. Here is the documentation link: TiDB Binlog Overview.
By the way, when you say the backup process takes about a month, are you referring to taking a full backup once a month, or does each backup process take approximately a month to complete?

By backup, I mean a one-time backup that takes about 20-25 days, followed by restoring it on the downstream cluster. Once all the data is restored, we need to begin real-time replication.
We can only re-enable GC after the restoration is completed, and keeping it disabled for that long would cost a lot of space and performance in the production environment.

Hi, the backup is currently taking a long time. Would you be willing to provide the backup logs and monitoring data? This will help us take a look.

Well, we haven’t yet taken a backup of such a huge amount of data in the production environment, since ultimately it would require disabling GC for a successful, consistent replication setup.

The documentation (TiDB Binlog Overview) mentions incompatibilities of various kinds for versions v5.0, v5.1, v4.0.7, etc.
Would this work for TiDB version v4.0.14? Has anyone worked with TiDB Binlog based replication before? If so, could you share the steps, and also the challenges you faced during TiDB replication using Binlog?

Binlog is supported in v4.0.14. If you haven’t started the backup yet, may I kindly ask how you estimated that the backup would take 20-25 days? Also, I would like to remind you that TiDB v4 is no longer under active maintenance.

Is there a documented process for enabling binlog replication in TiDB v4.0.14? If so, could you please provide the steps to follow and the precautions to take?

Hi Pranit, you can take a look at the content in this document. It contains the steps for binlog replication. Although the v4 documentation has been taken down, the steps are the same.


Btw, would you be willing to share your email, location, and company name with us? Since your data volume is large and your version is quite old, it would be best if we could contact our local technical team to review the solution based on your current situation. We can then help you find the best approach.

Email: prahangdale1803@gmail.com
Location: Gurgaon
Won’t be able to provide the organisation details as of now.
Please let me know if any more info required.

Just need one clarification.
Please confirm: do we need to disable GC during binlog-based replication or not?

Hi @Pranit_Rahangdale you don’t need to disable GC; Pump stores the binlog data on its own disks, independent of TiKV GC. You can start the binlog Pump before the BR backup and start the Drainer after the backup is completed.
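
For reference, a rough sketch of what this might look like if the cluster is managed with TiUP (host addresses, the downstream address, credentials, and the cluster name are placeholders, and the topology keys should be checked against the TiUP documentation for your version): scale out Pump before the backup, enable binlog on the TiDB servers, and scale out Drainer afterwards with its start point set to the backup timestamp.

    # 1. Add Pump nodes before taking the BR backup (placeholder hosts).
    cat > pump-scale-out.yaml <<'EOF'
    pump_servers:
      - host: 10.0.1.20
      - host: 10.0.1.21
    EOF
    tiup cluster scale-out <cluster-name> pump-scale-out.yaml

    # 2. Enable binlog on the TiDB servers (binlog.enable: true under server_configs.tidb)
    #    via `tiup cluster edit-config <cluster-name>`, then reload the tidb nodes.

    # 3. After the backup completes and the downstream cluster is restored and ready, add a
    #    Drainer that writes to the new cluster, starting from the backup timestamp
    #    (<backup-ts> is a placeholder for that TSO value).
    cat > drainer-scale-out.yaml <<'EOF'
    drainer_servers:
      - host: 10.0.1.22
        commit_ts: <backup-ts>
        config:
          syncer.db-type: "tidb"
          syncer.to.host: "new-cluster-tidb"
          syncer.to.port: 4000
          syncer.to.user: "root"
          syncer.to.password: ""
    EOF
    tiup cluster scale-out <cluster-name> drainer-scale-out.yaml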
