Ensuring Data Consistency Between TiDB Disaster Recovery Clusters

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB 灾备集群之间的数据一致性保证

| username: residentevil

[TiDB Usage Environment] Production Environment
[TiDB Version] V6.1.7
[Encountered Problem: Problem Phenomenon and Impact] Although TiDB provides various high-availability capabilities at the overall architecture level (multiple replicas via the Raft protocol, etc.), it still cannot rule out a cluster-level disaster such as a data center failure. We therefore want an architecture similar to MySQL's primary-secondary setup: if the primary region fails, the cluster in the secondary region can continue to serve online traffic. The solution we are currently considering is TiCDC for data replication between the primary and secondary clusters. But how can we ensure data consistency between the primary and secondary clusters? Does TiCDC have the capability to periodically verify data?

| username: Fly-bird | Original post link

Consider a dual-data-center solution in the same city.

| username: tidb菜鸟一只 | Original post link

For data verification you can use sync-diff-inspector, and TiCDC also has corresponding monitoring you can check. For this kind of requirement, dual centers in the same city are indeed recommended; replication through CDC will always have some delay.
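
As a rough sketch, a sync-diff-inspector comparison between the two clusters could look like the following; the host names, user, and table filter are placeholders, and the config layout follows the sync-diff-inspector v2 format shipped with the v6.x toolkit:

```bash
# Minimal sync-diff-inspector run; all hosts, credentials and the table filter are placeholders.
cat > sync_diff.toml <<'EOF'
# number of threads used for chunk checksum comparison
check-thread-count = 4
# emit SQL statements that would fix the downstream if differences are found
export-fix-sql = true

[data-sources.primary]
host = "tidb-primary.example.com"
port = 4000
user = "diff_user"
password = ""

[data-sources.secondary]
host = "tidb-secondary.example.com"
port = 4000
user = "diff_user"
password = ""

[task]
output-dir = "./diff_output"
source-instances = ["primary"]
target-instance = "secondary"
# compare every table in the app_db schema (placeholder filter)
target-check-tables = ["app_db.*"]
EOF

# The comparison report and any fix SQL are written to ./diff_output
./sync_diff_inspector --config=./sync_diff.toml
```

Because the changefeed lags behind the upstream, comparing a busy primary against the secondary will report spurious differences; the check is only meaningful in a low-write window or against consistent snapshots of both clusters.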

| username: 啦啦啦啦啦 | Original post link

For remote disaster recovery you need to consider both RTO and RPO. TiCDC replication always has some latency, so in a real disaster there will be some data loss. Periodic data verification is not very practical unless you run sync-diff-inspector manually; even MySQL primary-secondary replication cannot guarantee consistency forever. You can verify once after the initial synchronization and then set the secondary cluster to read-only to prevent inconsistencies caused by manual writes.
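
For the read-only step, a minimal sketch (the host is a placeholder):

```bash
# Put the standby cluster into read-only mode so stray manual writes cannot cause divergence.
# tidb_restricted_read_only exists since TiDB v5.2, so it applies to v6.1.7; verify first that
# the TiCDC sink user can still write (the docs describe the RESTRICTED_REPLICA_WRITER_ADMIN
# privilege for exactly this case).
mysql -h tidb-secondary.example.com -P 4000 -u root -p \
  -e "SET GLOBAL tidb_restricted_read_only = ON;"
```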

| username: xfworld | Original post link

TiCDC does not have the capability to periodically verify data…

It is only responsible for handling data change (replication) events, and it is constrained by TSO and GC, which both run on their own cycles. You therefore need to make sure the TiCDC cluster stays alive within those cycles and can process the data change events in time…

If data verification is required, you also need to design the process and method for handling it during a business failover; the tooling itself is a separate layer of implementation…
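
For monitoring that cycle, the changefeed checkpoint can be polled with the cdc command-line client; a sketch using the v6.1-era syntax (the PD address and changefeed ID are placeholders):

```bash
# List all changefeeds on the primary cluster and their state.
tiup cdc cli changefeed list --pd=http://pd-primary.example.com:2379

# Inspect one changefeed; checkpoint-ts / resolved-ts show how far replication has advanced.
tiup cdc cli changefeed query \
  --pd=http://pd-primary.example.com:2379 \
  --changefeed-id=dr-changefeed
```

TiCDC also registers a service GC safepoint on the upstream cluster (held for gc-ttl, 24 hours by default), so a changefeed that stays interrupted longer than that window may no longer be resumable.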

| username: 像风一样的男子 | Original post link

If you need disaster recovery with no data loss, the only option is the two-site three-center solution, which relies on synchronous replication of the TiKV (KV-level) replicas to guarantee data consistency. Any primary-secondary setup will lose some data, and if you are unlucky and a large transaction gets stuck, the amount of lost data can be significant. The two-site three-center solution, however, has very high network requirements and is costly.
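
For reference, the consistency guarantee in that architecture comes from Raft replica placement rather than from any replication tool; a rough sketch of the PD side (addresses and label names are placeholders):

```bash
# Two-site three-center sketch: five Raft replicas spread across the centers survive the loss
# of one data center without losing committed writes. Matching server.labels must also be set
# on every TiKV instance in the tiup topology.
tiup ctl:v6.1.7 pd -u http://pd-primary.example.com:2379 config set max-replicas 5
tiup ctl:v6.1.7 pd -u http://pd-primary.example.com:2379 config set location-labels "zone,dc,rack,host"
```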

| username: residentevil | Original post link

We are sensitive to time consumption (latency).

| username: residentevil | Original post link

I have also looked into the two-site three-center solution, but it is not a good fit for us. A primary-secondary solution is still more appropriate.

| username: residentevil | Original post link

The primary-secondary delay is acceptable since we are not after strong (synchronous) consistency. The main concern is that TiCDC might introduce data inconsistency issues, especially around DDL operations.

| username: 像风一样的男子 | Original post link

Are you planning to implement it for real or is it just a proposal in a PowerPoint presentation?

| username: residentevil | Original post link

It must be truly implemented, haha.

| username: 像风一样的男子 | Original post link

Using CDC for primary-secondary replication is a mature solution that won't lose data. In addition, I have implemented a local binlog-style logging solution to get real-time incremental backups plus scheduled full backups.

| username: tidb菜鸟一只 | Original post link

Generally the main issue is latency. As long as both clusters run the same version, DDL operations will not cause inconsistencies.

| username: zhanggame1 | Original post link

| username: residentevil | Original post link

Excellent

| username: residentevil | Original post link

Do you have deployment documentation? For primary and standby clusters synchronized through CDC, my understanding is that the standby cluster can be restored from a backup of the primary, and then CDC replication can be started from the corresponding point in time. Alternatively, is it possible to create a new, empty standby cluster and have CDC do a full + incremental migration directly?
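
A rough sketch of what the first approach (restore from a backup, then replicate from that point in time) could look like; all addresses, the S3 path and the changefeed ID are placeholders, and the flags follow the v6.1-era BR and cdc cli syntax:

```bash
# 1. Full backup of the primary cluster with BR; note the BackupTS reported by BR, and make
#    sure GC on the primary keeps data long enough (e.g. temporarily raise tidb_gc_life_time)
#    so that this TS is still usable in step 3.
tiup br backup full \
  --pd "pd-primary.example.com:2379" \
  --storage "s3://dr-bucket/full-backup"

# 2. Restore the backup into the (empty) standby cluster.
tiup br restore full \
  --pd "pd-secondary.example.com:2379" \
  --storage "s3://dr-bucket/full-backup"

# 3. Start incremental replication from exactly the backup timestamp, so that nothing between
#    the backup and the changefeed creation is skipped.
tiup cdc cli changefeed create \
  --pd=http://pd-primary.example.com:2379 \
  --sink-uri="mysql://cdc_user:cdc_password@tidb-secondary.example.com:4000/" \
  --changefeed-id="dr-changefeed" \
  --start-ts=<BackupTS from step 1>
```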

| username: 像风一样的男子 | Original post link

You can set up a test yourself and you’ll know.

| username: residentevil | Original post link

I’ll take a look, haha, thank you.

| username: Soysauce520 | Original post link

You can check the binlog synchronization in version 6.1.7.

| username: zhanggame1 | Original post link

It should not work.