Data Synchronization and Consistency Check Between Two TiDB Clusters

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 两个TiDB集群之间数据同步,数据一致性检测

| username: liarby

【TiDB Usage Environment】Production\Test Environment\POC
【TiDB Version】v5.4.2
【Encountered Problem】The new cluster was initialized from a backup and restore of the old cluster, and incremental data is replicated through binlog. Right now the synchronization status between the two clusters is unknown, and it is unclear whether the incremental data has been fully caught up.
【Reproduction Path】

  1. Backup data on the old cluster;
  2. Restore data on the new cluster;
  3. Deploy drainer on the old cluster to synchronize incremental data to the new cluster.
【Problem Phenomenon and Impact】How to determine if the incremental data synchronization is complete.
【Attachment】
Below is the data from the new cluster's tidb_binlog.checkpoint:

    mysql> select * from checkpoint;
    +---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
    | clusterID           | checkPoint                                                                                                                                    |
    +---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+
    | 7141199591139448285 | {"consistent":true,"commitTS":435955253777793025,"ts-map":{"master-ts":435899089729617923,"slave-ts":435899515558690818},"schema-version":0} |
    +---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+

| username: Ming | Original post link

Use the tidb_parse_tso() function to convert the TSO into a timestamp.
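For example, taking the commitTS value from the checkpoint posted above (just a sketch; substitute your own checkpoint value):

    -- convert the drainer checkpoint's commitTS (a TiDB TSO) to wall-clock time
    select tidb_parse_tso(435955253777793025) as commit_time;
    -- rough replication lag: how far that commit time is behind the current time
    select timediff(now(), tidb_parse_tso(435955253777793025)) as lag;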

| username: Ming | Original post link

That way you can see how far the downstream cluster's replication lags behind the upstream cluster.

| username: 啦啦啦啦啦 | Original post link

sync-diff-inspector
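Roughly speaking, you point it at both clusters with a config along these lines and run ./sync_diff_inspector --config=./config.toml (a sketch of the v2-style config shipped with the 5.x toolkit; hosts, credentials, and the table filter are placeholders):

    # sync-diff-inspector config sketch
    check-thread-count = 4
    export-fix-sql = true            # write SQL that would fix downstream differences

    [data-sources]
    [data-sources.upstream]
    host = "old-cluster-tidb-host"   # placeholder
    port = 4000
    user = "root"
    password = ""

    [data-sources.downstream]
    host = "new-cluster-tidb-host"   # placeholder
    port = 4000
    user = "root"
    password = ""

    [task]
    output-dir = "./diff-output"
    source-instances = ["upstream"]
    target-instance = "downstream"
    target-check-tables = ["mydb.*"]   # placeholder: schemas/tables to compare

The report in output-dir tells you whether the checked tables match.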

| username: xiaohetao | Original post link

Starting from TiDB 5.0, using TiDB Binlog for replication is no longer recommended; you can use TiCDC instead.
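For reference, creating a TiCDC changefeed from the old cluster to the new one looks roughly like this (a sketch; the PD address, sink URI, and changefeed ID are placeholders, and --start-ts would be the TSO of the full backup if you want to begin from the restore point):

    tiup cdc cli changefeed create \
        --pd=http://<upstream-pd-host>:2379 \
        --sink-uri="mysql://root:<password>@<downstream-tidb-host>:4000/" \
        --changefeed-id="old-to-new-sync"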

| username: liarby | Original post link

Thanks. Querying the time difference between primary and replica replication is a good method, but it has one problem: it only shows that the times line up, and only to the second. Even if the times are consistent, that does not directly prove that the data in the two clusters is consistent.

It would be better if there were something like master-ts = slave-ts, i.e. somewhere that shows which TSO of the master the slave has synchronized up to. But the TSOs of the two clusters are obviously very far apart.
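For what it's worth, the ts-map in the checkpoint is that mapping: master-ts is an upstream TSO and slave-ts is the downstream TSO at which the same replicated state was committed. The raw values can never be equal because each cluster's PD allocates its own timestamps, but both can be converted to wall-clock time (a sketch using the values from the checkpoint above):

    -- parse both sides of the ts-map to see when each cluster committed the mapped point
    select tidb_parse_tso(435899089729617923) as master_time,
           tidb_parse_tso(435899515558690818) as slave_time;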

| username: liarby | Original post link

A correction to your SQL: select timediff(now(),tidb_parse_tso(substr(checkpoint,31,18))) from tidb_binlog.checkpoint;
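This works because the checkpoint JSON starts with the fixed prefix {"consistent":true,"commitTS": (30 characters), so substr(checkpoint,31,18) pulls out the 18-digit commitTS for tidb_parse_tso to convert; note that the offset assumes "consistent":true, and would be off by one if it reads false. A sketch that avoids the hard-coded offsets, assuming the checkpoint column parses cleanly as JSON:

    -- hypothetical alternative: extract commitTS with JSON functions instead of fixed offsets
    select timediff(now(),
                    tidb_parse_tso(cast(json_extract(checkpoint, '$.commitTS') as unsigned)))
      from tidb_binlog.checkpoint;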

| username: liarby | Original post link

I am planning to use CDC. I upgraded from version 4.0.

| username: liarby | Original post link

Thanks, I'm currently using this tool.

| username: Ming | Original post link

Next, use sync-diff-inspector to perform data validation.

| username: Ming | Original post link

Ah, I see it now, it’s fine. :+1:

| username: liarby | Original post link

I think it can be done like this:

  1. Use select timediff(now(),tidb_parse_tso(substr(checkpoint,31,18))) from tidb_binlog.checkpoint; to confirm that the replica has caught up on most of the incremental data (if the database is large, the primary's backup and the replica's restore can take a long time);
  2. Stop writing to the primary database;
  3. Use sync-diff-inspector to perform data consistency checks.

| username: Ming | Original post link

You can compare directly based on the checkpoint value. As long as the data up to that point is consistent, you can run another comparison after a while. There shouldn't be any need to stop writes on the primary, right?

| username: liarby | Original post link

If writes don't stop, new data keeps coming in, and the two clusters will never be exactly consistent.

| username: liarby | Original post link

Oh, I forgot to mention, I want to switch the database after confirming data consistency.

| username: Ming | Original post link

Hmm, let me ask:
What are the upstream and downstream doing respectively? Is the upstream currently handling business?
Is the downstream currently synchronizing without any business?
Do you want to switch the business from upstream to downstream?

| username: liarby | Original post link

Two data centers: the upstream carries the business, and the downstream is a newly built cluster that needs to be initialized by backup and restore and then kept up to date with incremental synchronization.

| username: Ming | Original post link

Data verification takes a certain amount of time. Can you accept stopping business writes for that long? Shouldn't you verify first, and only then stop writes and switch?

| username: liarby | Original post link

Verification requires stopping the business first to ensure that neither side is writing.

| username: Ming | Original post link

You can specify a snapshot and validate the data up to that point, then stop the business and switch databases once the validation passes. That keeps the business downtime very short. The trade-off is that anything written after the snapshot is not covered by the validation, so you won't know whether that part is consistent.
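Concretely, the ts-map in the drainer checkpoint gives a matched pair of snapshots for this: set the upstream data source's snapshot to master-ts and the downstream's to slave-ts in the sync-diff-inspector config, so both sides are compared at the same logical point (a sketch using the values from the checkpoint earlier in this thread; hosts and credentials are placeholders, and both snapshots must still be within GC life time):

    [data-sources.upstream]
    host = "old-cluster-tidb-host"
    port = 4000
    user = "root"
    password = ""
    snapshot = "435899089729617923"   # master-ts from the checkpoint's ts-map

    [data-sources.downstream]
    host = "new-cluster-tidb-host"
    port = 4000
    user = "root"
    password = ""
    snapshot = "435899515558690818"   # slave-ts from the checkpoint's ts-map

Anything committed upstream after master-ts falls outside the comparison, which is exactly the trade-off described above.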