Sync_diff Data Validation Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: sync_diff数据校验报错

| username: 胡杨树旁

I encountered an error when trying to verify the data between TiDB and TiDB while executing the command:


Upstream TiDB version: 4.0.8 Downstream TiDB version: 7.1.1

| username: 胡杨树旁 | Original post link

# Diff Configuration.

######################### Global config #########################

# Number of threads to check data, the number of connections to the upstream and downstream databases will be slightly larger than this value
check-thread-count = 4

# If enabled, if there is an inconsistency in the table, SQL statements for fixing will be output.
export-fix-sql = true

# Only compare table structure without comparing data
check-struct-only = false

######################### Datasource config #########################
[data-sources]
[data-sources.mysql1] # mysql1 is a custom id that uniquely identifies this database instance, used in task.source-instances/task.target-instance below
    host = "*****"
    port = 4000
    user = "root"
    password = "****" # Set the password to connect to the upstream database, can be plaintext or Base64 encoded.

    # (Optional) Use mapping rules to match multiple upstream sharded tables, where rule1 and rule2 are defined in the Routes configuration section below
    route-rules = ["rule1"]

[data-sources.tidb0]
    host = "****"
    port = 8000
    user = "****"
    password = "****" # Set the password to connect to the downstream database, can be plaintext or Base64 encoded.

    # (Optional) Use TLS to connect to TiDB
    # security.ca-path = ".../ca.crt"
    # security.cert-path = ".../cert.crt"
    # security.key-path = ".../key.crt"

    # (Optional) Use TiDB's snapshot feature, if enabled, historical data will be used for comparison
    # snapshot = "386902609362944000"
    # When snapshot is set to "auto", use the synchronization time point of TiCDC in the upstream and downstream, refer to <https://github.com/pingcap/tidb-tools/issues/663> for details
    # snapshot = "auto"

########################### Routes ###########################
# If you need to compare data from a large number of tables with different database names or table names, or to verify data from multiple upstream sharded tables with a downstream consolidated table, you can set up mapping relationships through table-rule
# You can configure only the schema or table mapping relationship, or both
#[routes]
#[routes.rule1] # rule1 is a custom id that uniquely identifies this configuration, used in data-sources.route-rules above
#schema-pattern = "ceshi"      # Match the database name of the data source, supports wildcards "*" and "?"
#table-pattern = "insurance_area_face_mode"          # Match the table name of the data source, supports wildcards "*" and "?"
#target-schema = "epp"         # Target database name
#target-table = "insurance_area_face_mode" # Target table name

#[routes.rule2]
#schema-pattern = "test2_*"      # Match the database name of the data source, supports wildcards "*" and "?"
#table-pattern = "t2_*"          # Match the table name of the data source, supports wildcards "*" and "?"
#target-schema = "test2"         # Target database name
#target-table = "t2" # Target table name

######################### Task config #########################
# Configure the tables in the *target database* that need to be compared
[task]
    # output-dir will save the following information
    # 1 sql: SQL files generated after errors are found, and one chunk corresponds to one file
    # 2 log: sync-diff.log saves log information
    # 3 summary: summary.txt saves the summary
    # 4 checkpoint: a dir saves breakpoint resume information
    output-dir = "./output"

    # Upstream database, the content is the unique identifier id declared by data-sources
    source-instances = ["mysql1"]

    # Downstream database, the content is the unique identifier id declared by data-sources
    target-instance = "tidb0"
    # Tables in the downstream database that need to be compared, each table needs to include the database name and table name, separated by `.`
    # Use ? to match any one character; use * to match any; detailed matching rules refer to golang regexp pkg: https://github.com/google/re2/wiki/Syntax
    target-check-tables = ["visadb.vsParamCompany"]
| username: 有猫万事足 | Original post link

route-rules = ["rule1"]

The issue seems to be with this line because the rule1 below has been commented out by you. This should not be associated. Try commenting out this line and see if it works.

| username: 胡杨树旁 | Original post link

Commenting out this line doesn’t work either, it still reports the same error.

| username: 有猫万事足 | Original post link

Your sync_diff version is 4.0, but your sync_diff.toml file content is in the 7.1 format.

The following link is from the sync_diff documentation for version 4.0, and you can see that the configuration file content is written very differently.

######################### Databases config #########################

# Configuration for the source database instance
[[source-db]]
    host = "127.0.0.1"
    port = 3306
    user = "root"
    password = "123456"
    # The id of the source database instance, uniquely identifies a database instance
    instance-id = "source-1"
    # Use TiDB's snapshot feature, if enabled, historical data will be used for comparison
    # snapshot = "2016-10-08 16:45:26"
    # Set the sql-mode of the database, used to parse the table structure
    # sql-mode = ""

# Configuration for the target database instance
[target-db]
    host = "127.0.0.1"
    port = 4000
    user = "root"
    password = "123456"
    # Use TiDB's snapshot feature, if enabled, historical data will be used for comparison
    # snapshot = "2016-10-08 16:45:26"
    # Set the sql-mode of the database, used to parse the table structure
    # sql-mode = ""

So the source-db configuration cannot be recognized at all.

| username: redgame | Original post link

Yes, it is incorrect not to include the configuration file.

| username: 胡杨树旁 | Original post link

Okay, thank you. I didn’t pay attention to this and thought the configuration files were written the same way. I’ll make the changes.

| username: 胡杨树旁 | Original post link

After changing the configuration file, it can be executed, but the table structures of my upstream and downstream are inconsistent, which causes an error. I want to ask, does this tool only verify tables with consistent structures? If the table structures are inconsistent, is there no way to verify them? Currently, the upstream has 12 fields, and the downstream has 21 fields.

| username: longzhuquan | Original post link

Yes, validation essentially involves computing the hash value for the range of columns. Different fields will definitely result in different hash values. However, if I remember correctly, syncdiff has a feature to ignore certain columns through annotations. You can try annotating the extra columns.

| username: 像风一样的男子 | Original post link

There is an “ignore-columns” option in the configuration file that can ignore the comparison of certain columns.

| username: ShawnYan | Original post link

Is there a page where I can check which sync_diff parameters have changed? Besides the Release Notes.

| username: 有猫万事足 | Original post link

I’m not sure. If it weren’t for the error message, I wouldn’t have compared the version parameter changes. I didn’t expect the changes to be so significant.

| username: ShawnYan | Original post link

Moreover, the screenshot shows that this sync diff is 4.0.9, not 4.0.8. The prompt clearly indicates that the configuration item is incorrect. However, it can be confirmed that this is undoubtedly an old user.

| username: 胡杨树旁 | Original post link

Sure, thanks. However, many tables have different numbers of fields, making it a bit difficult to check. In my case, do you have any suggestions for data validation?

| username: longzhuquan | Original post link

No. Unless you only want to verify the count, you can write a count script.

| username: ShawnYan | Original post link

By the way, with such a big version difference, is it to verify the new version or to prepare for an upgrade?

| username: 胡杨树旁 | Original post link

The original business was on version 4.0.8. During the business upgrade, we also applied for new servers, so we deployed the new version. The business wants to switch to the new version.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.