Issues with DM Migration Errors

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: dm迁移报错问题 (DM migration error issues)

| username: xxxxxxxx

[TiDB Usage Environment] Production environment PoC
[TiDB Version] 4.0.13 and 6.1.7; MySQL version 5.7.21; DM version 2.0.3

Requirement: migrate a MySQL cluster to a TiDB cluster. Migration to 6.1.7 failed with errors, and a retry against 4.0.13 failed as well, both during the full export phase, so the problem is probably unrelated to the TiDB cluster version.

The permissions for the sync user are as follows:

grant SELECT, RELOAD, LOCK TABLES, REPLICATION SLAVE, REPLICATION CLIENT on *.* 
grant SELECT, INSERT, CREATE, LOCK TABLES ON db_name.*

Source configuration is as follows:

cat dm-source-oltp-137-db58_zzlogistics.yaml 

source-id: "source-xxxx-xxxx"  # Data source name
 
# Whether DM-worker uses Global Transaction Identifier (GTID) to pull binlog. The prerequisite is that GTID mode is enabled in the upstream MySQL.
enable-gtid: false
 
from:
  host: "xxxx"  # Upstream data source IP and related information
  user: "xxxx"
  password: "xxxx"
  port: xxxx

Task configuration file is as follows:

# Task name, multiple tasks running simultaneously cannot have the same name.
name: "task-xxxx-xxxx"
#shard-mode: "pessimistic"  # For sharding merge, set to pessimistic mode
# Full + incremental (all) migration mode.
task-mode: "all"
# Downstream TiDB configuration information.
target-database:
  host: "xxxx"
  port: xxxx
  user: "xxxx"
  password: "xxxx"
 
# Configuration of all upstream MySQL instances required for the current data migration task.
mysql-instances:
-
  # Upstream instance or replication group ID, reference.
  source-id: "source-xxxx-xxxx"
  # Name of the block-allow-list configuration item for the databases or tables to be migrated, referencing the global `block-allow-list` configuration below.
  block-allow-list: "global"          # If DM version <= v2.0.0-beta.2, use black-white-list.
  # Configuration item name of the dump processing unit, used to reference the global dump processing unit configuration.
  mydumper-config-name: "global"
  loader-config-name: "global"
  syncer-config-name: "global"
  filter-rules: ["filter-rule-1"]
  route-rules: ["route-rule-1"]     # Routing rules for migrating data source tables to target TiDB tables, multiple rules can be defined
 
# Global block-allow-list configuration, referenced by each instance through configuration item names.
block-allow-list:                     # If DM version <= v2.0.0-beta.2, use black-white-list.
  global:
    do-tables:
    - db-name: "xxxx"
      tbl-name: "*"

filters:                                        # Define rules for filtering specific operations of the data source, multiple rules can be defined
  filter-rule-1:                                # Rule name
    schema-pattern: "xxxx"                       # Match the database name of the data source, supports wildcards "*" and "?"
    table-pattern: "*"                          # Match the table name of the data source, supports wildcards "*" and "?"
    events: ["truncate table","drop table"]    # Operation types of the matched schema-pattern and table-pattern databases or tables
    action: Ignore                              # Migrate (Do) or ignore (Ignore)

routes:                           # Define routing rules for migrating data source tables to target TiDB tables, multiple rules can be defined
  route-rule-1:                   # Rule name
    schema-pattern: "xxxx"      # Match the database name of the data source, supports wildcards "*" and "?"
    target-schema: "xxxx"         # Target TiDB database name

# Global configuration of the dump processing unit, referenced by each instance through configuration item names.
mydumpers:
  global:
      threads: 8                       # Number of threads for the dump processing unit to export data from the upstream database instance, default is 4
      chunk-filesize: 64               # Size of data files generated by the dump processing unit, default is 64, unit is MB
      extra-args: "-B xxxx"   # Other parameters of the dump processing unit. From DM 1.0.2, DM will automatically generate table-list configuration. In earlier versions, manual configuration is still required.

loaders:
  global:
      pool-size: 2

syncers:
  global:
    worker-count: 32
    batch: 100

Problem description: when migrating data with the DM 2.0.3 tool, the task status shows the following error during the full export phase:

{
    "result": true,
    "msg": "",
    "sources": [
        {
            "result": true,
            "msg": "",
            "sourceStatus": {
                "source": "source-xxxx-xxxx",
                "worker": "dm-xx.xx.xx.xx-xxxx",
                "result": null,
                "relayStatus": null
            },
            "subTaskStatus": [
                {
                    "name": "task-xxxx-xxxx",
                    "stage": "Paused",
                    "unit": "Dump",
                    "result": {
                        "isCanceled": false,
                        "errors": [
                            {
                                "ErrCode": 32001,
                                "ErrClass": "dump-unit",
                                "ErrScope": "internal",
                                "ErrLevel": "high",
                                "Message": "mydumper/dumpling runs with error, with output (may empty): ",
                                "RawCause": "sql: SELECT COLUMN_NAME,EXTRA FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA=? AND TABLE_NAME=? ORDER BY ORDINAL_POSITION;: driver: bad connection",
                                "Workaround": ""
                            }
                        ],
                        "detail": null
                    },
                    "unresolvedDDLLockID": "",
                    "dump": {

                    }
                }
            ]
        }
    ]
}

Restarting the task, as the official documentation suggests, does not solve the problem: after a restart the task redoes the full export and hits the same error again, in a loop. The dm-worker.log error is as follows; it shows the task ran for nearly four hours before failing.

[2024/04/11 18:12:09.125 +08:00] [ERROR] [dumpling.go:138] ["dump data exits with error"] [task=task-xxxx-xxxx] [unit=dump] ["cost time"=3h55m25.147905887s] [error="ErrCode:32001 ErrClass:\"dump-unit\" ErrScope:\"internal\" ErrLevel:\"high\" Message:\"mydumper/dumpling runs with error, with output (may empty): \" RawCause:\"sql: SELECT COLUMN_NAME,EXTRA FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA=? AND TABLE_NAME=? ORDER BY ORDINAL_POSITION;: driver: bad connection\" "]
[2024/04/11 18:12:09.125 +08:00] [INFO] [subtask.go:289] ["unit process returned"] [subtask=task-xxxx-xxxx] [unit=Dump] [stage=Paused] [status={}]
[2024/04/11 18:12:09.125 +08:00] [ERROR] [subtask.go:308] ["unit process error"] [subtask=task-xxxx-xxxx] [unit=Dump] ["error information"="{\"ErrCode\":32001,\"ErrClass\":\"dump-unit\",\"ErrScope\":\"internal\",\"ErrLevel\":\"high\",\"Message\":\"mydumper/dumpling runs with error, with output (may empty): \",\"RawCause\":\"sql: SELECT COLUMN_NAME,EXTRA FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA=? AND TABLE_NAME=? ORDER BY ORDINAL_POSITION;: driver: bad connection\"}"]
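For reference, "restarting the task" here means resuming the paused subtask via dmctl; a minimal sketch, assuming a placeholder DM master address (the real address is redacted in this thread):

```shell
# Hedged sketch: inspect and resume the paused task with dmctl.
# <dm-master-host>:8261 is a placeholder, not a value from the thread.
tiup dmctl --master-addr <dm-master-host>:8261 query-status task-xxxx-xxxx
tiup dmctl --master-addr <dm-master-host>:8261 resume-task task-xxxx-xxxx
```

Because the dump unit has no checkpoint at this stage, resume-task restarts the full export from scratch, which is why the error loops.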

It has been confirmed that there is no traffic hitting the MySQL side, no jobs are killing connections, and nothing like pt-kill is running.

Later, a user suggested upgrading the DM cluster. After upgrading DM to v6.1.7, the error persists; the message differs slightly but is essentially the same.

                        "errors": [
                            {
                                "ErrCode": 32001,
                                "ErrClass": "dump-unit",
                                "ErrScope": "internal",
                                "ErrLevel": "high",
                                "Message": "mydumper/dumpling runs with error, with output (may empty): ",
                                "RawCause": "sql: SHOW COLUMNS FROM `db58_zzlogistics`.`logistics_base_0`: driver: bad connection",
                                "Workaround": ""
                            }
                        ],

I also ran a manual select sleep(100000000) on MySQL for the duration of the full export. DM still failed, this time after a bit over an hour (the export is about 1.3 TB; concurrency was 2 on v2.0.3 and 8 on v6.1.7, which cut the run time by roughly three quarters). The sleep query from the earlier test was still alive afterwards, which indirectly suggests the dropped DM connection is probably not being killed by MySQL.

[2024/04/12 16:24:01.300 +08:00] [ERROR] [dumpling.go:152] ["dump data exits with error"] [task=task-xxxx-xxxx] [unit=dump] ["cost time"=1h5m36.89669174s] [error="ErrCode:32001 ErrClass:\"dump-unit\" ErrScope:\"internal\" ErrLevel:\"high\" Message:\"mydumper/dumpling runs with error, with output (may empty): \" RawCause:\"sql: SHOW COLUMNS FROM `dxxxx`.`logistics_base_0`: driver: bad connection\" "]
[2024/04/12 16:24:01.300 +08:00] [INFO] [subtask.go:356] ["unit process returned"] [subtask=task-xxxx-xxxx] [unit=Dump] [stage=Paused] [status="{\"totalTables\":41,\"completedTables\":1,\"finishedBytes\":1314877011547,\"finishedRows\":413061038,\"estimateTotalRows\":350604900}"]
[2024/04/12 16:24:01.886 +08:00] [ERROR] [subtask.go:377] ["unit process error"] [subtask=task-xxxx-xxxx] [unit=Dump] ["error information"="ErrCode:32001 ErrClass:\"dump-unit\" ErrScope:\"internal\" ErrLevel:\"high\" Message:\"mydumper/dumpling runs with error, with output (may empty): \" RawCause:\"sql: SHOW COLUMNS FROM `db58_zzlogistics`.`logistics_base_0`: driver: bad connection\" "]

The SQL shown in the error messages executes normally when logged into MySQL as the sync user. The whole process works fine for a single small table; the problem only appears when migrating the entire database.
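Not stated in the thread, but worth ruling out on the source instance: several standard MySQL 5.7 server variables can silently terminate or drop long-running dump connections. A quick check might look like:

```sql
-- Standard MySQL 5.7 variables that can cut off long-running dump
-- connections; run this on the upstream MySQL as a sanity check.
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('max_execution_time', 'wait_timeout', 'net_write_timeout',
   'net_read_timeout', 'max_allowed_packet');
```

A net_write_timeout that is too small is a common cause of "driver: bad connection" mid-export, since the server writes large result sets to the dump client for long stretches.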

| username: Billmay表妹 | Original post link

You can try upgrading the DM version to the latest one, for example, using DM version 7.5.1.

| username: db_user | Original post link

Check the upstream and downstream for a max_execution_time setting: show variables like '%max_execution_time%'

| username: tidb狂热爱好者 | Original post link

Your error is due to a failure in dumping data. The first step didn’t go through.

| username: WalterWj | Original post link

Try executing it manually on the upstream MySQL.

| username: xxxxxxxx | Original post link

Regarding this parameter: we did not configure it in MySQL. With version 2.0.3 the dump used 2 concurrent threads and failed after 3 hours 55 minutes; with version 6.1.7 it used 8 concurrent threads and failed after 1 hour 5 minutes. This indirectly suggests the connection is not being killed by MySQL.

| username: tidb狂热爱好者 | Original post link

Try using tiup dumpling to dump the data manually and see if it works. I have run into cases before where MySQL could not export data because queries were not using an index.
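A minimal sketch of such a manual dumpling run, mirroring the DM task settings above (-t matches the v6.1.7 concurrency of 8, -F matches chunk-filesize: 64); the host, port, user, and output path are placeholders for the redacted values, not real ones:

```shell
# Hedged sketch: run dumpling directly against the upstream MySQL to see
# whether the export fails outside of DM as well.
tiup dumpling \
  -h <mysql-host> -P <mysql-port> -u <sync-user> -p '<password>' \
  -B db58_zzlogistics \
  -t 8 -F 64MiB \
  -o /data/dumpling-test
```

If the standalone run fails with the same "bad connection" error, the problem sits between dumpling and MySQL rather than in DM itself.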

| username: tidb狂热爱好者 | Original post link

show create table db58_zzlogistics.logistics_base_0

| username: db_user | Original post link

Regarding this, first check the TiDB logs and monitoring to see whether anything restarted.

| username: dba远航 | Original post link

It has already indicated a connection error. Check the IP, port, permissions, etc., in the configuration file.