When DM pulls binlog, it automatically fills in some content, causing the task to hang indefinitely

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: DM拉取binlog时候,会自动填充一些内容,导致任务一直hang住

| username: dba-kit

Version: 6.1.1
Phenomenon:
As shown in the figure below, the task keeps reporting "runtime error: invalid memory address or nil pointer dereference". DM automatically retries after this error, but the same error keeps coming back.

Reason:
The dm-worker log file shows the errors below. Although the task has synchronized to "syncerBinlog": "(mysql-bin|000001.001224, 905172217)", parsing the binlog file fails because DM cannot recognize content that it filled in itself.

[2022/11/03 16:06:15.948 +08:00] [WARN] [syncer.go:2776] ["found error when get sql_mode from binlog status_vars"] [task=fund_task] [unit="binlog replication"] [error="Q_SQL_MODE_CODE not found in status_vars []"]
[2022/11/03 16:06:15.949 +08:00] [INFO] [ddl.go:37] ["parse ddl"] [task=fund_task] [unit="binlog replication"] [event=query] ["query event context"="{schema: , originSQL: # dummy query generated by DM, often used to fill a hole in a binlog file, startLocation: position: (mysql-bin|000001.001224, 394), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20853306,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1494434753,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, currentLocation: position: (mysql-bin|000001.001224, 905172217), gtid-set: , lastLocation: position: (mysql-bin|000001.001224, 905172217), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20853306,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1494434753,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, re-sync: , needHandleDDLs: , trackInfos: }"]
[2022/11/03 16:06:15.951 +08:00] [ERROR] [syncer.go:1769] ["panic log"] [task=fund_task] [unit="binlog replication"] ["error message"="\"invalid memory address or nil pointer dereference\""] [stack="github.com/pingcap/tiflow/dm/syncer.(*Syncer).Run.func5\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:1769\nruntime.gopanic\n\truntime/panic.go:844\nruntime.panicmem\n\truntime/panic.go:220\nruntime.sigpanic\n\truntime/signal_unix.go:818\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Run.func8\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:1900\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Run\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:2338\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Process\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:702\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Resume\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:3742"]
[2022/11/03 16:06:15.951 +08:00] [INFO] [syncer.go:3619] ["flush all jobs"] [task=fund_task] [unit="binlog replication"] ["global checkpoint"="{{{mysql-bin|000001.001224 905172217} 0xc05764a908 0} <nil>}(flushed {{{mysql-bin|000001.001224 905172217} 0xc05764a908 0} <nil>})"] ["flush job seq"=1]
[2022/11/03 16:06:15.952 +08:00] [INFO] [syncer.go:1531] ["received ungraceful exit ctx, exit now"] [task=fund_task] [unit="binlog replication"]
[2022/11/03 16:06:15.953 +08:00] [INFO] [subtask.go:356] ["unit process returned"] [subtask=fund_task] [unit=Sync] [stage=Paused] [status="{\"totalEvents\":6627,\"totalTps\":1656,\"syncerBinlog\":\"(mysql-bin|000001.001224, 905172217)\",\"syncerBinlogGtid\":\"1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20853306,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1494434753,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1\",\"binlogType\":\"local\"}"]
[2022/11/03 16:06:15.953 +08:00] [INFO] [local_reader.go:782] ["binlog reader closing"] [task=fund_task] [unit="binlog replication"] [component="binlog reader"]
[2022/11/03 16:06:15.953 +08:00] [WARN] [local_reader.go:288] ["parse relay finished"] [task=fund_task] [unit="binlog replication"] [component="binlog reader"] [error="parse relay log file mysql-bin.001224 from offset 4 in dir relay_log/dm-172.30.241.143-18262/7e04612e-1c6a-11ed-ac0c-52540048a565.000001: context canceled"]
[2022/11/03 16:06:15.953 +08:00] [INFO] [local_reader.go:788] ["binlog reader closed"] [task=fund_task] [unit="binlog replication"] [component="binlog reader"]
[2022/11/03 16:06:16.031 +08:00] [ERROR] [subtask.go:377] ["unit process error"] [subtask=fund_task] [unit=Sync] ["error information"="ErrCode:36001 ErrClass:\"sync-unit\" ErrScope:\"internal\" ErrLevel:\"high\" Message:\"panic error: runtime error: invalid memory address or nil pointer dereference\" "]
[2022/11/03 16:06:16.031 +08:00] [INFO] [subtask.go:379] [paused] [subtask=fund_task] [unit=Sync]

This should be related to the binlog file format: parsing the binlog file does show that DM filled part of it with zeros.
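
For reference, this is roughly how the zero-filled region can be confirmed. The relay path and offsets below are taken from the log above; the exact commands are only a sketch.

# Relay log file from the dm-worker log above (path is relative to the dm-worker working directory).
RELAY_FILE=relay_log/dm-172.30.241.143-18262/7e04612e-1c6a-11ed-ac0c-52540048a565.000001/mysql-bin.001224

# Dump a few hundred bytes starting at the dummy-query start position (394) to look for runs of zeros.
hexdump -C -s 394 -n 256 "$RELAY_FILE"

# Try to parse the suspicious range; a filled/corrupted region usually makes mysqlbinlog stop with an error.
mysqlbinlog --start-position=394 --stop-position=905172217 "$RELAY_FILE" > /dev/null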

| username: lance6716 | Original post link

Could you please upload the complete log of this worker? If there are multiple log files, please include the one that contains “hole exist from pos1 to pos2”.

| username: dba-kit | Original post link

I found the original binlog in the MySQL backup and used it to replace the binlog file that DM had filled, which worked around the problem for now.
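
For anyone hitting the same issue, the workaround looks roughly like this. It is only a sketch: the task name and relay directory come from the logs above, while the master address and backup path are placeholders, and the replacement file should be verified before resuming.

# 1. Pause the task so nothing reads the relay log while the file is swapped
#    (if relay is still writing to this exact file, stop-relay first).
tiup dmctl --master-addr <dm-master-host>:8261 pause-task fund_task

# 2. Keep the DM-filled file aside and copy in the original binlog with the same
#    name from the MySQL backup (backup path is a placeholder).
cd relay_log/dm-172.30.241.143-18262/7e04612e-1c6a-11ed-ac0c-52540048a565.000001
mv mysql-bin.001224 mysql-bin.001224.filled-by-dm
cp /backup/mysql/binlog/mysql-bin.001224 .

# 3. Sanity-check that the replacement file parses cleanly, then resume the task.
mysqlbinlog mysql-bin.001224 > /dev/null
tiup dmctl --master-addr <dm-master-host>:8261 resume-task fund_task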

| username: dba-kit | Original post link

The entire log file is too large. I have extracted the part of the log near the error “hole exist from pos1 to pos2.” Please check if this is sufficient.
masked.log (429.1 KB)

| username: dba-kit | Original post link

Additionally, does DM have a parameter similar to TiDB that can automatically clean logs? Currently, we manually delete log files periodically. :thinking:
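
What we delete manually is essentially what a cron entry like the one below would cover; the log directory is only an assumption about our tiup deploy path and will differ per deployment.

# Assumed dm-worker log directory; adjust to your own deploy_dir.
# Every day at 03:00, delete rotated dm-worker log files older than 7 days.
0 3 * * * find /dm-deploy/dm-worker-8262/log -name 'dm-worker*.log*' -mtime +7 -delete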

| username: lance6716 | Original post link

It should be enough, thank you for the feedback. Let’s analyze it.

Did you manually switch the relay at 8:56 AM?

| username: lance6716 | Original post link

Currently, there isn’t one. What is the approximate log generation speed, and does it consume a lot of disk space?

| username: dba-kit | Original post link

Yes. Because the task occasionally fails to recognize the relay log and usually needs to be resumed to recover, I executed these steps in order: stop-task, stop-relay, start-relay, start-task.
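
Concretely, the sequence was something like the following with dmctl; the master address, source ID, and worker name are placeholders for our environment.

# Recovery sequence (sketch): stop the task, bounce relay, then start the task again.
tiup dmctl --master-addr <dm-master-host>:8261 stop-task fund_task
tiup dmctl --master-addr <dm-master-host>:8261 stop-relay -s <source-id> <worker-name>
tiup dmctl --master-addr <dm-master-host>:8261 start-relay -s <source-id> <worker-name>
tiup dmctl --master-addr <dm-master-host>:8261 start-task fund_task.yaml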

| username: dba-kit | Original post link

The log level is set to info, but COMMIT is being recognized as DDL, and about 80% of the log lines are of this type:

[2022/10/31 23:56:30.713 +08:00] [INFO] [ddl.go:37] ["parse ddl"] [task=fund_task] [unit="binlog replication"] [event=query] ["query event context"="{schema: , originSQL: COMMIT, startLocation: position: (mysql-bin|000001.001212, 235036967), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20525508,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1489357604,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, currentLocation: position: (mysql-bin|000001.001212, 235037031), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20525508,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1489357604,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, lastLocation: position: (mysql-bin|000001.001212, 235036967), gtid-set: 1e787e9a-ea02-11eb-8239-00163e0eb750:1-136252793,1fd7fe20-ea02-11eb-9db6-00163e02d592:1-20525508,3e773ad8-46c1-11ec-992b-0242a9fe7b02:1-139,777234c5-87b7-11ea-9279-00163e1054e6:1-1489357604,784b7fc9-87b7-11ea-a0b7-00163e068fb5:1-1724587847,7e04612e-1c6a-11ed-ac0c-52540048a565:1, re-sync: , needHandleDDLs: , trackInfos: }"]

There are over ten gigabytes of logs every day.

| username: lance6716 | Original post link

Which upstream database are you referring to? Is it some cloud RDS? And is the storage engine InnoDB?

Typically, COMMIT does not appear in the binlog as a DDL. We can optimize this based on your scenario.

| username: dba-kit | Original post link

DM pulls the binlog from a community-edition MySQL; the original data source is RDS, and this self-built MySQL is a replica of the RDS instance.

| username: lance6716 | Original post link

Strange. Which version is this community-edition MySQL? Are any tables using the MyISAM storage engine?

| username: dba-kit | Original post link

It’s version 5.7, and there are no MyISAM tables.

| username: dba-kit | Original post link

Similar warnings include:

401662:[2022/11/04 10:35:58.448 +08:00] [WARN] [checkpoint_flush_worker.go:112] ["sync flush post-process(afterFlushFn) failed"] [task=qieman_task] [unit="binlog replication"] [error="[code=28005:class=relay-event-lib:scope=internal:level=high], Message: there aren't any data under relay log directory relay-dir., Workaround: Please check relay log using query-status."] [errorVerbose="[code=28005:class=relay-event-lib:scope=internal:level=high], Message: there aren't any data under relay log directory relay-dir., Workaround: Please check relay log using query-status.\ngithub.com/pingcap/tiflow/dm/pkg/terror.(*Error).Generate\n\tgithub.com/pingcap/tiflow/dm/pkg/terror/terror.go:271\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).updateActiveRelayLog\n\tgithub.com/pingcap/tiflow/dm/syncer/relay.go:116\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).afterFlushCheckpoint\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:1260\ngithub.com/pingcap/tiflow/dm/syncer.(*checkpointFlushWorker).Run\n\tgithub.com/pingcap/tiflow/dm/syncer/checkpoint_flush_worker.go:111\ngithub.com/pingcap/tiflow/dm/syncer.(*Syncer).Run.func3\n\tgithub.com/pingcap/tiflow/dm/syncer/syncer.go:1655\nruntime.goexit\n\truntime/asm_amd64.s:1571"]

This causes the task to enter the Paused state. It appears irregularly, and manually executing resume-task generally resolves it. When this error occurs, the task status is:

| username: dba-kit | Original post link

However, the replication delay is basically zero, so I suspect that the task occasionally gets the latest position while the corresponding binlog has not yet been read from the relay log.
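
A rough way to check this suspicion is to compare the upstream write position with what DM reports for the relay at the moment the error appears; the hosts and source ID below are placeholders.

# Current binlog write position on the upstream MySQL.
mysql -h <mysql-host> -u <user> -p -e 'SHOW MASTER STATUS\G'

# Relay and syncer positions reported by DM for the same source.
tiup dmctl --master-addr <dm-master-host>:8261 query-status -s <source-id>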

| username: dba-kit | Original post link

Or could it be related to the change in version 6.1 of which component is responsible for pulling the relay log?

| username: lance6716 | Original post link

This kind of issue requires logs to troubleshoot.

| username: Hacker_xUwtuKxa | Original post link

How do you know which binlog to pull from MySQL? By manually parsing the binlog files? That sounds like a lot of work.

| username: Hacker_xUwtuKxa | Original post link

Additionally, is this bug only present in version 6.1? Does it occur in version 5.4?

| username: dba-kit | Original post link

The DM task error message indicates which binlog has the issue. The relay-log file name is the same as the binlog name on MySQL, so you can copy the file with the same name from the upstream MySQL directly into the corresponding relay-log directory of the dm-worker.
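
In practice it is roughly this (a sketch: host names and the MySQL binlog directory are placeholders, and the relay directory is the one from the earlier logs):

# 1. query-status names the problematic binlog file in its error message.
tiup dmctl --master-addr <dm-master-host>:8261 query-status fund_task

# 2. Copy the file with the same name from the upstream MySQL binlog directory
#    into the dm-worker relay-log directory (relative to the dm-worker working directory).
scp <mysql-host>:/var/lib/mysql/mysql-bin.001224 \
    <dm-worker-host>:relay_log/dm-172.30.241.143-18262/7e04612e-1c6a-11ed-ac0c-52540048a565.000001/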