After enabling relay-log in DM 6.1.1, start-task succeeds but relay-log is not written to disk

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: DM 6.1.1开启relay-log后,start-task成功但是relay-log不落盘

| username: dba-kit

【TiDB Usage Environment】Production
【TiDB Version】6.1.1
【Encountered Problem】
The relay-log of dm-worker occasionally does not persist to disk. Using query-status shows that the task status is normal, and synchronization is also normal.
【Reproduction Path】
Unable to reproduce stably, suspect it is related to multiple task failures and multiple executions of resume-task.
【Problem Phenomenon and Impact】

As you can see, the binlog has not continued to persist since 10-21, with the last binlog being mysql-bin.001048.

Here, the task status is normal and has already synchronized to mysql-bin.001114 (there was a primary key conflict downstream, replication stopped for a while, manually executed resume-task to fix it, this issue occurred multiple times, and I executed resume-task multiple times).

| username: dba-kit | Original post link

To add, when this issue occurs, you can see that the relayStatus is indeed null, meaning the relay-log is not actually started. At this point, if you execute resume-relay, it will prompt you that the relay-log has not been started yet, so you need to execute start-relay first;

However, at this time, if you execute query-status, you will find that the task status starts to fail. You must stop-task first and then start-task to regenerate a task.

| username: lance6716 | Original post link

Could you please send us the logs using the clinic service? We need the logs to troubleshoot the issue.

| username: dba-kit | Original post link

I found the cause of the problem. After further investigation, I discovered that none of the data sources had the relay-log enabled. I remembered that during a previous operation, I mistakenly executed stop-relay & stop-task on all data sources. After that, I only executed start-task and did not execute start-relay, so the relay-log was not being pulled in the first place. :skull_and_crossbones:

| username: dba-kit | Original post link

Additionally, here are two SHELL commands to query abnormal relay data source names:

# 1. Query data sources without relay-log enabled
[tidb@prod-tiup-f01 task]$ tiup dmctl --master-addr query-status --more | jq -r '.sources[] | select(.sourceStatus.relayStatus == null) | .sourceStatus.source'
tiup is checking updates for component dmctl ...
Starting component `dmctl`: /home/tidb/.tiup/components/dmctl/v6.1.1/dmctl/dmctl --master-addr query-status --more

# 2. Query data sources that are in an error state
[tidb@prod-tiup-f01 task]$ tiup dmctl --master-addr query-status --more | jq -r '.sources[] | select(.sourceStatus.relayStatus.stage == "Paused") | {src: .sourceStatus.source, worker: .sourceStatus.worker }'
tiup is checking updates for component dmctl ...
Starting component `dmctl`: /home/tidb/.tiup/components/dmctl/v6.1.1/dmctl/dmctl --master-addr query-status --more

Additionally: If you, like me, have previously enabled relay-log and haven’t enabled it for a long time, directly using start-relay at this point is likely to result in an error:

This is because when the dm-worker start-relay, it will request the binlog from MySQL from the last paused point. If the relay-log has been disconnected for a long time, the binlog is likely to have been cleaned up (actually, we don’t need these binlogs, since the task status is normal, these binlogs have already been applied by the task, just not persisted). At this point, you can use the command 2 I provided above to find the dm-worker corresponding to the data source, then go to the deploy directory of DM, delete the relay_log directory, and clean up the previously left metadata. Then stop-relay & start-relay to restart the relay.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.