TiDB DM startup, worker node fails to dial dm-master

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDBit DM启动,worker 节点 fail to dial dm-master

| username: opkcloud

[TiDB Usage Environment] Test
[TiDB Version] V7.4.0
[Reproduction Path] Occurs when initially starting with tiup dm start dm-test
[Encountered Issue: Symptoms and Impact] Worker node fails to dial dm-master, DM cluster startup error
[Resource Configuration] 2 cores, 8GB
[Attachments: Screenshots/Logs/Monitoring]

Log 1:

[2023/10/16 11:53:08.137 +08:00] [ERROR] [join.go:65] ["fail to dial dm-master"] [endpoint=http://43.138.205.213:8261] [error="context deadline exceeded"]
[2023/10/16 11:53:08.137 +08:00] [INFO] [main.go:71] ["join the cluster meet error"] [error="[code=40077:class=dm-worker:scope=internal:level=high], Message: cannot join with master endpoints: [http://43.138.205.213:8261], error: context deadline exceeded, Workaround: Please check network connection of worker and check worker name is unique."] [errorVerbose="[code=40077:class=dm-worker:scope=internal:level=high], Message: cannot join with master endpoints: [http://43.138.205.213:8261], error: context deadline exceeded, Workaround: Please check network connection of worker and check worker name is unique.\ngithub.com/pingcap/tiflow/dm/pkg/terror.(*Error).Generate\n\tgithub.com/pingcap/tiflow/dm/pkg/terror/terror.go:293\ngithub.com/pingcap/tiflow/dm/worker.(*Server).JoinMaster\n\tgithub.com/pingcap/tiflow/dm/worker/join.go:86\nmain.main\n\tgithub.com/pingcap/tiflow/cmd/dm-worker/main.go:69\nruntime.main\n\truntime/proc.go:267\nruntime.goexit\n\truntime/asm_amd64.s:1650"]

Log 2:

| Time | Level | Message | Details |
|---|---|---|---|
|2023-10-16T11:46:17.738+0800|DEBUG|retry error|{error: operation timed out after 2m0s}|
|2023-10-16T11:46:17.738+0800|DEBUG|TaskFinish|{task: StartCluster, error: failed to start dm-worker: failed to start: 43.138.205.213 dm-worker-8265.service, please check the instance's log(/home/tidb/dm/deploy/dm-worker-8265/log) for more detail.: timed out waiting for port 8265 to be started after 2m0s, errorVerbose: timed out waiting for port 8265 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:123\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:157\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650\nfailed to start: 43.138.205.213 dm-worker-8265.service, please check the instance's log(/home/tidb/dm/deploy/dm-worker-8265/log) for more detail.\nfailed to start dm-worker}|
|2023-10-16T11:46:17.738+0800|INFO|Execute command finished|{code: 1, error: failed to start dm-worker: failed to start: 43.138.205.213 dm-worker-8265.service, please check the instance's log(/home/tidb/dm/deploy/dm-worker-8265/log) for more detail.: timed out waiting for port 8265 to be started after 2m0s, errorVerbose: timed out waiting for port 8265 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:123\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:157\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650\nfailed to start: 43.138.205.213 dm-worker-8265.service, please check the instance's log(/home/tidb/dm/deploy/dm-worker-8265/log) for more detail.\nfailed to start dm-worker}|

DM Configuration:

| username: Fly-bird | Original post link

Can the upstream configured in DM connect normally?

| username: tidb菜鸟一只 | Original post link

Check if the master nodes are up by using the command tiup dm display dm-test.

| username: opkcloud | Original post link

The DM cluster reported the error while running tiup dm start dm-test, so it hasn’t reached the step of connecting to the upstream yet.

| username: opkcloud | Original post link

When executing the command tiup dm start dm-test to start the DM cluster, the master started successfully in step 1, but an error occurred when starting the worker nodes in step 2.

Therefore, when executing the command tiup dm display dm-test, the master status is shown as DOWN.

| username: TiDBer_小阿飞 | Original post link

[code=40066:class=dm-worker:scope=internal:level=high] ExecuteDDL timeout, try using query-status to check whether the DDL is still blocking

| username: opkcloud | Original post link

Is this the error reported by your cluster?

| username: opkcloud | Original post link

My error code is 40077.
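
Since the two errors in this thread differ only in their code (40066 vs. 40077), it can help to pull the bracketed header out of a log line before comparing. The following is a minimal sketch, assuming the header always has the `[code=...:class=...:scope=...:level=...]` shape seen in the logs above; `parse_dm_error` is a hypothetical helper name, not part of DM or TiUP.

```python
import re

# Matches the bracketed header seen in the logs in this thread, e.g.
# [code=40077:class=dm-worker:scope=internal:level=high]
# Assumption: the four fields always appear in this order.
HEADER = re.compile(
    r"\[code=(\d+):class=([^:\]]+):scope=([^:\]]+):level=([^:\]]+)\]"
)


def parse_dm_error(text):
    """Return the header fields of the first DM error found in text, or None."""
    match = HEADER.search(text)
    if match is None:
        return None
    code, cls, scope, level = match.groups()
    return {"code": int(code), "class": cls, "scope": scope, "level": level}
```

Running it on the line from Log 1 and on the 40066 message quoted above makes it obvious that they are different errors, even though both come from class dm-worker.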

| username: tidb菜鸟一只 | Original post link

Since the first step succeeded, the master should have started. However, your worker reports that it cannot connect to the master, which means the master might not actually be up. Check the master logs.

| username: 有猫万事足 | Original post link

Did you bind a public IP address to DM?
If so, port 8261 definitely won’t be reachable for you.

| username: opkcloud | Original post link

43.138.205.213 is a public address, and it seems that port 8261 is not open. I’ll give it a try.

| username: 有猫万事足 | Original post link

No, no, no, it can’t be placed on the public network, it’s very unsafe. :joy:

Moreover, public network traffic is charged. If it’s bound to an internal network address, it’s free and safe. Isn’t that wonderful?

| username: opkcloud | Original post link

The issue has been resolved. It was caused by the port not being open on the public network.
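
For anyone hitting the same "fail to dial dm-master ... context deadline exceeded" error, a quick way to confirm this kind of cause is a plain TCP check from the worker host against the master endpoint. The following is a minimal sketch using only the Python standard library; `can_dial` is a hypothetical helper name, and the 3-second default timeout is an arbitrary choice, not a DM setting.

```python
import socket


def can_dial(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network, and timeouts.
        return False
```

With the values from Log 1, that would be `can_dial("43.138.205.213", 8261)` run on the dm-worker machine. A result of False points to a closed port, a firewall or security-group rule, or a master that is not listening on that address. A result of True only shows the port is reachable, not that dm-master itself is healthy.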

| username: tidb菜鸟一只 | Original post link

Awesome, I didn’t expect it to use the public address directly.

| username: TiDBer_小阿飞 | Original post link

Uh… public address… :call_me_hand: :call_me_hand: :call_me_hand:

| username: xingzhenxiang | Original post link

I think solving it without using the public network would be the best.

| username: 舞动梦灵 | Original post link

Flipped.