A worker in DM cannot be restarted

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: dm中的一个worker重新启动不了

| username: Jjjjayson_zeng



One worker is in a half-dead (zombie) state.

The logs show no errors, and `top` on the server shows that the CPU is not saturated.

| username: WalterWj | Original post link

I don’t think the worker failed to start. It just has no tasks running on it, right?

| username: Jjjjayson_zeng | Original post link

Of course there is a task. The task reports that it can’t find the relevant worker; otherwise, why would I call it a zombie state?

| username: 裤衩儿飞上天 | Original post link

Does the task report any related errors?

| username: 考试没答案 | Original post link

Is it working now? “Free” means that no data source is bound (allocated) to the worker.

| username: 考试没答案 | Original post link

How many workers do you currently have? How many workers are currently in a free state, and how many workers are in a bound state?

| username: jackerzhou | Original post link

Worker nodes have four states: Offline; Free (online but not bound to any upstream data source); Bound (online and bound to an upstream data source); and Relay (online and pulling relay logs). The “Free” state you are seeing means that no source database instance is bound to that worker.
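
To answer the question of how many workers are in each state, the output of `dmctl list-member --worker` can be tallied. The following is only a rough sketch: it assumes the command prints JSON shaped as `members` → `worker` → `workers`, with a `stage` field per worker, which may differ between DM versions. The worker names, addresses, and source ID in the sample are made up for illustration.

```python
import json
from collections import Counter

# Stage names as listed in the reply above (lowercased).
KNOWN_STAGES = ("offline", "free", "bound", "relay")


def count_worker_stages(list_member_json: str) -> Counter:
    """Count workers per stage in JSON text assumed to come from
    `dmctl list-member --worker` (layout is an assumption)."""
    data = json.loads(list_member_json)
    counts = Counter()
    for member in data.get("members", []):
        worker_block = member.get("worker")
        if not worker_block:
            continue  # skip non-worker member entries
        for worker in worker_block.get("workers", []):
            counts[worker.get("stage", "unknown").lower()] += 1
    return counts


# Hypothetical sample output; names and addresses are illustrative only.
SAMPLE = """
{
  "result": true,
  "msg": "",
  "members": [
    {
      "worker": {
        "msg": "",
        "workers": [
          {"name": "worker1", "addr": "10.0.0.1:8262", "stage": "bound", "source": "mysql-01"},
          {"name": "worker2", "addr": "10.0.0.2:8262", "stage": "free", "source": ""},
          {"name": "worker3", "addr": "10.0.0.3:8262", "stage": "offline", "source": ""}
        ]
      }
    }
  ]
}
"""

counts = count_worker_stages(SAMPLE)
for stage in KNOWN_STAGES:
    print(f"{stage}: {counts[stage]}")
```

If the tally shows a source-bearing task but no worker in the bound state for that source, that matches the symptom described in this thread.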

| username: Jjjjayson_zeng | Original post link

It’s still not fixed. I created a new worker and bound the source to the new one.

| username: Jjjjayson_zeng | Original post link

Otherwise, the task cannot be stopped.

| username: 考试没答案 | Original post link

The relevant worker is in a zombie state. I remember reading that you must keep a few free workers; otherwise, failover cannot complete when a worker fails. I forget where I saw it. Creating a new worker should solve it.