DM has been syncing for two months, and suddenly the master node went down. Reviewing the error messages

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: DM同步了两个月时间,突然master节点down掉了,翻看报错信息

| username: TiDBer_STGGd1J1

The memory, CPU, and disk usage were not high at that time. Can any experts help take a look?

| username: TiDBer_STGGd1J1 | Original post link

The error in dm-master-stderr.log is

| username: TiDBer_STGGd1J1 | Original post link

Errors started occurring at 2 AM on June 16th.

[2024/06/16 02:01:59.890 +08:00] [WARN] [server.go:1079] [“failed to revoke lease”] [component=“embed etcd”] [lease-id=08f58c96d80b963e] [error=“etcdserver: request timed out”]

[2024/06/16 02:03:00.407 +08:00] [WARN] [util.go:121] [“failed to apply request”] [component=“embed etcd”] [took=78.553µs] [request=“header:<ID:645576701095157679 > lease_revoke:id:08f58c96d80b95e6”] [response=size:28] [error=“lease not found”]

[2024/06/16 02:03:02.362 +08:00] [WARN] [util.go:163] [“apply request took too long”] [component=“embed etcd”] [took=1.293770921s] [expected-duration=100ms] [prefix=] [request=“header:<ID:645576701095157767 > lease_grant:<ttl:60-second id:08f58c96d80ba006>”] [response=size:40]

| username: zhaokede | Original post link

When communicating via Remote Procedure Call (RPC), the operation exceeded the set deadline. Network latency, resource bottlenecks, high system load, or too short a configured time limit can cause the operation to not complete within the expected time. Determine if there is network jitter during this period.

| username: TiDBer_STGGd1J1 | Original post link

My subsequent operation was to restart the master node, which restored the system. Currently, I need to investigate the cause of the issue that led to the task synchronization being stalled for a day.

| username: TiDBer_STGGd1J1 | Original post link

I asked someone to check the TiDB network fluctuations, and they said there were no fluctuations.

| username: TiDBer_STGGd1J1 | Original post link

Could it be that operations are not allowed during the backup period of the DM host?

| username: TIDB-Learner | Original post link

I haven’t used DM yet. Here are some suggestions based on other operational experiences: 1. Check if the table structure has changed. 2. Determine if there are large-scale operations causing resource shortages by observing the logs before the issue occurred. 3. Human error due to unintentional operations.

| username: xfworld | Original post link

What busy operations cause PD to be unresponsive?