DM Data Synchronization Stopped, New Task Creation Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: dm数据同步停止,新建任务异常

| username: Hacker007

New DM task synchronization exception:
"msg": "[code=38032:class=dm-master:scope=internal:level=high], Message: some error occurs in dm-worker: ErrCode:10005 ErrClass:\"database\" ErrScope:\"downstream\" ErrLevel:\"high\" Message:\"fail to initial unit Sync of subtask task_oss_merge_incremental_0304 : query statement failed: SELECT cp_schema, cp_table, binlog_name, binlog_pos, binlog_gtid, exit_safe_binlog_name, exit_safe_binlog_pos, exit_safe_binlog_gtid, table_info, is_global FROM task_oss_merge_incremental_0304.task_oss_merge_incremental_0304_syncer_checkpoint WHERE id = ?\" RawCause:\"Error 9005: Region is unavailable\", Workaround: Please execute query-status to check status.",
"source": "source_merge_154",

TiKV exception: there are many exceptions like the following:
[2024/03/03 04:40:22.586 +08:00] [ERROR] [peer.rs:3613] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 5804039 store_id: 2"] [peer_id=5804038] [region_id=5804037] [type=MsgHibernateResponse]
[2024/03/03 04:40:22.586 +08:00] [ERROR] [peer.rs:3613] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 5563775 store_id: 2"] [peer_id=5563774] [region_id=5563773] [type=MsgHibernateResponse]
[2024/03/03 04:40:22.586 +08:00] [ERROR] [peer.rs:3613] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 5167111 store_id: 2"] [peer_id=5167110] [region_id=5167109] [type=MsgHibernateResponse]
[2024/03/03 04:40:22.586 +08:00] [ERROR] [peer.rs:3613] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 5653847 store_id: 2"] [peer_id=5653846] [region_id=5653845] [type=MsgHibernateResponse]

Is this a TiKV Region problem? How can it be solved? I scaled out by adding a machine, but it doesn't seem to help.
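The Workaround in the error text points to query-status. For reference, a minimal sketch of checking the task and source from dmctl; the master address below is a placeholder, and only the task name and source-id come from the error above:

```shell
# Placeholder master address; the task name is taken from the error message.
tiup dmctl --master-addr 127.0.0.1:8261 query-status task_oss_merge_incremental_0304

# List the upstream sources (source_merge_154 is the one named in the error).
tiup dmctl --master-addr 127.0.0.1:8261 operate-source show
```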

| username: Jasper | Original post link

Are all the TiKV nodes currently in a normal state?

| username: Hacker007 | Original post link

All component nodes in the cluster are in a normal state.

| username: Jasper | Original post link

I see the error is Transport(Full). Could it be that TiKV is responding slowly? Check the Grafana monitoring to see whether TiKV's overall response time is normal.

| username: tidb菜鸟一只 | Original post link

Is your cluster okay? Try manually executing the following query: SELECT cp_schema, cp_table, binlog_name, binlog_pos, binlog_gtid, exit_safe_binlog_name, exit_safe_binlog_pos, exit_safe_binlog_gtid, table_info, is_global FROM task_oss_merge_incremental_0304.task_oss_merge_incremental_0304_syncer_checkpoint WHERE id = ?
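For reference, a runnable variant of that checkpoint query: DM binds the id parameter in a prepared statement, so the sketch below simply scans the whole (tiny) checkpoint table instead. If the Region problem persists, this reproduces the same 9005 error.

```sql
-- Manual check of the DM syncer checkpoint table named in the error;
-- the "WHERE id = ?" placeholder is dropped in favor of a full scan.
SELECT cp_schema, cp_table, binlog_name, binlog_pos, binlog_gtid,
       exit_safe_binlog_name, exit_safe_binlog_pos, exit_safe_binlog_gtid,
       table_info, is_global
FROM task_oss_merge_incremental_0304.task_oss_merge_incremental_0304_syncer_checkpoint;
```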

| username: Hacker007 | Original post link

The error is Region is unavailable. The cluster is abnormal, with a large number of write timeouts.

| username: Hacker007 | Original post link

It should be an issue with TiKV, data cannot be written.

| username: Hacker007 | Original post link

It is abnormal. Some queries have issues, likely due to problems with certain Regions.

| username: Jasper | Original post link

Is it possible that a large SQL query has overwhelmed the KV store? Check if the CPU and IO are fully utilized.

| username: tidb菜鸟一只 | Original post link

Then your cluster is under too much pressure, and the data can’t be queried. DM must have reported an error. Check the logs and monitoring to see why the cluster is under such heavy pressure…
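One way to look for heavy statements around the incident window (the time range below is taken from the TiKV log timestamps earlier in the thread); this assumes TiDB v4.0 or later, where CLUSTER_SLOW_QUERY is available, so treat it as a sketch:

```sql
-- Top slow statements cluster-wide around the 2024-03-03 04:40 errors.
SELECT time, instance, query_time, LEFT(query, 100) AS query_head
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2024-03-03 04:00:00' AND '2024-03-03 05:00:00'
ORDER BY query_time DESC
LIMIT 20;
```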

| username: Hacker007 | Original post link

Looking at the heatmap, the current read and write pressure is not high, and cluster resources are ample. I have now stopped all DM tasks.

| username: Hacker007 | Original post link

Resources are plentiful.

| username: Hacker007 | Original post link

The old tables and old data can be queried normally, but queries against the new table and new data fail with the error [Err] 9005 - Region is unavailable.
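If only the newly created table fails with 9005, its Regions can be inspected directly. A sketch with placeholder names, since the thread does not name the new table:

```sql
-- new_db.new_table is a placeholder; replace it with the failing table.
-- Lists the Regions backing the table with their leader and peer stores,
-- which helps spot Regions whose peers sit on an unhealthy store.
SHOW TABLE new_db.new_table REGIONS;
```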

| username: tidb菜鸟一只 | Original post link

Is it still reporting an error now? Take another look.

| username: Jasper | Original post link

You can check the Region health panel under PD in the Grafana monitoring to see whether there are any abnormal Regions.
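Besides the Grafana PD page, pd-ctl can list unhealthy Regions directly. A sketch with a placeholder cluster version and PD endpoint:

```shell
# <version> and the PD endpoint are placeholders for this cluster.
tiup ctl:<version> pd -u http://127.0.0.1:2379 region check down-peer
tiup ctl:<version> pd -u http://127.0.0.1:2379 region check pending-peer
tiup ctl:<version> pd -u http://127.0.0.1:2379 region check miss-peer
```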

| username: Hacker007 | Original post link

After waiting for a while, it worked. The cause was a table containing large fields; after filtering it out and resynchronizing, everything worked. I don't know why this task affected the entire cluster without itself reporting an error.
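For anyone hitting the same thing: filtering a table out of a DM task is typically done with a block-allow-list rule in the task file. A fragment sketch; only the task name and source-id come from this thread, the schema and table names are placeholders, and other required sections (target-database, etc.) are omitted:

```yaml
name: task_oss_merge_incremental_0304
task-mode: incremental

block-allow-list:
  skip-large-field-table:
    ignore-tables:
      - db-name: "app_db"          # placeholder schema
        tbl-name: "big_field_tbl"  # placeholder table with large fields

mysql-instances:
  - source-id: "source_merge_154"
    block-allow-list: "skip-large-field-table"
```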

| username: Hacker007 | Original post link

Okay, now we wait for the data to synchronize.

| username: dba远航 | Original post link

KV:Raftstore:Transport: it seems there is an anomaly in TiKV's Raft log application.

| username: redgame | Original post link

It is generally an error in the DM-worker, which may be caused by an exception in the downstream database. Please check.

| username: kelvin | Original post link

Is the cluster functioning properly?