DM Data Synchronization Stopped, New Task Creation Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: dm数据同步停止,新建任务异常

| username: Hacker007

New DM task synchronization exception:
"msg": "[code=38032:class=dm-master:scope=internal:level=high], Message: some error occurs in dm-worker: ErrCode:10005 ErrClass:\"database\" ErrScope:\"downstream\" ErrLevel:\"high\" Message:\"fail to initial unit Sync of subtask task_oss_merge_incremental_0304 : query statement failed: SELECT cp_schema, cp_table, binlog_name, binlog_pos, binlog_gtid, exit_safe_binlog_name, exit_safe_binlog_pos, exit_safe_binlog_gtid, table_info, is_global FROM task_oss_merge_incremental_0304.task_oss_merge_incremental_0304_syncer_checkpoint WHERE id = ?\" RawCause:\"Error 9005: Region is unavailable\", Workaround: Please execute query-status to check status.",
"source": "source_merge_154",

TiKV exception: there are many exceptions like the following:
[2024/03/03 04:40:22.586 +08:00] [ERROR] [peer.rs:3613] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 5804039 store_id: 2"] [peer_id=5804038] [region_id=5804037] [type=MsgHibernateResponse]
[2024/03/03 04:40:22.586 +08:00] [ERROR] [peer.rs:3613] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 5563775 store_id: 2"] [peer_id=5563774] [region_id=5563773] [type=MsgHibernateResponse]
[2024/03/03 04:40:22.586 +08:00] [ERROR] [peer.rs:3613] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 5167111 store_id: 2"] [peer_id=5167110] [region_id=5167109] [type=MsgHibernateResponse]
[2024/03/03 04:40:22.586 +08:00] [ERROR] [peer.rs:3613] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 5653847 store_id: 2"] [peer_id=5653846] [region_id=5653845] [type=MsgHibernateResponse]

Is this a TiKV Region problem? How can it be solved? I scaled out by adding a machine, but it doesn't seem to help.
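The Workaround in the error text points to query-status. For reference, a minimal sketch of checking the task and source from dmctl; the master address below is a placeholder, and only the task name and source-id come from the error above:

```shell
# Placeholder master address; the task name is taken from the error message.
tiup dmctl --master-addr 127.0.0.1:8261 query-status task_oss_merge_incremental_0304

# List the upstream sources (source_merge_154 is the one named in the error).
tiup dmctl --master-addr 127.0.0.1:8261 operate-source show
```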

| username: Jasper | Original post link

Are all the TiKV nodes currently in a normal state?

| username: Hacker007 | Original post link

All component nodes in the cluster are in a normal state.

| username: Jasper | Original post link

I see the error is Transport(Full). Could it be that TiKV is responding slowly? Check the Grafana monitoring to see whether TiKV's overall response time is normal.

| username: tidb菜鸟一只 | Original post link

Is your cluster okay? Try manually executing the following query: SELECT cp_schema, cp_table, binlog_name, binlog_pos, binlog_gtid, exit_safe_binlog_name, exit_safe_binlog_pos, exit_safe_binlog_gtid, table_info, is_global FROM task_oss_merge_incremental_0304.task_oss_merge_incremental_0304_syncer_checkpoint WHERE id = ?
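For reference, a runnable variant of that checkpoint query: DM binds the id parameter in a prepared statement, so the sketch below simply scans the whole (tiny) checkpoint table instead. If the Region problem persists, this reproduces the same 9005 error.

```sql
-- Manual check of the DM syncer checkpoint table named in the error;
-- the "WHERE id = ?" placeholder is dropped in favor of a full scan.
SELECT cp_schema, cp_table, binlog_name, binlog_pos, binlog_gtid,
       exit_safe_binlog_name, exit_safe_binlog_pos, exit_safe_binlog_gtid,
       table_info, is_global
FROM task_oss_merge_incremental_0304.task_oss_merge_incremental_0304_syncer_checkpoint;
```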

| username: Hacker007 | Original post link

The error is Region is unavailable. The cluster is abnormal, with a large number of write timeouts.

| username: Hacker007 | Original post link

It should be an issue with TiKV, data cannot be written.

| username: Hacker007 | Original post link

It is abnormal. Some queries have issues, likely due to problems with certain Regions.

| username: Jasper | Original post link

Is it possible that a large SQL query has overwhelmed the KV store? Check if the CPU and IO are fully utilized.

| username: tidb菜鸟一只 | Original post link

Then your cluster is under too much pressure, and the data can’t be queried. DM must have reported an error. Check the logs and monitoring to see why the cluster is under such heavy pressure…
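One way to look for heavy statements around the incident window (the time range below is taken from the TiKV log timestamps earlier in the thread); this assumes TiDB v4.0 or later, where CLUSTER_SLOW_QUERY is available, so treat it as a sketch:

```sql
-- Top slow statements cluster-wide around the 2024-03-03 04:40 errors.
SELECT time, instance, query_time, LEFT(query, 100) AS query_head
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2024-03-03 04:00:00' AND '2024-03-03 05:00:00'
ORDER BY query_time DESC
LIMIT 20;
```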

| username: Hacker007 | Original post link

Looking at the heatmap, the current read and write pressure is not high, and cluster resources are ample. I have now stopped all DM tasks.

| username: Hacker007 | Original post link

Resources are plentiful.

| username: Hacker007 | Original post link

The old tables and old data can be queried normally, but queries against the new table and new data fail with the error [Err] 9005 - Region is unavailable.
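If only the newly created table fails with 9005, its Regions can be inspected directly. A sketch with placeholder names, since the thread does not name the new table:

```sql
-- new_db.new_table is a placeholder; replace it with the failing table.
-- Lists the Regions backing the table with their leader and peer stores,
-- which helps spot Regions whose peers sit on an unhealthy store.
SHOW TABLE new_db.new_table REGIONS;
```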

| username: tidb菜鸟一只 | Original post link

Is it still reporting an error now? Take another look.

| username: Jasper | Original post link

You can check the Region health panel under PD in the Grafana monitoring to see whether there are any abnormal Regions.
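Besides the Grafana PD page, pd-ctl can list unhealthy Regions directly. A sketch with a placeholder cluster version and PD endpoint:

```shell
# <version> and the PD endpoint are placeholders for this cluster.
tiup ctl:<version> pd -u http://127.0.0.1:2379 region check down-peer
tiup ctl:<version> pd -u http://127.0.0.1:2379 region check pending-peer
tiup ctl:<version> pd -u http://127.0.0.1:2379 region check miss-peer
```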

| username: Hacker007 | Original post link

After waiting for a while, it worked. The cause was a table containing large fields; after filtering it out and resynchronizing, everything worked. I don't know why this task affected the entire cluster without itself reporting an error.
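For anyone hitting the same thing: filtering a table out of a DM task is typically done with a block-allow-list rule in the task file. A fragment sketch; only the task name and source-id come from this thread, the schema and table names are placeholders, and other required sections (target-database, etc.) are omitted:

```yaml
name: task_oss_merge_incremental_0304
task-mode: incremental

block-allow-list:
  skip-large-field-table:
    ignore-tables:
      - db-name: "app_db"          # placeholder schema
        tbl-name: "big_field_tbl"  # placeholder table with large fields

mysql-instances:
  - source-id: "source_merge_154"
    block-allow-list: "skip-large-field-table"
```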

| username: Hacker007 | Original post link

Okay, now we wait for the data to synchronize.

| username: dba远航 | Original post link

KV:Raftstore:Transport: it seems there is an anomaly in TiKV's Raft log application.

| username: redgame | Original post link

It is generally an error in the DM-worker, which may be caused by an exception in the downstream database. Please check.

| username: kelvin | Original post link

Is the cluster functioning properly?