TiCDC Error Reporting

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc异常报错

| username: liyuntang

【TiDB Usage Environment】Production Environment / Testing / PoC
Testing environment deployed on Kubernetes
【TiDB Version】
5.4.1
【Reproduction Path】
Synchronize incremental data from TiDB to MySQL
【Encountered Problem: Problem Phenomenon and Impact】
Because this TiDB cluster has a very large number of tables (280,000), TiCDC has remained stuck since replication was enabled, and no other operations can be performed.

Synchronization Task:
curl -X POST -H "Content-Type: application/json" http://192.160.139.56:8301/api/v1/changefeeds -d '{"force_replicate": true, "start_ts": 439384345192955912, "changefeed_id": "wps-mysql", "sink_uri": "mysql://admin:dsflafvas8d@x.x.x.x:x.x.x./?time-zone=&worker-count=16&max-txn-row=256"}'

Query Synchronization Information:
[root@vm10-32-1-164 ticdc]# curl -X GET http://192.160.139.56:8301/api/v1/changefeeds?state=all
{
  "error_msg": "context deadline exceeded",
  "error_code": ""
}
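
When listing every changefeed times out like this, the capture's own endpoints may still respond; a quick liveness check against the same server (a sketch, assuming the v1 open API on the same host and port):

# Liveness of the TiCDC server itself, independent of changefeed state
curl -X GET http://192.160.139.56:8301/api/v1/health
# Basic server status: version, capture id, and whether this capture is the owner
curl -X GET http://192.160.139.56:8301/api/v1/status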

【Resource Configuration】
TiDB, PD, TiKV, TiCDC: 3 replicas each, 8 cores, 32 GB memory
【Attachments: Screenshots/Logs/Monitoring】
TiCDC Logs:
[2023/02/13 13:08:36.060 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2152242] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.071 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2476130] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.083 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2342500] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.095 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2446730] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.107 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2290403] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.113 +08:00] [WARN] [client.go:272] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=42.000288667s] [role=owner]
[2023/02/13 13:08:36.119 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2161778] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.131 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2505686] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.142 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2138490] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.154 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2245614] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.166 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2064686] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.178 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2543234] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.189 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2090784] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.201 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2034058] [boundaryTs=439384345192955912]
[2023/02/13 13:08:36.216 +08:00] [INFO] [owner.go:280] ["add a table"] [tableId=2342768] [boundaryTs=439384345192955912]
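
For scale: the timestamps above show one "add a table" entry roughly every 12 ms, so scheduling all 280,000 tables would take on the order of 280,000 × 0.012 s ≈ 56 minutes, which is consistent with the changefeed appearing stuck.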

TiCDC Monitoring:

| username: Billmay表妹 | Original post link

It seems similar to this issue~

This problem has been fixed in versions after 6.0. You can try the new version 6.5.0 in your test environment.

| username: Meditator | Original post link

  1. The upstream TiDB has 280,000 tables, which is indeed a lot. Are they all in one database or spread across multiple databases? Can you try splitting them into multiple changefeeds? (A sketch follows this list.)
  2. The error log clearly shows that the PD server is stuck. How is the PD cluster deployed? etcd has relatively high I/O requirements.
  3. You can check the load on etcd to see which nodes have hit their bottleneck.
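
A sketch of what splitting might look like, using the same v1 open API as the original post; the host, sink credentials, and the filter_rules field are assumptions to verify against your deployment:

# Remove the stuck changefeed first (takes effect once the owner responds)
curl -X DELETE http://192.160.139.56:8301/api/v1/changefeeds/wps-mysql
# Then create one smaller changefeed per group of databases
curl -X POST -H "Content-Type: application/json" http://192.160.139.56:8301/api/v1/changefeeds -d '{
  "changefeed_id": "wps-mysql-part1",
  "start_ts": 439384345192955912,
  "force_replicate": true,
  "sink_uri": "mysql://user:pass@host:3306/?worker-count=16&max-txn-row=256",
  "filter_rules": ["cwbaseshow1025.*", "cwbaseshow1026.*"]
}'
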
| username: liyuntang | Original post link

  1. There are 280,000 tables distributed across 3,000 databases, with each database containing about 100 tables. Currently, we plan to handle this with multiple changefeeds, but TiCDC is stuck and unresponsive, making it impossible to delete the current changefeed.

  2. The PD cluster is deployed on Kubernetes with 3 replicas, 4 cores, 8 GB of memory, and 100 GB SSD disks. The PD monitoring chart for the last 24 hours is as follows:

The current PD leader logs are as follows:
[2023/02/14 09:00:57.713 +08:00] [INFO] [operator_controller.go:635] ["send schedule command"] [region-id=1202584] [step="transfer leader from store 1 to store 5"] [source=create]
[2023/02/14 09:00:57.714 +08:00] [INFO] [region.go:543] ["leader changed"] [region-id=1202584] [from=1] [to=5]
[2023/02/14 09:00:57.715 +08:00] [INFO] [operator_controller.go:552] ["operator finish"] [region-id=1202584] [takes=1.684627ms] [operator="\"transfer-hot-read-leader {transfer leader: store 1 to 5} (kind:hot-region,leader, region:1202584(86,17), createAt:2023-02-14 09:00:57.712943653 +0800 CST m=+941027.276799437, startAt:2023-02-14 09:00:57.713414336 +0800 CST m=+941027.277270130, currentStep:1, steps:[transfer leader from store 1 to store 5]) finished\""] [additional-info=]
[2023/02/14 09:00:59.114 +08:00] [INFO] [operator_controller.go:437] ["add operator"] [region-id=1030033] [operator="\"balance-leader {transfer leader: store 5 to 1} (kind:leader, region:1030033(81,17), createAt:2023-02-14 09:00:59.114682035 +0800 CST m=+941028.678537814, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 5 to store 1])\""] [additional-info="{\"sourceScore\":\"236.00\",\"targetScore\":\"234.00\"}"]
[2023/02/14 09:00:59.114 +08:00] [INFO] [operator_controller.go:635] ["send schedule command"] [region-id=1030033] [step="transfer leader from store 5 to store 1"] [source=create]
[2023/02/14 09:00:59.116 +08:00] [INFO] [region.go:543] ["leader changed"] [region-id=1030033] [from=5] [to=1]
[2023/02/14 09:00:59.116 +08:00] [INFO] [operator_controller.go:552] ["operator finish"] [region-id=1030033] [takes=1.435858ms] [operator="\"balance-leader {transfer leader: store 5 to 1} (kind:leader, region:1030033(81,17), createAt:2023-02-14 09:00:59.114682035 +0800 CST m=+941028.678537814, startAt:2023-02-14 09:00:59.114890233 +0800 CST m=+941028.678746021, currentStep:1, steps:[transfer leader from store 5 to store 1]) finished\""] [additional-info="{\"sourceScore\":\"236.00\",\"targetScore\":\"234.00\"}"]
[2023/02/14 09:01:37.726 +08:00] [INFO] [operator_controller.go:437] ["add operator"] [region-id=1202596] [operator="\"transfer-hot-read-leader {transfer leader: store 1 to 4} (kind:hot-region,leader, region:1202596(88,17), createAt:2023-02-14 09:01:37.726820516 +0800 CST m=+941067.290676272, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 1 to store 4])\""] [additional-info=]
[2023/02/14 09:01:37.727 +08:00] [INFO] [operator_controller.go:635] ["send schedule command"] [region-id=1202596] [step="transfer leader from store 1 to store 4"] [source=create]
[2023/02/14 09:01:37.729 +08:00] [INFO] [region.go:543] ["leader changed"] [region-id=1202596] [from=1] [to=4]
[2023/02/14 09:01:37.729 +08:00] [INFO] [operator_controller.go:552] ["operator finish"] [region-id=1202596] [takes=2.227342ms] [operator="\"transfer-hot-read-leader {transfer leader: store 1 to 4} (kind:hot-region,leader, region:1202596(88,17), createAt:2023-02-14 09:01:37.726820516 +0800 CST m=+941067.290676272, startAt:2023-02-14 09:01:37.727068308 +0800 CST m=+941067.290924071, currentStep:1, steps:[transfer leader from store 1 to store 4]) finished\""] [additional-info=]
[2023/02/14 09:01:42.193 +08:00] [INFO] [operator_controller.go:437] ["add operator"] [region-id=1201876] [operator="\"balance-leader {transfer leader: store 5 to 1} (kind:leader, region:1201876(107,17), createAt:2023-02-14 09:01:42.193314649 +0800 CST m=+941071.757170398, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 5 to store 1])\""] [additional-info="{\"sourceScore\":\"235.00\",\"targetScore\":\"234.00\"}"]
[2023/02/14 09:01:42.193 +08:00] [INFO] [operator_controller.go:635] ["send schedule command"] [region-id=1201876] [step="transfer leader from store 5 to store 1"] [source=create]
[2023/02/14 09:01:42.194 +08:00] [INFO] [region.go:543] ["leader changed"] [region-id=1201876] [from=5] [to=1]
[2023/02/14 09:01:42.195 +08:00] [INFO] [operator_controller.go:552] ["operator finish"] [region-id=1201876] [takes=1.671202ms] [operator="\"balance-leader {transfer leader: store 5 to 1} (kind:leader, region:1201876(107,17), createAt:2023-02-14 09:01:42.193314649 +0800 CST m=+941071.757170398, startAt:2023-02-14 09:01:42.193461837 +0800 CST m=+941071.757317621, currentStep:1, steps:[transfer leader from store 5 to store 1]) finished\""] [additional-info="{\"sourceScore\":\"235.00\",\"targetScore\":\"234.00\"}"]
[2023/02/14 09:01:54.413 +08:00] [INFO] [operator_controller.go:437] ["add operator"] [region-id=1202580] [operator="\"transfer-hot-write-leader {transfer leader: store 5 to 1} (kind:hot-region,leader, region:1202580(87,17), createAt:2023-02-14 09:01:54.413581571 +0800 CST m=+941083.977437355, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 5 to store 1])\""] [additional-info=]
[2023/02/14 09:01:54.414 +08:00] [INFO] [operator_controller.go:635] ["send schedule command"] [region-id=1202580] [step="transfer leader from store 5 to store 1"] [source=create]
[2023/02/14 09:01:54.415 +08:00] [INFO] [region.go:543] ["leader changed"] [region-id=1202580] [from=5] [to=1]
[2023/02/14 09:01:54.415 +08:00] [INFO] [operator_controller.go:552] ["operator finish"] [region-id=1202580] [takes=1.654517ms] [operator="\"transfer-hot-write-leader {transfer leader: store 5 to 1} (kind:hot-region,leader, region:1202580(87,17), createAt:2023-02-14 09:01:54.413581571 +0800 CST m=+941083.977437355, startAt:2023-02-14 09:01:54.414144201 +0800 CST m=+941083.977999990, currentStep:1, steps:[transfer leader from store 5 to store 1]) finished\""] [additional-info=]
[2023/02/14 09:01:55.714 +08:00] [INFO] [operator_controller.go:437] ["add operator"] [region-id=1202620] [operator="\"transfer-hot-write-leader {transfer leader: store 5 to 4} (kind:hot-region,leader, region:1202620(88,17), createAt:2023-02-14 09:01:55.714092001 +0800 CST m=+941085.277947783, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 5 to store 4])\""] [additional-info=]
[2023/02/14 09:01:55.714 +08:00] [INFO] [operator_controller.go:635] ["send schedule command"] [region-id=1202620] [step="transfer leader from store 5 to store 4"] [source=create]
[2023/02/14 09:01:55.716 +08:00] [INFO] [region.go:543] ["leader changed"] [region-id=1202620] [from=5] [to=4]
[2023/02/14 09:01:55.716 +08:00] [INFO] [operator_controller.go:552] ["operator finish"] [region-id=1202620] [takes=2.082143ms] [operator="\"transfer-hot-write-leader {transfer leader: store 5 to 4} (kind:hot-region,leader, region:1202620(88,17), createAt:2023-02-14 09:01:55.714092001 +0800 CST m=+941085.277947783, startAt:2023-02-14 09:01:55.714406114 +0800 CST m=+941085.278261902, currentStep:1, steps:[transfer leader from store 5 to store 4]) finished\""] [additional-info=]
[2023/02/14 09:04:54.814 +08:00] [INFO] [operator_controller.go:437] ["add operator"] [region-id=1202612] [operator="\"transfer-hot-write-leader {transfer leader: store 5 to 1} (kind:hot-region,leader, region:1202612(87,17), createAt:2023-02-14 09:04:54.813887497 +0800 CST m=+941264.377743282, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 5 to store 1])\""] [additional-info=]
[2023/02/14 09:04:54.814 +08:00] [INFO] [operator_controller.go:635] ["send schedule command"] [region-id=1202612] [step="transfer leader from store 5 to store 1"] [source=create]
[2023/02/14 09:04:54.815 +08:00] [INFO] [region.go:543] ["leader changed"] [region-id=1202612] [from=5] [to=1]
[2023/02/14 09:04:54.816 +08:00] [INFO] [operator_controller.go:552] ["operator finish"] [region-id=1202612] [takes=1.80506ms] [operator="\"transfer-hot-write-leader {transfer leader: store 5 to 1} (kind:hot-region,leader, region:1202612(87,17), createAt:2023-02-14 09:04:54.813887497 +0800 CST m=+941264.377743282, startAt:2023-02-14 09:04:54.814324868 +0800 CST m=+941264.378180654, currentStep:1, steps:[transfer leader from store 5 to store 1]) finished\""] [additional-info=]
[2023/02/14 09:04:56.114 +08:00] [INFO] [operator_controller.go:437] ["add operator"] [region-id=1202624] [operator="\"transfer-hot-read-leader {transfer leader: store 5 to 4} (kind:hot-region,leader, region:1202624(88,17), createAt:2023-02-14 09:04:56.114255348 +0800 CST m=+941265.678111133, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 5 to store 4])\""] [additional-info=]
[2023/02/14 09:04:56.114 +08:00] [INFO] [operator_controller.go:635] ["send schedule command"] [region-id=1202624] [step="transfer leader from store 5 to store 4"] [source=create]
[2023/02/14 09:04:56.116 +08:00] [INFO] [region.go:543] ["leader changed"] [region-id=1202624] [from=5] [to=4]
[2023/02/14 09:04:56.116 +08:00] [INFO] [operator_controller.go:552] ["operator finish"] [region-id=1202624] [takes=2.007555ms] [operator="\"transfer-hot-read-leader {transfer leader: store 5 to 4} (kind:hot-region,leader, region:1202624(88,17), createAt:2023-02-14 09:04:56.114255348 +0800 CST m=+941265.678111133, startAt:2023-02-14 09:04:56.114821758 +0800 CST m=+941265.678677547, currentStep:1, steps:[transfer leader from store 5 to store 4]) finished\""] [additional-info=]
[2023/02/14 09:07:55.136 +08:00] [INFO] [operator_controller.go:437] ["add operator"] [region-id=1202576] [operator="\"transfer-hot-read-leader {transfer leader: store 5 to 4} (kind:hot-region,leader, region:1202576(88,17), createAt:2023-02-14 09:07:55.13624988 +0800 CST m=+941444.700105662, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 5 to store 4])\""] [additional-info=]
[2023/02/14 09:07:55.136 +08:00] [INFO] [operator_controller.go:635] ["send schedule command"] [region-id=1202576] [step="transfer leader from store 5 to store 4"] [source=create]
[2023/02/14 09:07:55.138 +08:00] [INFO] [region.go:543] ["leader changed"] [region-id=1202576] [from=5] [to=4]
[2023/02/14 09:07:55.139 +08:00] [INFO] [operator_controller.go:552] ["operator finish"] [region-id=1202576] [takes=2.390309ms] [operator="\"transfer-hot-read-leader {transfer leader: store 5 to 4} (kind:hot-region,leader, region:1202576(88,17), createAt:2023-02-14 09:07:55.13624988 +0800 CST m=+941444.700105662, startAt:2023-02-14 09:07:55.136876198 +0800 CST m=+941444.700731999, currentStep:1, steps:[transfer

| username: liyuntang | Original post link

The content above is the TiCDC log.

  1. Does "add a table" correspond to the scheduler's Add Table process?
  2. How can I check the progress of "add table"?
  3. Is the warning "etcd client outCh blocking too long, the etcdWorker may be stuck" triggered by this bug: "cdc emit row changed event to downstream kafka may block a long time" (Issue #4987 · pingcap/tiflow · GitHub)?
| username: Meditator | Original post link

Judging from that issue, it does look like the same problem.

| username: neilshen | Original post link

Currently, TiCDC only supports replicating TiDB clusters with up to 100,000 tables (this limit still applies in the latest version, 6.5). You can add [filter.rules] to the changefeed configuration to filter out unneeded tables and reduce the number of replicated tables. For example:
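
A minimal sketch of such a configuration applied through cdc cli; the database names, PD address, and sink details below are placeholders:

cat > changefeed.toml <<'EOF'
[filter]
# Replicate only these databases; all other tables are filtered out
rules = ['cwbaseshow1025.*', 'cwbaseshow1026.*']
EOF
cdc cli changefeed create --pd=http://<pd-host>:2379 \
  --sink-uri='mysql://user:pass@host:3306/' \
  --changefeed-id='feed-part1' --config=changefeed.toml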

| username: liyuntang | Original post link

That parameter is already configured. Currently, each changefeed replicates about 3,000 tables. The configuration is as follows:

{
  "info": {
    "sink-uri": "mysql://admin:ghj8d@10…180:21268/?time-zone=&worker-count=16&max-txn-row=5000",
    "opts": {
      "_changefeed_id": "sink-verify"
    },
    "create-time": "2023-02-14T17:42:29.961246216+08:00",
    "start-ts": 439447996443131906,
    "target-ts": 0,
    "admin-job-type": 0,
    "sort-engine": "unified",
    "sort-dir": "",
    "config": {
      "case-sensitive": true,
      "enable-old-value": true,
      "force-replicate": true,
      "check-gc-safe-point": true,
      "filter": {
        "rules": [
          "cwbaseshow1025.*",
          "cwbaseshow1026.*",
          "cwbaseshow1027.*",
          "cwbaseshow1028.*",
          "cwbaseshow1029.*",
          "cwbaseshow103.*",
          "cwbaseshow1030.*",
          "cwbaseshow1031.*",
          "cwbaseshow1032.*",
          "cwbaseshow1033.*",
          "cwbaseshow1034.*",
          "cwbaseshow1035.*",
          "cwbaseshow1036.*",
          "cwbaseshow1037.*",
          "cwbaseshow1038.*",
          "cwbaseshow1039.*",
          "cwbaseshow104.*",
          "cwbaseshow1040.*",
          "cwbaseshow1041.*",
          "cwbaseshow1042.*",
          "cwbaseshow1043.*",
          "cwbaseshow1044.*",
          "cwbaseshow1045.*",
          "cwbaseshow1046.*",
          "cwbaseshow1047.*",
          "cwbaseshow1048.*",
          "cwbaseshow1049.*",
          "cwbaseshow105.*",
          "cwbaseshow1050.*",
          "cwbaseshow1051.*"
        ],
        "ignore-txn-start-ts": null
      },
      "mounter": {
        "worker-num": 16
      },
      "sink": {
        "dispatchers": null,
        "protocol": "",
        "column-selectors": null
      },
      "cyclic-replication": {
        "enable": false,
        "replica-id": 0,
        "filter-replica-ids": null,
        "id-buckets": 0,
        "sync-ddl": false
      },
      "scheduler": {
        "type": "table-number",
        "polling-time": -1
      },
      "consistent": {
        "level": "none",
        "max-log-size": 64,
        "flush-interval": 1000,
        "storage": ""
      }
    },
    "state": "normal",
    "error": null,
    "sync-point-enabled": false,
    "sync-point-interval": 600000000000,
    "creator-version": "v5.4.1"
  },
  "status": {
    "resolved-ts": 439450174251597830,
    "checkpoint-ts": 439450173989191687,
    "admin-job-type": 0
  },
  "count": 0,
  "task-status": [
    {
      "capture-id": "40618b2a-e18b-47e2-958c-20be07da05a8",
      "status": {
        "tables": {
          "2007506": {
            "start-ts": 439447996443131906,
            "mark-table-id": 0
          },
          "2007756": {
            "start-ts": 439447996443131906,
            "mark-table-id": 0
          },
          "2007912": {
            "start-ts": 439447996443131906,
            "mark-table-id": 0
          },
          "2008074": {
            "start-ts": 439447996443131906,
            "mark-table-id": 0
          },
          "2008122": {
            "start-ts": 439447996443131906,
            "mark-table-id": 0
          },

Creating the changefeed is very slow, taking about 20 minutes to complete. The TiCDC logs contain a large number of warnings like the following:

[2023/02/14 17:46:14.634 +08:00] [WARN] [client.go:272] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=12.997394111s] [role=processor]
[2023/02/14 17:46:15.635 +08:00] [WARN] [client.go:272] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=13.99796724s] [role=processor]
[2023/02/14 17:57:16.831 +08:00] [WARN] [client.go:272] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=3m9.997947845s] [role=processor]
[2023/02/14 17:57:17.831 +08:00] [WARN] [client.go:272] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=3m10.997595524s] [role=processor]
[2023/02/14 18:01:04.293 +08:00] [WARN] [client.go:272] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=10.58260022s] [role=owner]
[2023/02/14 18:01:05.293 +08:00] [WARN] [client.go:272] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=11.582521251s] [role=owner]
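
To quantify the etcd pressure behind these warnings, a minimal check against PD's embedded etcd (a sketch, assuming etcdctl v3 is available and PD listens on its default client port):

export ETCDCTL_API=3
# Latency and DB size of each PD/etcd endpoint
etcdctl --endpoints=http://<pd-host>:2379 endpoint status -w table
# TiCDC keeps its state under /tidb/cdc; the key count grows with the number
# of replicated tables, and a very large count slows down the etcdWorker
etcdctl --endpoints=http://<pd-host>:2379 get /tidb/cdc --prefix --keys-only | wc -l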

| username: neilshen | Original post link

This kind of log indicates that TiCDC is approaching its replication limit; the number of tables TiCDC can replicate in version 5.x is limited. It is recommended to move to version 6.1 LTS, which is much more stable.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.