CDC Task Abnormally Interrupted?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: CDC任务异常中断?

| username: 孤君888

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed to reproduce the issue
[Encountered Issue: Issue Phenomenon and Impact]

Background: My TiDB cluster v6.1.0 is running on CentOS 7. PD and the TiDB servers are deployed on three virtual machines (shared virtualization host, non-SSD disks), while the TiKV servers are deployed on three physical machines (SSD disks). The TiCDC cluster is co-located on the same three TiKV machines.

The upstream is TiDB 6.1.0 and the downstream is MySQL (Percona Server 5.7).

The log of one TiCDC node is as follows:

[2023/07/18 09:32:10.889 +08:00] [ERROR] [processor.go:546] ["error on running processor"] [capture=10.116.172.206:8300] [changefeed=simple-replication-task] [error="[CDC:ErrFlowControllerEventLargerThanQuota]event is larger than the total memory quota, size: 12602807, quota: 10485760"] [errorVerbose="[CDC:ErrFlowControllerEventLargerThanQuota]event is larger than the total memory quota, size: 12602807, quota: 10485760\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/cdc/sink/flowcontrol.(*tableMemoryQuota).consumeWithBlocking\n\tgithub.com/pingcap/tiflow/cdc/sink/flowcontrol/table_memory_quota.go:59\ngithub.com/pingcap/tiflow/cdc/sink/flowcontrol.(*TableFlowController).Consume\n\tgithub.com/pingcap/tiflow/cdc/sink/flowcontrol/flow_control.go:133\ngithub.com/pingcap/tiflow/cdc/processor/pipeline.(*sorterNode).start.func3\n\tgithub.com/pingcap/tiflow/cdc/processor/pipeline/sorter.go:250\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:74\nruntime.goexit\n\truntime/asm_amd64.s:1571"]

[2023/07/18 09:32:10.890 +08:00] [ERROR] [processor.go:355] ["run processor failed"] [changefeed=simple-replication-task] [capture=10.116.172.206:8300] [error="[CDC:ErrFlowControllerEventLargerThanQuota]event is larger than the total memory quota, size: 12602807, quota: 10485760"] [errorVerbose="[CDC:ErrFlowControllerEventLargerThanQuota]event is larger than the total memory quota, size: 12602807, quota: 10485760\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/cdc/sink/flowcontrol.(*tableMemoryQuota).consumeWithBlocking\n\tgithub.com/pingcap/tiflow/cdc/sink/flowcontrol/table_memory_quota.go:59\ngithub.com/pingcap/tiflow/cdc/sink/flowcontrol.(*TableFlowController).Consume\n\tgithub.com/pingcap/tiflow/cdc/sink/flowcontrol/flow_control.go:133\ngithub.com/pingcap/tiflow/cdc/processor/pipeline.(*sorterNode).start.func3\n\tgithub.com/pingcap/tiflow/cdc/processor/pipeline/sorter.go:250\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:74\nruntime.goexit\n\truntime/asm_amd64.s:1571"]

Querying with the TiUP tool shows the following error:

# tiup cdc cli changefeed list --pd x.x.x.x:2379

Starting component `cdc`: /home/tidb/.tiup/components/cdc/v6.1.1/cdc cli changefeed list --pd x.x.x.x:2379
[
  {
    "id": "simple-replication-task",
    "summary": {
      "state": "failed",
      "tso": 442929853387243521,
      "checkpoint": "2023-07-18 09:21:40.580",
      "error": {
        "addr": "x.x.x.x:8300",
        "code": "CDC:ErrFlowControllerEventLargerThanQuota",
        "message": "[CDC:ErrFlowControllerEventLargerThanQuota]event is larger than the total memory quota, size: 12602807, quota: 10485760"
      }
    }
  }
]

How can I solve this problem? The error seems to indicate that memory usage exceeds a quota, right? Is it TiCDC's memory usage?

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: tidb狂热爱好者 | Original post link

Increase the memory of the CDC machine. The memory is insufficient.

| username: zhanggame1 | Original post link

Some people have encountered the same issue, refer to this:
Error ErrFlowControllerEventLargerThanQuota during CDC synchronization - TiDB Technical Issues / Deployment & Operations Management - TiDB Q&A Community (asktug.com)

| username: tidb狂热爱好者 | Original post link

You also didn’t mention how much memory each machine node has. The error indicates insufficient memory.

| username: tidb狂热爱好者 | Original post link

The default value of per-table-memory-quota is 10485760 bytes (10 MiB). It is a cdc-server parameter, and you can change it in the CDC section of the cluster's YAML configuration.
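A minimal sketch of one way to change it, assuming the cluster is managed by TiUP and is on v6.1; the 50 MiB value below is only an illustrative example, not a recommendation:

# tiup cluster edit-config <cluster-name>
server_configs:
  cdc:
    per-table-memory-quota: 52428800   # bytes; default is 10485760 (10 MiB)

# Then reload only the CDC nodes to apply the change:
# tiup cluster reload <cluster-name> -R cdc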

| username: tidb菜鸟一只 | Original post link

Increase this parameter a bit.

| username: 扬仔_tidb | Original post link

May I ask how large everyone sets the per-table-memory-quota value? I am about to migrate 1TB of data. Can this value be reconfigured during the migration? How can I estimate it in advance?

| username: 孤君888 | Original post link

Each machine has 128 GB of memory, and TiKV and TiCDC are co-located on it without separate memory or CPU resource isolation between the two components. Also, does CDC really consume that much memory?

| username: 孤君888 | Original post link

The status of my task is currently “failed.” After I modify this parameter, do I need to delete the task and create a new one to restart the synchronization?

| username: redgame | Original post link

Delete the task, enlarge the quota, then rebuild it.
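A rough sketch of that sequence with tiup cdc cli, assuming the PD address and the MySQL sink URI below are placeholders you replace with your own; the --start-ts reuses the checkpoint TSO reported by changefeed list so the new task resumes from where the old one stopped:

# Remove the failed changefeed
tiup cdc cli changefeed remove --pd=http://x.x.x.x:2379 --changefeed-id=simple-replication-task

# (Adjust per-table-memory-quota on the cdc servers and reload them here)

# Recreate the changefeed, starting from the last checkpoint TSO
tiup cdc cli changefeed create --pd=http://x.x.x.x:2379 \
  --changefeed-id=simple-replication-task \
  --sink-uri="mysql://user:password@x.x.x.x:3306/" \
  --start-ts=442929853387243521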

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.