TiCDC did not trigger the ticdc_processor_exit_with_error_count alert

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc没有发出来ticdc_processor_exit_with_error_count告警

| username: 开心大河马

【TiDB Usage Environment】Production Environment
【TiDB Version】
V7.1.0
【Reproduction Path】
An EXCHANGE PARTITION operation was performed in production, after which TiCDC replication reported an error.
【Encountered Problem: Phenomenon and Impact】
A DDL statement was executed on the upstream cluster. TiCDC hit an error while handling it, and the replication task failed.
【Resource Configuration】

The error metrics can no longer be viewed on the Grafana page, because task1 has already been deleted.

The Grafana page shows that the task xxxx-to-odsmysql-task1 has failed. It also still shows another test task that was deleted a long time ago.

cdc log:

[2023/09/14 12:31:17.512 +08:00] [ERROR] [processor.go:1015] ["processor sub-component fails"] [namespace=default] [changefeed=xxxx-to-odsmysql-task1] [name=ddlHandler] [error="[CDC:ErrSnapshotTableNotFound]table 341 not found in schema snapshot"] [errorVerbose="[CDC:ErrSnapshotTableNotFound]table 341 not found in schema snapshot\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/normalize.go:164\ngithub.com/pingcap/tiflow/cdc/entry/schema.(*snapshot).getSourceTable\n\tgithub.com/pingcap/tiflow/cdc/entry/schema/snapshot.go:933\ngithub.com/pingcap/tiflow/cdc/entry/schema.(*snapshot).exchangePartition\n\tgithub.com/pingcap/tiflow/cdc/entry/schema/snapshot.go:957\ngithub.com/pingcap/tiflow/cdc/entry/schema.(*Snapshot).DoHandleDDL\n\tgithub.com/pingcap/tiflow/cdc/entry/schema/snapshot.go:468\ngithub.com/pingcap/tiflow/cdc/entry/schema.(*Snapshot).HandleDDL\n\tgithub.com/pingcap/tiflow/cdc/entry/schema/snapshot.go:382\ngithub.com/pingcap/tiflow/cdc/entry.(*schemaStorageImpl).HandleDDLJob\n\tgithub.com/pingcap/tiflow/cdc/entry/schema_storage.go:207\ngithub.com/pingcap/tiflow/cdc/puller.(*ddlJobPullerImpl).handleJob\n\tgithub.com/pingcap/tiflow/cdc/puller/ddl_puller.go:405\ngithub.com/pingcap/tiflow/cdc/puller.(*ddlJobPullerImpl).Run.func2\n\tgithub.com/pingcap/tiflow/cdc/puller/ddl_puller.go:123\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1598"]
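
While the alert is not firing, the log line above can still be checked by a script. The following is only a minimal sketch: the function name `parse_cdc_error` and the regular expressions are illustrative assumptions based solely on the line format shown here, not part of TiCDC. It pulls out the changefeed and the CDC error code from an ERROR line:

```python
import re
from typing import Optional

# Assumed layout, taken from the log line above:
# [time] [LEVEL] [file:line] ["message"] [key=value] ...
HEADER = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] '
    r'\[(?P<source>[^\]]+)\] \["(?P<message>[^"]*)"\]'
)
CHANGEFEED = re.compile(r'\[changefeed=(?P<changefeed>[^\]]+)\]')
ERROR_CODE = re.compile(r'\[CDC:(?P<code>\w+)\]')


def parse_cdc_error(line: str) -> Optional[dict]:
    """Return key fields of an ERROR log line, or None for any other line."""
    header = HEADER.match(line)
    if header is None or header.group("level") != "ERROR":
        return None
    changefeed = CHANGEFEED.search(line)
    code = ERROR_CODE.search(line)
    return {
        "time": header.group("time"),
        "source": header.group("source"),
        "message": header.group("message"),
        "changefeed": changefeed.group("changefeed") if changefeed else None,
        "error_code": code.group("code") if code else None,
    }


# Shortened copy of the log line from this post (errorVerbose omitted).
sample = (
    '[2023/09/14 12:31:17.512 +08:00] [ERROR] [processor.go:1015] '
    '["processor sub-component fails"] [namespace=default] '
    '[changefeed=xxxx-to-odsmysql-task1] [name=ddlHandler] '
    '[error="[CDC:ErrSnapshotTableNotFound]table 341 not found in schema snapshot"]'
)

if __name__ == "__main__":
    print(parse_cdc_error(sample))
```

A script like this could run over the cdc log as a stopgap, but it does not explain why the metric-based alert stayed silent.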

However, no alert was sent for this error.

A few hours later, a new task, xxxx-to-odsmysql-task1-track, was added, and the delay alert for it fired normally.
I would like to ask whether anyone knows:

  1. Why was no alert sent for the error on task xxxx-to-odsmysql-task1?
  2. Which metrics expose this kind of failure? Is there a template, and how can the metric be used in Grafana to send alerts for such failures?
  3. Why are previously deleted test tasks still visible?
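
For question 2, one thing that could be checked is whether the counter named in the alert changed at all for the failed task. The sketch below only builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) using the PromQL `changes()` function. The `changefeed` label name, the 5m window, and the Prometheus address are assumptions, not confirmed for this cluster:

```python
from urllib.parse import urlencode

# Metric name taken from the alert in the title of this post.
METRIC = "ticdc_processor_exit_with_error_count"


def build_promql(changefeed: str, window: str = "5m") -> str:
    """PromQL asking how often the counter changed within the window.

    Assumption: the metric carries a `changefeed` label.
    """
    return f'changes({METRIC}{{changefeed="{changefeed}"}}[{window}])'


def build_query_url(prometheus_base: str, changefeed: str, window: str = "5m") -> str:
    """URL for a Prometheus instant query (GET /api/v1/query)."""
    query = urlencode({"query": build_promql(changefeed, window)})
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{query}"


if __name__ == "__main__":
    # Hypothetical Prometheus address; replace with the real one.
    print(build_query_url("http://127.0.0.1:9090", "xxxx-to-odsmysql-task1"))
```

If the query returns no series for the task, the counter may never have been reported for it, which would be one possible reason for the missing alert.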
| username: wangkk2024

Here to learn.