How to Query TiCDC Synchronization Task Status in Prometheus

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何在prometheus 中查询 TiCDC 同步任务状态

| username: TiDBer_yyy

[TiDB Usage Environment] Production Environment

[TiDB Version] 5.0.4

[Encountered Problem: Issue Phenomenon and Impact]
Original requirement: trigger an alert when a TiCDC changefeed (synchronization task) is proactively paused.

Problem:
Grafana has a panel displaying “changefeed status”, but it shows no data, and querying the metric directly also returns no data.

| username: Lucien-卢西恩 | Original post link

Check whether the data exists in Prometheus itself; Prometheus listens on port 9090 by default.
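
For example, a minimal check against the Prometheus HTTP API might look like the sketch below. It assumes Prometheus is reachable at 127.0.0.1:9090 and uses the ticdc_owner_status metric name mentioned later in this thread; adjust both as needed.

```python
# Minimal sketch: ask Prometheus whether the ticdc_owner_status metric has any
# series at all. The Prometheus address is an assumption; adjust as needed.
import json
import urllib.parse
import urllib.request

PROM = "http://127.0.0.1:9090"
params = urllib.parse.urlencode({"query": "ticdc_owner_status"})

with urllib.request.urlopen(f"{PROM}/api/v1/query?{params}", timeout=5) as resp:
    data = json.load(resp)

result = data.get("data", {}).get("result", [])
if not result:
    print("ticdc_owner_status: no data in Prometheus")
else:
    for series in result:
        print(series["metric"], "=>", series["value"][1])
```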

| username: TiDBer_yyy | Original post link

Thank you for the reply:

  1. There is no ticdc_owner_status metric in Prometheus, and the capture’s metrics endpoint does not return it either:
    curl -i http://127.0.0.1:8300/metrics | grep ticdc_owner_status

  2. Is there any other way to customize CDC alert rules to check the changefeed status?

| username: jansu-dev | Original post link

Hello,

  1. The current cluster has changefeeds, right?
  2. Don’t rely on grep alone; it sometimes misses the metric. Pipe the output into less instead (curl … | less) and search with /ticdc_owner_status.
  3. The metric is backed by changefeedStatusGauge, which the owner normally updates for each changefeed during its Tick. So either other monitoring is also broken, which points to a problem on the monitoring side, or something is wrong with the changefeed or the capture owner, which points to a problem on the synchronization side.
  4. Also note that a TiCDC cluster has exactly one owner at any given time, so this metric may live on a capture other than the one you checked. Check it on the owner capture; that may be why it was not found. See the sketch after this list.
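
A quick way to check all captures at once, instead of grepping just one of them, is a short script along these lines (a sketch only; the capture addresses are placeholders, replace them with your own):

```python
# Sketch: fetch /metrics from every capture and report which of them, if any,
# expose ticdc_owner_status. Only the owner capture is expected to expose it.
# The capture addresses are placeholders; replace them with your own.
import urllib.request

CAPTURES = ["172.16.0.1:8300", "172.16.0.2:8300"]

for addr in CAPTURES:
    try:
        with urllib.request.urlopen(f"http://{addr}/metrics", timeout=5) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        print(f"{addr}: request failed: {exc}")
        continue
    hits = [line for line in body.splitlines()
            if line.startswith("ticdc_owner_status")]
    print(f"{addr}: {'found' if hits else 'not found'}")
    for line in hits:
        print("   ", line)
```
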
| username: TiDBer_yyy | Original post link

  1. Yes, the goal is to check the synchronization status of the cluster’s changefeeds. The original requirement: alert when a changefeed’s status is not normal.
  2. Searching in less also finds no ticdc_owner_status metric.
    (screenshot)
  3. As for points 3 and 4, I don’t quite understand them :see_no_evil:

| username: jansu-dev | Original post link

  1. Only the capture that is currently the owner exposes this metric. A randomly selected capture may well not be the owner, so checking an arbitrary capture is not very meaningful. For the definition of the owner, see → 专栏 - TiCDC系列分享-02-剖析同步模型与基本架构 | TiDB 社区 (TiCDC series part 02: the synchronization model and basic architecture, TiDB Community). A sketch for locating the owner capture follows after this list.

  2. I think it’s unlikely that there will be an issue here. If there is an issue, it likely indicates a problem with the synchronization functionality.
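
To confirm which capture is currently the owner before checking its metrics, something like the sketch below should work. It assumes the cdc binary is on PATH and that the capture list output is JSON with is-owner and address fields (verify the field names against your TiCDC version); the PD address is a placeholder.

```python
# Sketch: find the owner capture via `cdc cli capture list`.
# Assumes the JSON output contains "is-owner" and "address" fields; verify
# them against your TiCDC version. The PD address is a placeholder.
import json
import subprocess

PD = "http://127.0.0.1:2379"  # placeholder PD endpoint

out = subprocess.run(
    ["cdc", "cli", "capture", "list", f"--pd={PD}"],
    capture_output=True, text=True, check=True,
).stdout

for capture in json.loads(out):
    if capture.get("is-owner"):
        print("owner capture:", capture.get("address"))
        break
else:
    print("no owner found in capture list output")
```

Once the owner is known, query that capture’s :8300/metrics endpoint and look for ticdc_owner_status there.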

| username: jansu-dev | Original post link

  1. Regarding “there is a Grafana panel showing changefeed status, but it has no data, and querying the metric returns nothing”: if other metrics are present and only this one is missing, it is probably not because Prometheus failed to pull from the capture port. More likely, something went wrong with the owner election between the captures, so the metric is never generated. Are there any anomalies in the logs?
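
To rule out the scraping side quickly, you can ask Prometheus which targets it is actively scraping and whether the capture endpoints are healthy. The sketch below uses the standard /api/v1/targets API; the Prometheus address is a placeholder.

```python
# Sketch: confirm Prometheus is scraping the TiCDC capture endpoints at all.
# Uses the standard /api/v1/targets API; the Prometheus address is a placeholder.
import json
import urllib.request

PROM = "http://127.0.0.1:9090"

with urllib.request.urlopen(f"{PROM}/api/v1/targets", timeout=5) as resp:
    targets = json.load(resp)["data"]["activeTargets"]

for target in targets:
    url = target.get("scrapeUrl", "")
    if ":8300" in url:  # default TiCDC capture port used in this thread
        print(url, "health:", target.get("health"),
              "lastError:", target.get("lastError"))
```

If both captures show health "up" with no lastError, the scrape path is fine and the problem is on the metric-generation (owner) side.
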
| username: TiDBer_yyy | Original post link

  • There are currently two CDC nodes. curl against both metrics endpoints returns no ticdc_owner_status metric:
curl 172.16.0.1:8300/metrics
curl 172.16.0.2:8300/metrics
  • Checking the CDC logs, the following errors appear:
[2022/12/08 00:31:29.256 +08:00] [ERROR] [owner.go:1354] ["watch owner campaign key failed, restart the watcher"] [error="etcdserver: mvcc: required revision has been compacted"]
[2022/12/08 00:31:29.288 +08:00] [WARN] [owner.go:1726] ["watch capture returned"] [error="[CDC:ErrOwnerEtcdWatch]etcdserver: mvcc: required revision has been compacted"] [errorVerbose="[CDC:ErrOwnerEtcdWatch]etcdserver: mvcc: required revision has been compacted\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:30\ngithub.com/pingcap/ticdc/cdc.(*Owner).watchCapture\n\tgithub.com/pingcap/ticdc@/cdc/owner.go:1636\ngithub.com/pingcap/ticdc/cdc.(*Owner).startCaptureWatcher.func1\n\tgithub.com/pingcap/ticdc@/cdc/owner.go:1713\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
| username: jansu-dev | Original post link

ErrEventFeedEventError should not, in theory, interfere with the owner election. Which capture does the monitoring currently show as the owner? Please upload the TiCDC panel and the owner’s logs for the recent period.

| username: TiDBer_yyy | Original post link

  • TiCDC Panel

  • Owner logs (the key has been sent to you via private message)
    cdc.log.tar.gz (13.5 MB)
| username: jansu-dev | Original post link

  1. From the logs and the panel, normal synchronization does not appear to be affected.
  2. Tracing back from where the metric is updated leaves too many possibilities, and the logs don’t give much to go on :thinking:
  3. There is one part of the logs that might be related, as shown below.

In summary, the current suggestion is to restart all captures and see whether that restores the metric; a restart seems likely to resolve this.
Additionally:

  1. When did the alerts start? Was it after 12/08?
  2. Was TiCDC upgraded?
| username: TiDBer_yyy | Original post link

  1. I didn’t pay attention to when it started.
  2. TiCDC was not upgraded; it was deployed later, directly with tiup.

Boss, how do I restart the captures? tiup cluster restart cluster_name -R cdc?

What impact does restarting capture have?

| username: jansu-dev | Original post link

Actually, the impact is not significant. After the restart, the connections between the captures and TiKV are re-established, which incurs some overhead (rebuilding the stream for each Region). Synchronization progress will lag a bit while this happens, but otherwise it should be fine. It’s best to do the restart during an off-peak period for the business.

| username: TiDBer_yyy | Original post link

After restarting the capture on a single node in the offline (test) environment, the ticdc_owner_status metric still does not appear.

I’ll think of other solutions.

| username: TiDBer_yyy | Original post link

In the end, I completed the changefeed monitoring with a Python script instead.
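
The script itself was not shared in the thread. A minimal sketch of this kind of check, driven by cdc cli changefeed list and assuming the JSON output carries the state under a summary field (field names may differ between TiCDC versions), could look like this:

```python
# Sketch of a changefeed status check, not the poster's actual script.
# Assumes `cdc cli changefeed list` prints JSON and that each entry carries its
# state under "summary"; verify the field names against your TiCDC version.
# The PD address and the alert hook are placeholders.
import json
import subprocess

PD = "http://127.0.0.1:2379"  # placeholder PD endpoint


def alert(message: str) -> None:
    # Placeholder: wire this up to your real alerting channel (mail, webhook, ...).
    print("ALERT:", message)


out = subprocess.run(
    ["cdc", "cli", "changefeed", "list", f"--pd={PD}"],
    capture_output=True, text=True, check=True,
).stdout

for changefeed in json.loads(out):
    state = (changefeed.get("summary") or {}).get("state", "unknown")
    if state != "normal":
        alert(f"changefeed {changefeed.get('id')} is in state {state}")
```

Run from cron every minute or so, a check like this covers the original requirement (alert when a changefeed is paused or otherwise not normal) without depending on the ticdc_owner_status metric.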

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.