How to Query TiCDC Synchronization Task Status in Prometheus

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何在prometheus 中查询 TiCDC 同步任务状态

| username: TiDBer_yyy

[TiDB Usage Environment] Production Environment

[TiDB Version] 5.0.4

[Encountered Problem: Issue Phenomenon and Impact]
Original requirement: trigger an alert when a TiCDC changefeed (synchronization task) is proactively paused.

Problem:
Grafana has a panel displaying “changefeed status”, but it shows no data, and querying the metric directly also returns no data.

| username: Lucien-卢西恩 | Original post link

Check whether the data exists in Prometheus itself; Prometheus listens on port 9090 by default.
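
For example, a minimal check against the Prometheus HTTP API might look like the sketch below. It assumes Prometheus is reachable at 127.0.0.1:9090 and uses the ticdc_owner_status metric name mentioned later in this thread; adjust both as needed.

```python
# Minimal sketch: ask Prometheus whether the ticdc_owner_status metric has any
# series at all. The Prometheus address is an assumption; adjust as needed.
import json
import urllib.parse
import urllib.request

PROM = "http://127.0.0.1:9090"
params = urllib.parse.urlencode({"query": "ticdc_owner_status"})

with urllib.request.urlopen(f"{PROM}/api/v1/query?{params}", timeout=5) as resp:
    data = json.load(resp)

result = data.get("data", {}).get("result", [])
if not result:
    print("ticdc_owner_status: no data in Prometheus")
else:
    for series in result:
        print(series["metric"], "=>", series["value"][1])
```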

| username: TiDBer_yyy | Original post link

Thank you for the reply:

  1. There is no ticdc_owner_status metric in Prometheus, and the capture’s metrics endpoint does not return it either:
    curl -i http://127.0.0.1:8300/metrics | grep ticdc_owner_status

  2. Is there any other way to customize CDC alert rules to check the changefeed status?

| username: jansu-dev | Original post link

Hello,

  1. The current cluster has changefeeds, right?
  2. Don’t rely on grep alone; it sometimes misses the metric. Pipe the output into less instead (curl … | less) and search with /ticdc_owner_status.
  3. The metric is backed by changefeedStatusGauge, which the owner normally updates for each changefeed during its Tick. So either other monitoring is also broken, which points to a problem on the monitoring side, or something is wrong with the changefeed or the capture owner, which points to a problem on the synchronization side.
  4. Also note that a TiCDC cluster has exactly one owner at any given time, so this metric may live on a capture other than the one you checked. Check it on the owner capture; that may be why it was not found. See the sketch after this list.
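
A quick way to check all captures at once, instead of grepping just one of them, is a short script along these lines (a sketch only; the capture addresses are placeholders, replace them with your own):

```python
# Sketch: fetch /metrics from every capture and report which of them, if any,
# expose ticdc_owner_status. Only the owner capture is expected to expose it.
# The capture addresses are placeholders; replace them with your own.
import urllib.request

CAPTURES = ["172.16.0.1:8300", "172.16.0.2:8300"]

for addr in CAPTURES:
    try:
        with urllib.request.urlopen(f"http://{addr}/metrics", timeout=5) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        print(f"{addr}: request failed: {exc}")
        continue
    hits = [line for line in body.splitlines()
            if line.startswith("ticdc_owner_status")]
    print(f"{addr}: {'found' if hits else 'not found'}")
    for line in hits:
        print("   ", line)
```
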
| username: TiDBer_yyy | Original post link

  1. Yes, the goal is to check the synchronization status of the cluster’s changefeeds. The original requirement: alert when a changefeed’s status is not normal.
  2. Searching in less also finds no ticdc_owner_status metric.
    (screenshot)
  3. As for points 3 and 4, I don’t quite understand them :see_no_evil:

| username: jansu-dev | Original post link

  1. Only the capture that is currently the owner exposes this metric. A randomly selected capture may well not be the owner, so checking an arbitrary capture is not very meaningful. For the definition of the owner, see → 专栏 - TiCDC系列分享-02-剖析同步模型与基本架构 | TiDB 社区 (TiCDC series part 02: the synchronization model and basic architecture, TiDB Community). A sketch for locating the owner capture follows after this list.

  2. I think it’s unlikely that there will be an issue here. If there is an issue, it likely indicates a problem with the synchronization functionality.
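
To confirm which capture is currently the owner before checking its metrics, something like the sketch below should work. It assumes the cdc binary is on PATH and that the capture list output is JSON with is-owner and address fields (verify the field names against your TiCDC version); the PD address is a placeholder.

```python
# Sketch: find the owner capture via `cdc cli capture list`.
# Assumes the JSON output contains "is-owner" and "address" fields; verify
# them against your TiCDC version. The PD address is a placeholder.
import json
import subprocess

PD = "http://127.0.0.1:2379"  # placeholder PD endpoint

out = subprocess.run(
    ["cdc", "cli", "capture", "list", f"--pd={PD}"],
    capture_output=True, text=True, check=True,
).stdout

for capture in json.loads(out):
    if capture.get("is-owner"):
        print("owner capture:", capture.get("address"))
        break
else:
    print("no owner found in capture list output")
```

Once the owner is known, query that capture’s :8300/metrics endpoint and look for ticdc_owner_status there.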

| username: jansu-dev | Original post link

  1. Regarding “there is a Grafana panel showing changefeed status, but it has no data, and querying the metric returns nothing”: if other metrics are present and only this one is missing, it is probably not because Prometheus failed to pull from the capture port. More likely, something went wrong with the owner election between the captures, so the metric is never generated. Are there any anomalies in the logs?
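
To rule out the scraping side quickly, you can ask Prometheus which targets it is actively scraping and whether the capture endpoints are healthy. The sketch below uses the standard /api/v1/targets API; the Prometheus address is a placeholder.

```python
# Sketch: confirm Prometheus is scraping the TiCDC capture endpoints at all.
# Uses the standard /api/v1/targets API; the Prometheus address is a placeholder.
import json
import urllib.request

PROM = "http://127.0.0.1:9090"

with urllib.request.urlopen(f"{PROM}/api/v1/targets", timeout=5) as resp:
    targets = json.load(resp)["data"]["activeTargets"]

for target in targets:
    url = target.get("scrapeUrl", "")
    if ":8300" in url:  # default TiCDC capture port used in this thread
        print(url, "health:", target.get("health"),
              "lastError:", target.get("lastError"))
```

If both captures show health "up" with no lastError, the scrape path is fine and the problem is on the metric-generation (owner) side.
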
| username: TiDBer_yyy | Original post link

  • There are currently two CDC nodes. curl against both metrics endpoints returns no ticdc_owner_status metric:
curl 172.16.0.1:8300/metrics
curl 172.16.0.2:8300/metrics
  • Checking the CDC logs, the following errors appear:
[2022/12/08 00:31:29.256 +08:00] [ERROR] [owner.go:1354] ["watch owner campaign key failed, restart the watcher"] [error="etcdserver: mvcc: required revision has been compacted"]
[2022/12/08 00:31:29.288 +08:00] [WARN] [owner.go:1726] ["watch capture returned"] [error="[CDC:ErrOwnerEtcdWatch]etcdserver: mvcc: required revision has been compacted"] [errorVerbose="[CDC:ErrOwnerEtcdWatch]etcdserver: mvcc: required revision has been compacted\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:30\ngithub.com/pingcap/ticdc/cdc.(*Owner).watchCapture\n\tgithub.com/pingcap/ticdc@/cdc/owner.go:1636\ngithub.com/pingcap/ticdc/cdc.(*Owner).startCaptureWatcher.func1\n\tgithub.com/pingcap/ticdc@/cdc/owner.go:1713\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
| username: jansu-dev | Original post link

ErrEventFeedEventError should not, in theory, interfere with the owner election. Which capture does the monitoring currently show as the owner? Please upload the TiCDC panel and the owner’s logs for the recent period.

| username: TiDBer_yyy | Original post link

  • TiCDC Panel

  • Owner logs (the key has been sent to you via private message)
    cdc.log.tar.gz (13.5 MB)
| username: jansu-dev | Original post link

  1. From the logs and the panel, normal synchronization does not appear to be affected.
  2. Tracing back from where the metric is updated leaves too many possibilities, and the logs don’t give much to go on :thinking:
  3. There is one part of the logs that might be related, as shown below.

In summary, the current suggestion is to restart all captures and see whether that restores the metric; a restart seems likely to resolve this.
Additionally:

  1. When did the alerts start? Was it after 12/08?
  2. Was TiCDC upgraded?
| username: TiDBer_yyy | Original post link

  1. I didn’t pay attention to when it started.
  2. TiCDC was not upgraded; it was deployed later, directly with tiup.

Boss, how do I restart the captures? tiup cluster restart cluster_name -R cdc?

What impact does restarting capture have?

| username: jansu-dev | Original post link

Actually, the impact is not significant. After the restart, the connections between the captures and TiKV are re-established, which incurs some overhead (rebuilding the stream for each Region). Synchronization progress will lag a bit while this happens, but otherwise it should be fine. It’s best to do the restart during an off-peak period for the business.

| username: TiDBer_yyy | Original post link

After restarting the capture on a single node in the offline (test) environment, the ticdc_owner_status metric still does not appear.

I’ll think of other solutions.

| username: TiDBer_yyy | Original post link

In the end, I completed the changefeed monitoring with a Python script instead.
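
The script itself was not shared in the thread. A minimal sketch of this kind of check, driven by cdc cli changefeed list and assuming the JSON output carries the state under a summary field (field names may differ between TiCDC versions), could look like this:

```python
# Sketch of a changefeed status check, not the poster's actual script.
# Assumes `cdc cli changefeed list` prints JSON and that each entry carries its
# state under "summary"; verify the field names against your TiCDC version.
# The PD address and the alert hook are placeholders.
import json
import subprocess

PD = "http://127.0.0.1:2379"  # placeholder PD endpoint


def alert(message: str) -> None:
    # Placeholder: wire this up to your real alerting channel (mail, webhook, ...).
    print("ALERT:", message)


out = subprocess.run(
    ["cdc", "cli", "changefeed", "list", f"--pd={PD}"],
    capture_output=True, text=True, check=True,
).stdout

for changefeed in json.loads(out):
    state = (changefeed.get("summary") or {}).get("state", "unknown")
    if state != "normal":
        alert(f"changefeed {changefeed.get('id')} is in state {state}")
```

Run from cron every minute or so, a check like this covers the original requirement (alert when a changefeed is paused or otherwise not normal) without depending on the ticdc_owner_status metric.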

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.