TiCDC Task Unable to Synchronize Properly (etcd client outCh blocking too long, the etcdWorker may be stuck)

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC task unable to synchronize properly (etcd client outCh blocking too long, the etcdWorker may be stuck)

| username: 雪落香杉树

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v5.4.0
[Reproduction Path] What operations were performed when the issue occurred

[Encountered Issue: Issue Phenomenon and Impact]
After restarting the TiCDC task because of replication delay, the cdc log continuously reports:
[2023/12/17 10:19:29.997 +08:00] [WARN] [client.go:226] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=1m31.999185773s]

Deleting the task and recreating it results in the same error. Restarting the TiCDC and PD nodes did not resolve the issue.
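For reference, removing and recreating a TiCDC task (changefeed) is typically done with the `cdc cli` subcommands below; the PD endpoint, changefeed ID, and sink URI here are placeholders, not the values from this cluster.

```shell
# Hypothetical values: replace the PD endpoint, changefeed ID, and
# sink URI with the ones actually used in this cluster.
PD_ADDR="http://10.0.0.1:2379"

# Remove the stuck changefeed.
tiup cdc cli changefeed remove --pd="${PD_ADDR}" --changefeed-id=my-task

# Recreate it pointing at the downstream sink.
tiup cdc cli changefeed create --pd="${PD_ADDR}" \
    --changefeed-id=my-task \
    --sink-uri="mysql://user:password@10.0.0.2:3306/"
```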
[Resource Configuration]
16 cores 64GB
[Attachments: Screenshots/Logs/Monitoring]
The earliest occurrence was around 6:45, when CDC exited.

Currently, deleting and recreating the task still does not work

According to online sources, this is a bug. Seeking help from experts, how can this be handled in the short term?

| username: Jellybean | Original post link

Is the cluster access normal at the moment? Please confirm the latency and QPS.

Based on the error messages, there might be some issues with PD etcd.

Please check if there are any anomalies in the PD monitoring panel, and also check the etcd usage space in the TiCDC monitoring panel.
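As a starting point, PD/etcd health can also be checked from the command line with pd-ctl, which `tiup ctl` ships for the matching version; the endpoint below is a placeholder.

```shell
# Check the health of all PD members (pd-ctl "health" subcommand).
tiup ctl:v5.4.0 pd -u http://10.0.0.1:2379 health

# List PD members to confirm none are missing from the etcd cluster.
tiup ctl:v5.4.0 pd -u http://10.0.0.1:2379 member

# PD embeds etcd, so etcd's data-size metric is exposed on PD's
# metrics endpoint as well.
curl -s http://10.0.0.1:2379/metrics | grep etcd_mvcc_db_total_size
```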

| username: 雪落香杉树 | Original post link

It seems that a node is missing here, but it actually shows as up.

| username: 雪落香杉树 | Original post link

Access should be normal; currently, only TiCDC has issues.

| username: 雪落香杉树 | Original post link

Previously, a task could be created in seconds; now it takes 3 minutes. The task can be rebuilt now, but the log still shows the error: etcd client outCh blocking too long, the etcdWorker may be stuck.
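To confirm the state of the rebuilt task, `cdc cli` can list and query changefeeds; the PD endpoint and changefeed ID below are placeholders.

```shell
PD_ADDR="http://10.0.0.1:2379"

# List all changefeeds with their state and checkpoint.
tiup cdc cli changefeed list --pd="${PD_ADDR}"

# Inspect one changefeed in detail (state, checkpoint-ts, errors).
tiup cdc cli changefeed query --pd="${PD_ADDR}" --changefeed-id=my-task
```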

| username: 雪落香杉树 | Original post link

I encountered many problems using TiCDC, and this is the first time I’ve encountered this particular issue. I couldn’t find the cause at all, but it has now been resolved. We are currently on version 5.4. Would upgrading to 5.4.3 improve things?

| username: 雪落香杉树 | Original post link

We also have 5.4.3 and 6.1.0 running online. Versions are updated too quickly, and new versions may introduce new issues. We mostly use the TiCDC component.

| username: xfworld | Original post link

The cross-version capabilities of ticdc 5.x, 6.x, and 7.x vary significantly, and minor version upgrades will fix some bugs.

If possible, it is recommended to upgrade to the latest minor version; this avoids issues caused by known bugs.

Currently, the latest version of ticdc in the 6.1.x series is relatively stable and has fixed several critical bugs.

| username: 雪落香杉树 | Original post link

Yes, we are preparing to upgrade to 5.4.3 next week, the last minor version of the 5.4 series. It looks like this fix has already been merged: sink/mq(cdc): Fix mq flush worker deadlock by liuzix · Pull Request #4996 · pingcap/tiflow · GitHub
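A patch-version upgrade within the same release series is normally done with `tiup cluster upgrade`; the cluster name below is a placeholder.

```shell
# Check the current topology and component versions first.
tiup cluster display my-cluster

# Rolling upgrade of the whole cluster (including TiCDC) to v5.4.3.
tiup cluster upgrade my-cluster v5.4.3
```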

| username: 小龙虾爱大龙虾 | Original post link

There are some etcd-related warnings, but the CDC process crashed because it received an exit signal, right?

| username: dba远航 | Original post link

It feels like there is an anomaly in the etcd election.

| username: 路在何chu | Original post link

Indeed, there are many inexplicable issues with the lower versions. I also resolved them by upgrading.

| username: oceanzhang | Original post link

Try upgrading and testing again. I remember there was a minor issue that was resolved by upgrading.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.