TiCDC Error, Error Code CDC-owner-1001

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiCDC报错,错误码CDC-owner-1001

| username: EricSong

[TiDB Usage Environment]
Production
[TiDB Version]

TiCDC Version
Release Version: v4.0.11
Git Commit Hash: 52a6d9ea6da595b869a43e13ae2d3680354f89b8
Git Branch: heads/refs/tags/v4.0.11
UTC Build Time: 2021-02-25 16:40:37
Go Version: go version go1.13 linux/amd64

[Encountered Issue: Problem Phenomenon and Impact]
The task is a circular synchronization task between two clusters, A and B, where B is the backup cluster. There are currently no write operations on B, and B cannot be tested for the time being, so I cannot tell whether the synchronization task is still working. The changefeed can be created successfully, but after running for a while its state is still "normal" while the following error is recorded:

"state": "normal",
  "history": [
    1679004242947
  ],
  "error": {
    "addr": "10.241.200.238:8300",
    "code": "CDC-owner-1001",
    "message": "rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader"
  },
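
Status output like the fragment above is typically obtained with cdc cli changefeed query; a minimal sketch for re-checking it, with the PD address and changefeed ID left as placeholders:

# Query the detailed status of the changefeed; <pd-host> and <changefeed-id>
# are placeholders for this environment.
cdc cli changefeed query \
    --pd=http://<pd-host>:2379 \
    --changefeed-id=<changefeed-id>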

[Resource Configuration]
Two CDC nodes
[Attachments: Screenshots/Logs/Monitoring]
According to the logs, the issue seems to be caused by a failed RPC connection to TiKV. How can this be fixed?

[2023/03/25 18:09:34.923 +00:00] [INFO] [client.go:726] ["creating new stream to store to send request"] [regionID=126366642] [requestID=4042] [storeID=65151616] [addr=10.250.78.96:20160]
[2023/03/25 18:09:34.924 +00:00] [INFO] [client.go:398] ["establish stream to store failed, retry later"] [addr=10.250.78.96:20160] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection closed"] [errorVerbose="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:28\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:397\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:32\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:31\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:375\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:731\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:521\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2023/03/25 18:09:34.924 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126366089] [startKey=7480000000000089ff1a5f728000000000ff0393680000000000fa] [endKey=7480000000000089ff1a5f728000000000ff03991c0000000000fa] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.925 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126245263] [startKey=7480000000000089ff1a5f698000000000ff0000030419ae9a3aff8100000003800000ff000002edd7000000fc] [endKey=7480000000000089ff1a5f698000000000ff0000040419ae98a3ffa900000003800000ff00000089b0000000fc] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.925 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126366891] [startKey=7480000000000089ff1a5f698000000000ff0000040419ae9928fff800000003800000ff000002dba9000000fc] [endKey=7480000000000089ff1a5f698000000000ff0000040419ae9a58ff0200000003800000ff0000001a17000000fc] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.926 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126352778] [startKey=7480000000000089ff1a5f728000000000ff03991c0000000000fa] [endKey=7480000000000089ff1a5f728000000000ff039d430000000000fa] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.927 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126366128] [startKey=7480000000000089ff1a5f728000000000ff039d430000000000fa] [endKey=7480000000000089ff1a5f728000000000ff03a4180000000000fa] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.927 +00:00] [INFO] [region_range_lock.go:217] ["range locked"] [lockID=150] [regionID=126245240] [startKey=7480000000000089ff1a5f728000000000ff03dd180000000000fa] [endKey=7480000000000089ff1a5f728000000000ff03e5a20000000000fa] [checkpointTs=440341039823454248]
[2023/03/25 18:09:34.929 +00:00] [INFO] [client.go:398] ["establish stream to store failed, retry later"] [addr=10.250.78.96:20160] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.250.78.96:20160: connect: connection refused\""] [errorVerbose="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.250.78.96:20160: connect: connection refused\"\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:28\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:397\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:32\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry.go:31\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:375\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:731\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:521\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2023/03/25 18:09:34.929 +00:00] [WARN] [client.go:734] ["get grpc stream client failed"] [regionID=125931395] [requestID=4029] [storeID=65151616] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.250.78.96:20160: connect: connection refused\""]
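
The log excerpt shows the event feed repeatedly failing to reach TiKV at 10.250.78.96:20160 ("connection closed", then "connection refused"), which points at that store being down or restarting rather than at TiCDC itself. A minimal triage sketch, assuming a placeholder PD address and TiUP cluster name, with pd-ctl available (or the equivalent tiup ctl pd invocation):

# Is the TiKV port that TiCDC cannot reach accepting connections at all?
nc -vz 10.250.78.96 20160

# Ask PD for the state of every store (Up / Disconnected / Down / Offline);
# storeID 65151616 from the log should show up here. <pd-host> is a placeholder.
pd-ctl -u http://<pd-host>:2379 store

# With a TiUP-managed cluster, node status can also be checked at a glance
# (<cluster-name> is a placeholder).
tiup cluster display <cluster-name>

The log already shows TiCDC retrying ("retry later"), so once the store is reachable again the event feed should re-establish on its own.
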
| username: TiDBer_pkQ5q1l0 | Original post link

For the circular synchronization, is TiCDC deployed on both clusters A and B?

| username: EricSong | Original post link

Yes. Currently the synchronization from A to B is normal, but there may be an issue with the synchronization from B. According to the logs, it seems to be caused by a failed RPC connection to B's TiKV, which should be unrelated to the circular synchronization setup itself. At the moment I'm not sure whether the task is still running normally, or how to recover it.
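
One way to answer "is the task still running" (a rough sketch, not something from the thread itself): watch whether the changefeed's checkpoint keeps advancing. The checkpoint is a TSO, and the standard TiDB TSO layout keeps the physical Unix time in milliseconds in the bits above the lowest 18, so it can be converted to wall-clock time roughly like this (GNU date assumed):

# checkpointTs taken from the logs in the original post; in practice use the
# checkpoint-ts reported by cdc cli changefeed query.
TSO=440341039823454248

# The high bits of a TSO hold the physical Unix time in milliseconds.
PHYSICAL_MS=$(( TSO >> 18 ))
date -d "@$(( PHYSICAL_MS / 1000 ))"

If this timestamp keeps moving forward and stays close to the current time, replication is still making progress despite the error recorded on the changefeed.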

| username: tidb菜鸟一只 | Original post link

A synchronizes to B, and then B synchronizes back to A?

| username: liuis | Original post link

Why does the backup cluster need to synchronize back to the primary cluster?

| username: EricSong | Original post link

Yes, I have already followed the official documentation to set up circular replication filtering. This way, B will not replicate the data it received from A back to A.

The purpose of this design is: under normal circumstances, the data produced by A during its operation is synchronized to B. If A encounters an issue, we switch to B, making B the primary cluster. The data generated by B is then synchronized to A, ensuring data consistency at all times.
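
For reference, the filtering described above is what the experimental cyclic replication feature in TiCDC v4.0 sets up when the changefeed is created. A rough sketch based on that documentation (addresses and IDs are placeholders; double-check the exact flags against your cdc binary):

# On each cluster, create the mark tables used to filter out already-replicated writes.
cdc cli changefeed cyclic create-marktables \
    --cyclic-upstream-uri="mysql://root@<cluster-a-tidb>:4000/" \
    --pd="http://<cluster-a-pd>:2379"

# Create the A -> B changefeed: replica ID 1 is cluster A, and rows that
# originated from replica ID 2 (cluster B) are filtered out so they are not
# replicated back.
cdc cli changefeed create \
    --sink-uri="mysql://root@<cluster-b-tidb>:4000/" \
    --pd="http://<cluster-a-pd>:2379" \
    --cyclic-replica-id 1 \
    --cyclic-filter-replica-ids 2 \
    --cyclic-sync-ddl true

The B -> A changefeed would mirror this with the replica IDs swapped.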

| username: liuis | Original post link

Got it, but I still feel that doing it this way might cause problems, and the official documentation also states that circular replication is an experimental feature.

| username: EricSong | Original post link

Yes, this has already been explained to the team that uses the service, and they understand it. However, the current error doesn't seem to be caused by circular replication; it looks more like a CDC owner or TiKV RPC error.
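
To tell those two apart, one option (not from the original thread) is to check the owner and changefeed state from the CDC side while also watching the TiKV store status; a minimal sketch with a placeholder PD address:

# List all CDC captures and see which one currently holds the owner role.
cdc cli capture list --pd=http://<pd-host>:2379

# List every changefeed on this cluster and its current state.
cdc cli changefeed list --pd=http://<pd-host>:2379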