TiCDC synchronization task has no errors, but downstream TSO does not advance

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc同步任务无报错,同步下游tso不推进

| username: chnage

[TiDB Usage Environment] Production Environment
[TiDB Version] ticdc
[Reproduction Path] Operations performed that led to the issue
The TiCDC checkpointTs is not advancing. Pausing and resuming the task did not help, and neither did restarting the cdc component with tiup. Deleting the sync task and recreating it did not help either. The sync task itself reports no errors.
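
A stalled checkpointTs is easier to judge once it is decoded into wall-clock time. The sketch below is not from the thread. It assumes the usual TSO layout (physical Unix milliseconds in the upper bits, an 18-bit logical counter in the low bits); the function names and the sample value are hypothetical.

```python
from datetime import datetime, timezone
from typing import Sequence

# Assumed TSO layout: physical Unix milliseconds in the upper bits,
# an 18-bit logical counter in the lowest bits.
LOGICAL_BITS = 18


def tso_to_datetime(tso: int) -> datetime:
    """Return the physical (wall-clock) part of a TSO as a UTC datetime."""
    physical_ms = tso >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)


def checkpoint_lag_seconds(checkpoint_ts: int, now: datetime) -> float:
    """Seconds between `now` and the time encoded in checkpoint_ts."""
    return (now - tso_to_datetime(checkpoint_ts)).total_seconds()


def is_stalled(samples: Sequence[int]) -> bool:
    """True when successive checkpointTs samples never increase."""
    pairs = zip(samples, samples[1:])
    return len(samples) >= 2 and all(later <= earlier for earlier, later in pairs)


# Hypothetical checkpointTs, built here only so the example is self-contained.
sample_checkpoint_ts = (1668567489759 << LOGICAL_BITS) | 5
print(tso_to_datetime(sample_checkpoint_ts).isoformat())
print(is_stalled([sample_checkpoint_ts] * 3))
```

Sampling the checkpointTs a few times and passing the values to `is_stalled` separates a changefeed that is merely lagging from one that has stopped entirely.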

[Problem Encountered: Symptoms and Impact]


![wecom-temp-438014f2f992dd70babfb317ba9711c3|690x149](upload://yXYwaNrxl4kpdqhxjYXtFoEaFc6.png)

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: xfworld | Original post link

Check the CDC logs to see what the specific error is.

| username: chnage | Original post link

(Image attachment from the original post; its content is not available.)

| username: xfworld | Original post link

Check whether any regions in the cluster are in an orphaned state… The cluster’s state needs to be restored before it can be used.

The log gives the region’s ID, and you can use that ID to look the region up.
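
Looking a region up by its ID can be scripted. The following is a minimal sketch, not something posted in the thread: it builds the argument list for `pd-ctl -u <pd_addr> region <region_id>` and classifies the output. The PD address, the sample JSON, and the assumption that a healthy region prints as a JSON object with a `leader` entry are all illustrative and should be checked against a real cluster.

```python
import json
from typing import List


def pd_ctl_region_argv(pd_addr: str, region_id: int) -> List[str]:
    """Build the argv for `pd-ctl -u <pd_addr> region <region_id>`."""
    return ["pd-ctl", "-u", pd_addr, "region", str(region_id)]


def classify_region(raw_output: str) -> str:
    """Classify pd-ctl output as 'not-found', 'no-leader', or 'has-leader'.

    Assumption: a healthy region is printed as a JSON object with an "id"
    and a "leader" object carrying a non-zero "store_id". Anything else
    (empty output, "null", an error string) is treated as not found.
    """
    try:
        region = json.loads(raw_output)
    except json.JSONDecodeError:
        return "not-found"
    if not isinstance(region, dict) or "id" not in region:
        return "not-found"
    leader = region.get("leader")
    if isinstance(leader, dict) and leader.get("store_id"):
        return "has-leader"
    return "no-leader"


# Hypothetical output for a healthy region; real output has more fields.
sample = '{"id": 1255104, "peers": [{"id": 1, "store_id": 4}], "leader": {"id": 1, "store_id": 4}}'
print(" ".join(pd_ctl_region_argv("http://127.0.0.1:2379", 1255104)))
print(classify_region(sample))
```

The argv list can be passed to `subprocess.run(..., capture_output=True, text=True)` and its `stdout` fed to `classify_region`.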

| username: chnage | Original post link

It works.

| username: chnage | Original post link

It seems there is a leader.

| username: xfworld | Original post link

(Image attachment from the original post; its content is not available.)

| username: chnage | Original post link

There are two regions that were not found.

| username: xfworld | Original post link

That means the cluster state is inconsistent, which is why CDC is not working.

First, fix the cluster’s state. Check for region problems such as empty regions, missing peers, and so on. A thorough investigation is needed.
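
One way to run such a sweep is through the `region check` subcommand of pd-ctl. The sketch below only assembles the command lines and is an illustration, not advice from the thread. The PD address is a placeholder, and the list is limited to the four peer-related check kinds; whether an empty-region check is also available depends on the PD version, so it is left out.

```python
from typing import Dict, List

# Peer-related kinds for `pd-ctl region check <kind>`.
# Other kinds may exist depending on the PD version (assumption).
CHECK_KINDS = ("miss-peer", "extra-peer", "down-peer", "pending-peer")


def region_check_commands(pd_addr: str) -> Dict[str, List[str]]:
    """Map each check kind to the argv that runs it against pd_addr."""
    return {
        kind: ["pd-ctl", "-u", pd_addr, "region", "check", kind]
        for kind in CHECK_KINDS
    }


# Placeholder PD address; replace with a real endpoint.
for kind, argv in region_check_commands("http://127.0.0.1:2379").items():
    print(kind, "->", " ".join(argv))
```

Any region listed by one of these checks is a candidate for the repair discussed in the replies that follow.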

| username: chnage | Original post link

How can I fix this situation? Is there a reference document link?

| username: xfworld | Original post link

Refer to this:

I don’t know what operations the cluster has undergone to result in this situation.
You can only try to repair it as far as possible. If it turns out that a replica has failed rather than been lost, you can delete it manually. (Take a backup before doing this.)

| username: chnage | Original post link

The cluster is currently available, but the application performs some concurrent writes and there are quite a few transaction conflicts. I’m not sure whether this has an impact.

| username: chnage | Original post link

Thank you, boss.

| username: chnage | Original post link

After taking a new backup and importing it, CDC synchronized some data. It has now stopped advancing the TSO again, but the error reported is different.

| username: xfworld | Original post link

Are the major versions of TiCDC and TiDB compatible? It’s best to use the same version…

The error in the logs is basically PD-related: the checkpoint cannot be saved…

| username: chnage | Original post link

They are the same; both are 4.0.16.

| username: Min_Chen | Original post link

Hello, according to the logs, the upstream actively closed the connection. Please check the upstream TiKV logs.

| username: chnage | Original post link

[2022/11/16 10:58:09.759 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg=LeaderCallback(Callback::Read(…))]
[2022/11/16 10:59:46.419 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg="CaptureChange { cmd: RegisterObserver { observe_id: ObserveID(2564262), region_id: 1255104, enabled: true }, region_epoch: conf_ver: 27247 version: 6891, callback: Callback::Read(…) }"]
[2022/11/16 11:00:50.238 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg="CaptureChange { cmd: RegisterObserver { observe_id: ObserveID(2564263), region_id: 1247613, enabled: true }, region_epoch: conf_ver: 422 version: 7259, callback: Callback::Read(…) }"]
[2022/11/16 11:06:15.951 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7629]
[2022/11/16 11:06:16.083 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7631]
[2022/11/16 11:06:16.111 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7630]
[2022/11/16 11:13:01.586 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7944]
[2022/11/16 11:17:40.535 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=12664]
[2022/11/16 11:17:40.649 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=12655]
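
To follow up on specific regions, the region IDs can be pulled out of the "failed to send significant msg" lines. The sketch below is an illustration rather than part of the original post: it scans log lines with a regular expression and counts each `region_id`. The function name is hypothetical, and the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter
from typing import Iterable

REGION_ID = re.compile(r"region_id: (\d+)")
MARKER = "failed to send significant msg"


def failed_observer_regions(lines: Iterable[str]) -> Counter:
    """Count region IDs that appear on lines containing MARKER."""
    counts: Counter = Counter()
    for line in lines:
        if MARKER in line:
            counts.update(int(rid) for rid in REGION_ID.findall(line))
    return counts


# Lines copied from the log excerpt in this post.
SAMPLE_LINES = [
    '[2022/11/16 10:59:46.419 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg="CaptureChange { cmd: RegisterObserver { observe_id: ObserveID(2564262), region_id: 1255104, enabled: true }, region_epoch: conf_ver: 27247 version: 6891, callback: Callback::Read(…) }"]',
    '[2022/11/16 11:00:50.238 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg="CaptureChange { cmd: RegisterObserver { observe_id: ObserveID(2564263), region_id: 1247613, enabled: true }, region_epoch: conf_ver: 422 version: 7259, callback: Callback::Read(…) }"]',
    '[2022/11/16 11:06:15.951 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7629]',
]

for region_id in sorted(failed_observer_regions(SAMPLE_LINES)):
    print(region_id)
```

The resulting IDs are the ones worth looking up in PD, since they are the regions TiKV failed to register an observer on.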

| username: chnage | Original post link

There are still region_not_found errors at the moment. I’m not sure whether this is the cause.

| username: Min_Chen | Original post link

Hello, please provide the complete TiKV logs corresponding to the time when the ticdc error occurred.