TiCDC synchronization task has no errors, but downstream TSO does not advance

translator_bot · June 22, 2024, 11:48pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: ticdc同步任务无报错，同步下游tso不推进

| username: chnage

[TiDB Usage Environment] Production Environment
[TiDB Version] ticdc
[Reproduction Path] Operations performed that led to the issue
The TiCDC checkpointTs is not advancing. Pausing and resuming the sync task did not help, restarting the cdc component with tiup did not help, and deleting and recreating the sync task did not help either. The sync task itself reports no errors.
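
To judge how far a stuck checkpointTs has fallen behind, it can help to convert the TSO into wall-clock time. A TiDB TSO packs a physical timestamp in milliseconds into its high bits and an 18-bit logical counter into its low bits. The sketch below decodes that layout; the function names and the sample value are made up for illustration and do not come from this thread.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # the low 18 bits of a TSO hold the logical counter
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tso_to_datetime(tso: int) -> datetime:
    """Return the UTC wall-clock time encoded in the physical part of a TSO."""
    physical_ms = tso >> LOGICAL_BITS
    return EPOCH + timedelta(milliseconds=physical_ms)


def checkpoint_lag_seconds(checkpoint_ts: int, now: datetime) -> float:
    """Seconds between `now` and the time encoded in checkpoint_ts."""
    return (now - tso_to_datetime(checkpoint_ts)).total_seconds()


# Illustrative value only: a TSO built from a known physical timestamp.
sample_tso = (1_600_000_000_000 << LOGICAL_BITS) | 5
print(tso_to_datetime(sample_tso).isoformat())
```

If the lag computed from two successive readings of checkpointTs keeps growing at the same rate as real time, the checkpoint is not moving at all rather than merely moving slowly.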

[Problem Encountered: Symptoms and Impact]

![wecom-temp-438014f2f992dd70babfb317ba9711c3|690x149](upload://yXYwaNrxl4kpdqhxjYXtFoEaFc6.png)

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: xfworld | Original post link

Check the CDC logs to see what the specific error is.

| username: chnage | Original post link

[Screenshot attached in the original post; the image is not available.]

| username: xfworld | Original post link

Check whether any regions in the cluster are in an orphaned state… The cluster’s state needs to be restored before it can be used.

The log gives the region’s identifier, and you can use that identifier to look the region up.

| username: chnage | Original post link

It works.

| username: chnage | Original post link

It seems there is a leader.

| username: xfworld | Original post link

[Screenshot attached in the original post; the image is not available.]

| username: chnage | Original post link

There are two regions that were not found.

| username: xfworld | Original post link

That means the state is inconsistent, which is the reason why CDC is not working.

First, fix the cluster’s state. Check for region problems such as empty regions, missing peers, and so on. A thorough investigation is needed.
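
One way to run these region checks is through PD's HTTP API, which is also what pd-ctl talks to. The sketch below builds the URLs for looking up a single region by ID and for listing regions that fail a given check. The PD address is a placeholder, the helper names are made up for illustration, and the response layout is not assumed beyond being JSON, so inspect the output by hand.

```python
import json
from urllib.request import urlopen

PD_ADDR = "http://127.0.0.1:2379"  # placeholder: replace with a real PD endpoint

# Check names as accepted by `pd-ctl region check <name>`.
CHECK_TYPES = ("miss-peer", "extra-peer", "down-peer", "pending-peer", "empty-region")


def region_url(pd_addr: str, region_id: int) -> str:
    """URL that returns PD's view of one region, looked up by its ID."""
    return f"{pd_addr}/pd/api/v1/region/id/{region_id}"


def check_url(pd_addr: str, check_type: str) -> str:
    """URL that lists the regions currently failing the given check."""
    if check_type not in CHECK_TYPES:
        raise ValueError(f"unknown check type: {check_type!r}")
    return f"{pd_addr}/pd/api/v1/regions/check/{check_type}"


def fetch_json(url: str, timeout: float = 5.0):
    """GET a PD API URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_json(region_url(PD_ADDR, 1255104))` would ask PD about one of the regions named in the TiKV log later in this thread, and looping `fetch_json(check_url(PD_ADDR, c))` over `CHECK_TYPES` gives a quick overview of unhealthy regions.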

| username: chnage | Original post link

How can I fix this situation? Is there a reference document link?

| username: xfworld | Original post link

Refer to this:

I don’t know what operations the cluster has undergone to result in this situation.
You can only try to repair it as far as possible. If it turns out that a replica has failed rather than been lost, you can delete the failed replica manually. (Back up before performing this operation.)

| username: chnage | Original post link

The cluster is currently available, but the application does some concurrent writes and there are quite a few transaction conflicts. I’m not sure whether that has an impact.

| username: chnage | Original post link

Thank you, boss.

| username: chnage | Original post link

After taking a new backup and importing it, CDC synchronized some data. The TSO is again not advancing, but the error reported is different this time.

| username: xfworld | Original post link

Are the TiCDC and TiDB major versions compatible? It’s best to use the same version…

The error in the logs is basically PD-related: the checkpoint cannot be saved…

| username: chnage | Original post link

They are the same; both are 4.0.16.

| username: Min_Chen | Original post link

Hello, according to the logs, the upstream actively closed the connection. Please check the upstream TiKV logs.

| username: chnage | Original post link

[2022/11/16 10:58:09.759 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg=LeaderCallback(Callback::Read(…))]
[2022/11/16 10:59:46.419 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg="CaptureChange { cmd: RegisterObserver { observe_id: ObserveID(2564262), region_id: 1255104, enabled: true }, region_epoch: conf_ver: 27247 version: 6891, callback: Callback::Read(…) }"]
[2022/11/16 11:00:50.238 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg="CaptureChange { cmd: RegisterObserver { observe_id: ObserveID(2564263), region_id: 1247613, enabled: true }, region_epoch: conf_ver: 422 version: 7259, callback: Callback::Read(…) }"]
[2022/11/16 11:06:15.951 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7629]
[2022/11/16 11:06:16.083 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7631]
[2022/11/16 11:06:16.111 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7630]
[2022/11/16 11:13:01.586 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7944]
[2022/11/16 11:17:40.535 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=12664]
[2022/11/16 11:17:40.649 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=12655]
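
Since the earlier advice was to look regions up by the identifier printed in the log, a small script can pull the region IDs out of the "failed to send significant msg" entries. This is only a sketch: the regular expression assumes the `region_id: <number>` form seen in the lines above, and the sample data is copied from this excerpt.

```python
import re
from collections import Counter

REGION_ID_RE = re.compile(r"region_id: (\d+)")


def failed_region_ids(lines):
    """Count the region IDs named in 'failed to send significant msg' entries."""
    counts = Counter()
    for line in lines:
        if "failed to send significant msg" not in line:
            continue  # skip unrelated entries such as the scan-event errors
        for match in REGION_ID_RE.finditer(line):
            counts[int(match.group(1))] += 1
    return counts


# Three lines taken from the TiKV log excerpt in this thread.
SAMPLE_LINES = [
    '[2022/11/16 10:59:46.419 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg="CaptureChange { cmd: RegisterObserver { observe_id: ObserveID(2564262), region_id: 1255104, enabled: true }, region_epoch: conf_ver: 27247 version: 6891, callback: Callback::Read(…) }"]',
    '[2022/11/16 11:00:50.238 +08:00] [Error] [router.rs:174] ["failed to send significant msg"] [msg="CaptureChange { cmd: RegisterObserver { observe_id: ObserveID(2564263), region_id: 1247613, enabled: true }, region_epoch: conf_ver: 422 version: 7259, callback: Callback::Read(…) }"]',
    '[2022/11/16 11:06:15.951 +08:00] [Error] [endpoint.rs:1113] ["cdc send scan event failed"] [req_id=7629]',
]

for region_id, hits in sorted(failed_region_ids(SAMPLE_LINES).items()):
    print(region_id, hits)
```

Run against the full TiKV log, the resulting IDs are the ones worth checking in PD first.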

| username: chnage | Original post link

At present, there are still region_not_found issues. Not sure if this is the cause.

| username: Min_Chen | Original post link

Hello, please provide the complete TiKV logs corresponding to the time when the ticdc error occurred.