CDC Reports ErrRegionsNotCoverSpan Error and Downstream Write Exception

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: cdc报ErrRegionsNotCoverSpan错误下游写入异常

| username: h5n1

Version: v5.2.3 arm
Issue:
At around 14:00, CDC encountered an exception. Upon inspection, downstream writes were found to be 0, and the cdc cli reported the following error:

"code": "CDC:ErrRegionsNotCoverSpan",
"message": "[CDC:ErrReachMaxTry]reach maximum try: 100: [CDC:ErrRegionsNotCoverSpan]regions not completely left cover span, span [748000000000009fffa15f72800000001affb9c99b0000000000fa, 748000000000009fffa15f730000000000fa) regions: [id:1671509411 start_key:\"t\\200\\000\\000\\000\\000\\000\\237\\377\\241_r\\200\\000\\000\\000\\032\\377\\273\\013\\307\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\237\\377\\254_i\\200\\000\\000\\000\\000\\377\\000\\000\\001\\00100-0\\3772592\\377003\\377391_8\\3771_\\377010806\\3770\\3775681895\\377\\37725_81140\\377\\3770000032\\377_\\377010810\\37747\\377472_\\000\\377\\000\\000\\000\\373\\003\\200\\000\\000\\377\\000\\0014\\213\\345\\000\\000\\000\\374\" region_epoch:\u003cconf_ver:80829 version:132267 \u003e peers:\u003cid:1675591530 store_id:11107291 \u003e peers:\u003cid:1675593220 store_id:56722 \u003e peers:\u003cid:1675595947 store_id:11107657 \u003e ]"

Based on the error message, I found the following known issue: ScanRegions total retry time is too short · Issue #5230 · pingcap/tiflow (github.com). This issue might not be fixed in version 5.2, so I restarted the CDC, and it returned to normal. After a while, it resumed writing data downstream. Around 18:00, I checked again and found that downstream writes were intermittent. After two more restarts, it returned to normal, but after a while, it became abnormal again.
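
If the cluster is deployed with TiUP, restarting only the CDC components can be done as below (a sketch; <cluster-name> is a placeholder):

tiup cluster restart <cluster-name> -R cdc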

After the last two CDC restarts, checking the changefeed status still showed the same error.

After waiting for CDC to start writing data downstream, checking the CDC status still showed the above error. It seems that the downstream write issue is not related to this error.

Logs from node 146:
cdc.log.tar.gz (13.3 MB) cdc-2022-09-23T20-28-40.335.log.gz (18.3 MB)

Logs from node 151:
cdc-2022-09-23T13-36-18.561.log.gz (17.1 MB) cdc-2022-09-23T17-52-35.450.log.gz (19.2 MB)

Logs from node 152:
cdc-2022-09-23T17-32-17.523.log.gz (16.9 MB)

| username: jansu-dev | Original post link

I looked at that issue, and it seems that after a region split/merge, the region information cached in PD becomes outdated, so CDC repeatedly fails to get the latest region leader information from PD and reports this error once it reaches the retry threshold. Is this a test environment you are just experimenting with?

| username: jansu-dev | Original post link

  1. This issue does indeed cause replication interruptions.
  2. It seems that, as of version 4.0.16, the retry threshold had still not been increased.
  3. If this is just a test environment, I think you can follow the PR referenced in the issue, fix it yourself, build a binary, and give it a try (see the sketch after this list).
  4. Or simply upgrade to v6.1.0, as version 6.1 has fixed it.
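
A rough sketch of option 3, assuming a Go build environment, that the v5.2.3 tag still exists in the renamed tiflow repo, and that the make target matches the repo's Makefile (target names may differ between branches):

git clone https://github.com/pingcap/tiflow.git
cd tiflow
git checkout v5.2.3
# apply the fix from the PR referenced in issue #5230, then build the cdc binary
make cdc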

This should be the problem; the error message is quite clear (PS: I didn't look at the provided logs :rofl:).

| username: h5n1 | Original post link

After restarting last night, the downstream write was interrupted for a while and then recovered on its own. However, the CDC CLI still reports this error, and it is always the same region_id.

Is it feasible to use the following to clean up the region information cached in PD?

curl -X DELETE http://${HostIP}:2379/pd/api/v1/admin/cache/region/{region_id}
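
For example, with the region id from the error above filled in (${HostIP} being a PD endpoint; the assumption is that this only drops PD's cached copy of the region, which would then be rebuilt from the next TiKV heartbeat):

curl -X DELETE http://${HostIP}:2379/pd/api/v1/admin/cache/region/1671509411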

Another question: TiCDC pulls the change logs, while the regions are watched by the CDC component inside TiKV. Is this error reported to TiCDC by the TiKV-side CDC component, or does TiCDC itself actively request or check it?

| username: yilong | Original post link

  1. Why does the status of this region always show only voter peers? Check the history of region_id=1671509411 in tikv.log (a grep sketch follows this list). When did the problem first occur?
  2. Has it been unable to elect a leader all along?
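
A quick way to pull the history of that region out of the TiKV logs (a sketch; run on each TiKV node and adjust the log path as needed):

grep 1671509411 /path/to/tikv.log | grep -i -E 'leader|split|merge|conf_ver' | head -n 50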

| username: h5n1 | Original post link

In the output, the peers are all voters, and the leader information is shown below.

The error below should be a bug.

| username: 像风一样的男子 | Original post link

I upgraded to 5.4.3 and also encountered this error. How can I solve it?

| username: h5n1 | Original post link

The issue hasn’t been resolved yet, it’s just running like this for now.

| username: 像风一样的男子 | Original post link

My process is stuck, and the checkpoint is not progressing.

| username: h5n1 | Original post link

Restart the CDC.

| username: 像风一样的男子 | Original post link

I’ve restarted everything, deleted and recreated the tasks, restarted the cluster, upgraded the version from 4.0.9 to 5.4.3 and then to 6.1.1, but nothing worked.

| username: h5n1 | Original post link

Has this error been there the whole time? Check the problematic regions with pd-ctl region to see their status.
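
A sketch of that check, with <pd-ip> and <region_id> as placeholders and the ctl version matched to the cluster:

tiup ctl:v6.1.1 pd -u http://<pd-ip>:2379 region <region_id>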

| username: 像风一样的男子 | Original post link

I am using a single replica for this downstream cluster.

| username: h5n1 | Original post link

Try using the latest version 5.3.

| username: 像风一样的男子 | Original post link

After the version upgrade, the Kafka connection is being closed abnormally. What could be the reason for this?