CDC Sync Data to Kafka Terminated

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: cdc同步数据到kafka中止

| username: TiDBer_AG1WKKha

[TiDB Usage Environment] Production Environment
[TiDB Version] 4.0.6
[Reproduction Path] No path leading to the issue found
[Encountered Issue: Problem Phenomenon and Impact]
Issue: cdc cannot synchronize data, tidb 4.0.6, ticdc 4.0.14, kafka topic is normal;
Encountered Issue: Unable to synchronize data, tso also stops progressing, cdc restart, task rebuild also do not synchronize, unless the tables related to the timeout region or not leader region mentioned in the cdc log are excluded from the cdc task, synchronization resumes (the tidb cluster itself reads and writes data normally)
Statistics of related region_id:

1772637 table1

1012586 table1

633916 table1

404133 table2

411363 table2

1218954 table3

Details: When there are related region logs in the log, there are two phenomena, one is a large delay, and the other is that tso may not progress at all. Excluding the tables related to the region, tso can progress again, and data can be synchronized.
cdc log, part of the log is intercepted, most of it is repetitive:
[2022/12/16 06:20:32.259 +00:00] [INFO] [client.go:1347] [“region state entry will be replaced because received message of newer requestID”] [regionID=1772637] [oldRequestID=130] [requestID=131] [addr=ip1:20160]
[2022/12/16 06:20:31.793 +00:00] [INFO] [client.go:860] [“start new request”] [request=“{"header":{"cluster_id":6929472922985843426,"ticdc_version":"4.0.14"},"region_id":1772637,"region_epoch":{"conf_ver":38117,"version":380},"checkpoint_ts":438087496947204097,"start_key":"dIAAAAAAAAH/cV9yAAAAAAD6","end_key":"dIAAAAAAAAH/cV9ykAAAAAD/BLihAAAAAAD6","request_id":131,"extra_op":1}”] [addr=ip1:20160]
[2022/12/16 06:20:31.793 +00:00] [INFO] [region_range_lock.go:218] [“range locked”] [lockID=36] [regionID=1772637] [startKey=7480000000000001ff715f720000000000fa] [endKey=7480000000000001ff715f729000000000ff04b8a10000000000fa] [checkpointTs=438087496947204097]
[2022/12/16 06:20:31.790 +00:00] [INFO] [region_range_lock.go:370] [“unlocked range”] [lockID=36] [regionID=1772637] [startKey=7480000000000001ff715f720000000000fa] [endKey=7480000000000001ff715f729000000000ff04b8a10000000000fa] [checkpointTs=438087496947204097]
[2022/12/16 06:20:31.790 +00:00] [INFO] [client.go:1041] [“EventFeed disconnected”] [regionID=1772637] [requestID=130] [span=“[7480000000000001ff715f720000000000fa, 7480000000000001ff715f729000000000ff04b8a10000000000fa)”] [checkpoint=438087496947204097] [error=“[CDC:ErrEventFeedEventError]region_not_found:<region_id:1772637 > “]
[2022/12/16 06:20:31.790 +00:00] [INFO] [client.go:1347] [“region state entry will be replaced because received message of newer requestID”] [regionID=1772637] [oldRequestID=122] [requestID=130] [addr=ip1:20160]
[2022/12/16 06:20:31.787 +00:00] [INFO] [client.go:860] [“start new request”] [request=”{"header":{"cluster_id":6929472922985843426,"ticdc_version":"4.0.14"},"region_id":1772637,"region_epoch":{"conf_ver":38115,"version":380},"checkpoint_ts":438087496947204097,"start_key":"dIAAAAAAAAH/cV9yAAAAAAD6","end_key":"dIAAAAAAAAH/cV9ykAAAAAD/BLihAAAAAAD6","request_id":130,"extra_op":1}”] [addr=ip1:20160]
[2022/12/16 06:20:31.787 +00:00] [INFO] [region_range_lock.go:218] [“range locked”] [lockID=36] [regionID=1772637] [startKey=7480000000000001ff715f720000000000fa] [endKey=7480000000000001ff715f729000000000ff04b8a10000000000fa] [checkpointTs=438087496947204097]
[2022/12/16 06:20:31.787 +00:00] [INFO] [region_cache.go:829] [“switch region peer to next due to NotLeader with NULL leader”] [currIdx=0] [regionID=1772637]
[2022/12/16 06:20:31.787 +00:00] [INFO] [region_range_lock.go:370] [“unlocked range”] [lockID=36] [regionID=1772637] [startKey=7480000000000001ff715f720000000000fa] [endKey=7480000000000001ff715f729000000000ff04b8a10000000000fa] [checkpointTs=438087496947204097]
[2022/12/16 06:20:31.787 +00:00] [INFO] [client.go:1041] [“EventFeed disconnected”] [regionID=1772637] [requestID=129] [span=“[7480000000000001ff715f720000000000fa, 7480000000000001ff715f729000000000ff04b8a10000000000fa)”] [checkpoint=438087496947204097] [error=“[CDC:ErrEventFeedEventError]not_leader:<region_id:1772637 > “]
[2022/12/16 06:20:31.787 +00:00] [INFO] [client.go:1347] [“region state entry will be replaced because received message of newer requestID”] [regionID=1772637] [oldRequestID=123] [requestID=129] [addr=172.23.50.156:20160]
[2022/12/16 06:20:31.785 +00:00] [INFO] [client.go:860] [“start new request”] [request=”{"header":{"cluster_id":6929472922985843426,"ticdc_version":"4.0.14"},"region_id":1772637,"region_epoch":{"conf_ver":38115,"version":380},"checkpoint_ts":438087496947204097,"start_key":"dIAAAAAAAAH/cV9yAAAAAAD6","end_key":"dIAAAAAAAAH/cV9ykAAAAAD/BLihAAAAAAD6","request_id":129,"extra_op":1}”] [addr=172.23.50.156:20160]
[2022/12/16 06:20:31.784 +00:00] [INFO] [region_range_lock.go:218] [“range locked”] [lockID=36] [regionID=1772637] [startKey=7480000000000001ff715f720000000000fa] [endKey=7480000000000001ff715f729000000000ff04b8a10000000000fa] [checkpointTs=438087496947204097]
[2022/12/16 06:20:31.784 +00:00] [INFO] [region_cache.go:829] [“switch region peer to next due to NotLeader with NULL leader”] [currIdx=2] [regionID=1772637]
[2022/12/16 06:20:31.784 +00:00] [INFO] [region_range_lock.go:370] [“unlocked range”] [lockID=36] [regionID=1772637] [startKey=7480000000000001ff715f720000000000fa] [endKey=7480000000000001ff715f729000000000ff04b8a10000000000fa] [checkpointTs=438087496947204097]
[2022/12/16 06:20:31.784 +00:00] [INFO] [client.go:1041] [“EventFeed disconnected”] [regionID=1772637] [requestID=125] [span=“[7480000000000001ff715f720000000000fa, 7480000000000001ff715f729000000000ff04b8a10000000000fa)”] [checkpoint=438087496947204097] [error=“[CDC:ErrEventFeedEventError]not_leader:<region_id:1772637 > “]
[2022/12/16 06:20:30.561 +00:00] [INFO] [lock_resolver.go:122] [“resolve lock successfully”] [regionID=1012517] [maxVersion=438087613208068096]
[2022/12/16 06:20:30.558 +00:00] [WARN] [client.go:1531] [“region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock”] [regionID=1012517] [span=”[7480000000000001ffd25f720000000000fa, 7480000000000001ffd25f730000000000fa)”] [duration=39m55.35s] [lastEvent=39m55.000141463s] [resolvedTs=438086987902877697]
[2022/12/16 06:20:30.557 +00:00] [WARN] [client.go:1515] [“region not receiving event from tikv for too long time”] [regionID=1012517] [span=“[7480000000000001ffd25f720000000000fa, 7480000000000001ffd25f730000000000fa)”] [duration=39m55.000141463s]
[2022/12/16 06:20:28.982 +00:00] [INFO] [statistics.go:126] [“sink replication status”] [name=MQ] [changefeed=ticdc-dept-task] [capture=172.23.51.40:8300] [count=0] [qps=0]
[2022/12/16 06:20:28.880 +00:00] [INFO] [lock_resolver.go:122] [“resolve lock successfully”] [regionID=1772637] [maxVersion=438087612683780096]
[2022/12/16 06:20:28.877 +00:00] [WARN] [client.go:1531] [“region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock”] [regionID=1772637] [span=“[7480000000000001ff715f720000000000fa, 7480000000000001ff715f729000000000ff04b8a10000000000fa)”] [duration=7m31.5s] [lastEvent=3m18.901974009s] [resolvedTs=438087496947204097]
[2022/12/16 06:20:28.877 +00:00] [WARN] [client.go:1515] [“region not receiving event from tikv for too long time”] [regionID=1772637] [span=“[7480000000000001ff715f720000000000fa, 7480000000000001ff715f729000000000ff04b8a10000000000fa)”] [duration=3m18.901974009s]

[2022/12/16 07:06:05.911 +00:00] [INFO] [region_range_lock.go:218] [“range locked”] [lockID=36] [regionID=1772637] [startKey=7480000000000001ff715f720000000000fa] [endKey=7480000000000001ff715f729000000000ff04b8a10000000000fa] [checkpointTs=438088323251306497]
[2022/12/16 07:06:05.911 +00:00] [INFO] [client.go:860] [“start new request”] [request=“{"header":{"cluster_id":6929472922985843426,"ticdc_version":"4.0.14"},"region_id":1772637,"region_epoch":{"conf_ver":38135,"version":380},"checkpoint_ts":438088323251306497,"start_key":"dIAAAAAAAAH/cV9yAAAAAAD6","end_key":"dIAAAAAAAAH/cV9ykAAAAAD/BLihAAAAAAD6","request_id":198,"extra_op":1}”] [addr=172.23.50.154:20160]
[2022/12/16 07:06:05.914 +00:00] [INFO] [client.go:1347] [“region state entry will be replaced because received message of newer requestID”] [regionID=1772637] [oldRequestID=196] [requestID=198] [addr=172.23.50.154:20160]
[2022/12/16 07:06:05.914 +00:00] [INFO] [client.go:1041] [“EventFeed disconnected”] [regionID=1772637] [requestID=198] [span=“[7480000000000001ff715f720000000000fa, 7480000000000001ff715f729000000000ff04b8a10000000000fa)”] [checkpoint=438088323251306497] [error=“[CDC:ErrEventFeedEventError]not_leader:<region_id:1772637 leader:<id:2108428 store_id:6 > > “]
[2022/12/16 07:06:05.914 +00:00] [INFO] [region_range_lock.go:370] [“unlocked range”] [lockID=36] [regionID=1772637] [startKey=7480000000000001ff715f720000000000fa] [endKey=7480000000000001ff715f729000000000ff04b8a10000000000fa] [checkpointTs=438088323251306497]
[2022/12/16 07:06:05.914 +00:00] [INFO] [region_cache.go:842] [“switch region leader to specific leader due to kv return NotLeader”] [regionID=1772637] [currIdx=1] [leaderStoreID=6]
[2022/12/16 07:06:05.914 +00:00] [INFO] [region_range_lock.go:218] [“range locked”] [lockID=36] [regionID=1772637] [startKey=7480000000000001ff715f720000000000fa] [endKey=7480000000000001ff715f729000000000ff04b8a10000000000fa] [checkpointTs=438088323251306497]
[2022/12/16 07:06:05.914 +00:00] [INFO] [client.go:860] [“start new request”] [request=”{"header":{"cluster_id":6929472922985843426,"ticdc_version":"4.0.14"},"region_id":1772637,"region_epoch":{"conf_ver":38135,"version":380},"checkpoint_ts":438088323251306497,"start_key":"dIAAAAAAAAH/cV9yAAAAAAD6","end_key":"dIAAAAAAAAH/cV9ykAAAAAD/BLihAAAAAAD6","request_id":199,"extra_op":1}”] [addr=172.23.50.155:20160]
[2022/12/16 07:06:06.349 +00:00] [INFO] [client.go:1347] [“region state entry will be replaced because received message of newer requestID”] [regionID=1772637] [oldRequestID=181] [requestID=199] [addr=172.23.50.155:20160]
[2022/12/16 07:06:07.185 +00:00] [WARN] [client.go:1515] [“region not receiving event from tikv for too long time”] [regionID=633916] [span=“[7480000000000001ff715f72c000000000ff01f9ae0000000000fa, 7480000000000001ff715f730000000000fa)”] [duration=2m21.807617222s]
[2022/12/16 07:06:07.185 +00:00] [WARN] [client.go:1531] [“region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock”] [regionID=633916] [span=“[7480000000000001ff715f72c000000000ff01f9ae0000000000fa, 7480000000000001ff715f730000000000fa)”] [duration=4m25.35s] [lastEvent=2m21.807617222s] [resolvedTs=438088263233437697]
[2022/12/16 07:06:07.187 +00:00] [INFO] [lock_resolver.go:122] [“resolve lock successfully”] [regionID=633916] [maxVersion=438088330171908096]
[2022/12/16 07:06:10.089 +00:00] [INFO] [client.go:1041] [“EventFeed disconnected”] [regionID=633

| username: Min_Chen | Original post link

Hello,
Please follow the steps below to troubleshoot:

  1. Check the TiCDC logs for warnings containing the following keywords, and if there are specific region IDs and resolved timestamps printed, you can use pd-ctl tso to confirm the approximate time when it got stuck: “The resolvedTs is fallen back in kvclient” “region not receiving resolved event from tikv or resolved ts is not pushing for too long time, try to resolve lock”

  2. Check all TiKV logs based on the warning region ID, and look for the last register and deregister log information:
    grep 2256981 * | grep register

[2021/05/15 23:48:22.801 +08:00] [INFO] [endpoint.rs:484] [ "cdc register region" ] [downstream_id=DownstreamID(2020)] [req_id=87945] [conn_id=ConnID(98)] [region_id=2256981]

[2021/05/16 05:28:48.574 +08:00] [INFO] [endpoint.rs:351] [ "cdc deregister" ] [deregister= "Deregister { deregister: " region ", region_id: 2256981, observe_id: ObserveID(1887), err: Request(message: " peer is not leader for region 2256981, leader may None " not_leader { region_id: 2256981 }) }" ]

  1. Check the logs for the last schedule resolver of this region ID:
    grep 2256981 * | grep 'schedule resolver ready'

[2021/05/15 23:48:21.777 +08:00] [INFO] [endpoint.rs:1228] [ "cdc resolver initialized and schedule resolver ready" ] [takes=176ms] [observe_id=ObserveID(1887)] [lock_count=0] [resolved_ts=0] [downstream_id=DownstreamID(1887)] [conn_id=ConnID(36)] [region_id=2256981]

  1. If the last register time in step 2 is greater than the schedule resolver time in step 3, it indicates that TiKV is stuck.
| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.