Question: A single table has more than 5000 regions. How does TiCDC handle the not_leader error event when it occurs?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 求教:单表有5000多个region,在遇到not_leader事件时,TiCDC是如何处理该error事件的?

| username: 迷人的Ti

【TiDB Usage Environment】Production Environment
【TiDB Version】5.4.3
【Reproduction Path】For a single table with more than 5000 regions, the task sees about 6 error events per minute. As I understand it, TiCDC re-sends the CDC request for the region named in the error event, and the new request starts from the checkpoint position. If so, will the checkpoint be delayed indefinitely? The region that was just re-requested is still in its incremental scan phase, so its position cannot advance; then another error event arrives, and the overall task position keeps falling behind.

Therefore, I have the following questions:

  • How does TiCDC handle CDC error events?

  • If the task delay exceeds 24 hours but is still within the safepoint range, the incremental scan phase is particularly long, and there are many regions, how should this checkpoint delay be handled?
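The concern raised in this post can be illustrated with a toy model. This is only a sketch under a simplifying assumption, namely that a table's checkpoint cannot advance past the slowest of its regions; it is not TiCDC's actual code, and `RegionState`, `table_checkpoint`, and `resubscribe` are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class RegionState:
    region_id: int
    resolved_ts: int        # last timestamp this region has confirmed
    scanning: bool = False  # True while the incremental scan is running


def table_checkpoint(regions):
    """Simplified model: the table cannot advance past its slowest region."""
    return min(r.resolved_ts for r in regions)


def resubscribe(region, checkpoint):
    """Hypothetical error handling: restart the region from the checkpoint."""
    region.resolved_ts = checkpoint
    region.scanning = True


regions = [RegionState(i, resolved_ts=100) for i in range(5)]
checkpoint = table_checkpoint(regions)

# Region 3 hits an error (for example not_leader) and restarts from the checkpoint.
resubscribe(regions[3], checkpoint)

# The other regions keep advancing while region 3 is still scanning.
for r in regions:
    if not r.scanning:
        r.resolved_ts = 200

stalled = table_checkpoint(regions)  # held back by region 3

# Once region 3 finishes its scan and catches up, the checkpoint moves again.
regions[3].resolved_ts = 200
regions[3].scanning = False
recovered = table_checkpoint(regions)

print(stalled, recovered)
```

In this model, a steady stream of errors that each restart a different region before the previous one finishes scanning would keep the minimum pinned, which is the scenario the question describes.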

| username: 大飞哥online | Original post link

  1. To detect an error, TiCDC should be monitoring changes in the TiKV cluster, and the errors are then delivered to the TiCDC worker nodes.
  2. For network issues, it should wait or reconnect; for data issues, it should retry, or interrupt the task and record the error.
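The policy described in the two points above can be sketched roughly as follows. This is a generic illustration, not TiCDC's implementation: the error labels in `NETWORK_ERRORS`, the functions `backoff_seconds` and `decide`, and the retry limit are all assumptions made up for the example.

```python
import logging

logger = logging.getLogger("error_policy_sketch")

# Hypothetical labels for errors treated as network problems.
NETWORK_ERRORS = {"connection_lost", "timeout"}


def backoff_seconds(attempt, base=0.5, cap=30.0):
    """Exponential backoff: 0.5 s, 1 s, 2 s, ... capped at 30 s."""
    return min(cap, base * (2 ** attempt))


def decide(error_kind, attempt, max_attempts=3):
    """Return (action, wait_seconds) for an error seen on the given attempt."""
    if error_kind in NETWORK_ERRORS:
        # Network issue: wait, then reconnect.
        return ("reconnect", backoff_seconds(attempt))
    if attempt < max_attempts:
        # Data issue: retry immediately while attempts remain.
        return ("retry", 0.0)
    # Data issue that keeps failing: record the error and interrupt.
    logger.error("giving up on %s after %d attempts", error_kind, attempt)
    return ("interrupt", 0.0)


print(decide("timeout", 2))
print(decide("bad_row", 0))
print(decide("bad_row", 3))
```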

| username: 大飞哥online | Original post link

Give the TiCDC nodes more resources. Also check whether the TiKV regions are unevenly distributed, whether there are hotspots, and so on.

| username: 迷人的Ti | Original post link

Thank you for your reply. I have found that the incremental scan phase is particularly long because TiKV is slow to output data. Is there any way to speed this up?

How I identified this: I ran a test that only records events. For over 2000 regions, it took about 20 minutes from the start of the subscription to the first resolveTs. The CDC change volume was about 5 million, and no other business logic was involved, so I concluded that TiKV's output is slow.
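For a sense of scale, the figures quoted above can be turned into rough rates. This is back-of-the-envelope arithmetic only: it takes the approximate numbers from the post at face value and assumes all of the roughly 5 million changes were delivered within the 20-minute window, which the post does not state explicitly.

```python
regions = 2000        # "over 2000 regions", so a lower bound
changes = 5_000_000   # "about 5 million" changes
elapsed_s = 20 * 60   # "about 20 minutes", in seconds

changes_per_second = changes / elapsed_s   # about 4167 changes/s overall
changes_per_region = changes / regions     # 2500 changes per region on average
# Only meaningful if regions were scanned strictly one at a time:
seconds_per_region = elapsed_s / regions   # 0.6 s per region

print(f"{changes_per_second:.0f} changes/s")
print(f"{changes_per_region:.0f} changes per region")
print(f"{seconds_per_region:.1f} s per region if scanned sequentially")
```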