Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: Manually deleting data causes TiCDC delay issues. I still don’t quite understand the exact logic behind this. Could someone explain it to me?
As the title says, my fundamentals are weak and I don’t have time to read through the documentation, so I’m hoping someone can explain it. Thank you~
I want to understand the principles.
It’s quite simple: a DELETE statement changes data in TiKV, and TiCDC captures those data change events from TiKV.
If a large number of rows is deleted, say 500,000 rows, which is quite common,
then TiCDC also has to replay 500,000 row-change events to the downstream. At that point, two issues can arise:
- Whether the resources allocated to the TiCDC changefeed are sufficient to handle this volume.
- Whether the downstream service that TiCDC writes to has enough resources to handle it.
If not, TiCDC will experience replication delay…
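As a rough illustration (the table and column names here are hypothetical), a single statement like the one below deletes every matching row in one transaction, and TiCDC then has to emit a delete event for each of those rows to the downstream:

```sql
-- Hypothetical table and column names, just to illustrate the scale:
-- one statement, one transaction. If it matches 500,000 rows, TiCDC must
-- replicate roughly 500,000 row-change (delete) events to the downstream sink.
DELETE FROM orders WHERE created_at < '2023-01-01';
```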
Is it just that the processing capacity of the TiCDC nodes is insufficient?
Does it have anything to do with GC or compaction?
If a single transaction processes a huge amount of data, won’t other transactions have to wait? If the upstream runs this kind of operation, it will also cause DM replication delays or replication errors.
GC and compaction are operations on TiKV…
Both GC and compaction consume TiKV resources.
Large transactions should also be taken into account… they consume a lot of memory…
I recall that a newer version added optimizations for large statements, allowing them to be split and executed in batches, though I forget which version it was.
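This may refer to TiDB’s non-transactional DML (the `BATCH` statement), which splits one large DML statement into many small transactions; whether it is available depends on your version, so treat the following as a sketch with hypothetical table and column names:

```sql
-- Split one large delete into many small transactions, sharded on the id
-- column, 10,000 rows per batch. Each batch commits on its own, so no single
-- huge transaction hits TiKV, and TiCDC sees many small bursts of events
-- instead of one giant one.
-- Note: the statement as a whole is not atomic; a failure can leave it
-- partially applied.
BATCH ON id LIMIT 10000
DELETE FROM orders WHERE created_at < '2023-01-01';
```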
DELETE is a DML statement. If you want to remove the entire table, a DDL statement such as DROP is faster.
“TRUNCATE TABLE would be faster. DROP removes the whole table, which is a different thing…”
If TRUNCATE is used, does TiCDC just send a single TRUNCATE DDL downstream, without any row data?
Batch deletion of historical data really is a big headache. And if partitioned tables are used, you run into the headache of extremely long ANALYZE times.
Deleting a little bit each day is the most reasonable approach.
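One common way to do this (table, column, and chunk size below are hypothetical) is to delete in small chunks from a daily job, keeping each transaction short so TiCDC only sees a modest burst of events at a time:

```sql
-- Run this repeatedly (e.g. from a daily cron job or a small script) until
-- it reports 0 affected rows. Each execution is a small, short-lived
-- transaction, so neither TiKV nor the TiCDC changefeed gets a huge spike.
DELETE FROM orders
WHERE created_at < DATE_SUB(NOW(), INTERVAL 90 DAY)
LIMIT 10000;
```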
Indeed, DDL is faster than DML when it comes to deleting data.
This is indeed quite troublesome. Partitioned tables solve the problem of deleting large amounts of historical data, but they have their own issues, so it’s hard to strike a balance.
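For comparison, with a range-partitioned table, expiring old history becomes a single DDL statement instead of millions of row deletes; a minimal sketch with hypothetical names:

```sql
-- Range-partition by day so that old data can be dropped per partition.
CREATE TABLE orders_part (
    id         BIGINT   NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p20240101 VALUES LESS THAN (TO_DAYS('2024-01-02')),
    PARTITION p20240102 VALUES LESS THAN (TO_DAYS('2024-01-03')),
    PARTITION pmax      VALUES LESS THAN (MAXVALUE)
);

-- Expiring one day of history is then a single DDL statement,
-- not a row-by-row DELETE.
ALTER TABLE orders_part DROP PARTITION p20240101;
```

The trade-off, as noted above, is the overhead partitioned tables bring elsewhere, such as the long ANALYZE times mentioned earlier.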
Does this mean that after each day’s incremental data is imported, a portion of the old data is deleted in batches?
Yes, taking it slowly has a smaller impact on cluster stability.