Cluster Upgrade: CDC Interrupted for 15 Minutes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群升级cdc中断15分钟

| username: jaybing926

[TiDB Usage Environment] Production Environment
[TiDB Version]
v5.4.3 → v7.1.5
[Reproduction Path] What operations were performed when the issue occurred
Upgraded the cluster from v5.4.3 to v7.1.5. During the upgrade, CDC data was interrupted for 15 minutes.

[Resource Configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

CDC logs started reporting errors

The logs for the affected time range appear to consist entirely of these kinds of errors. The log file is too large to upload.

| username: WalterWj | Original post link

The first log entry is probably from the PD leader switch, and the subsequent entries are probably related to TiKV leaders being transferred during the rolling restart.

CDC needs to scan the TiKV raft logs, so it’s understandable that region leader switches and TiKV restarts could affect CDC tasks.

| username: jaybing926 | Original post link

Isn’t the leader supposed to be transferred proactively when a node goes offline normally? As I understand it, PD should know the current leader information, so why does CDC keep reporting the errors above? Even if the leader isn’t found on the first attempt, it should be found eventually, right? The issue is that the CDC output dropped straight to zero, which isn’t normal, right?

| username: WalterWj | Original post link

A proactive TiKV leader transfer does not necessarily mean that PD’s information is up to date. By default, TiKV reports leader information to PD once every minute. When an error like this occurs, TiKV is prompted to report proactively.

What do you mean by the CDC data dropping straight to 0? Could it be that the CDC CLI status was not updated at that time, so the query results were misleading and replication was actually still running?

| username: jaybing926 | Original post link

The actual data is 0. TiCDC writes data to the downstream Kafka, and monitoring of the Kafka topic shows that the producer traffic for this topic dropped to 0.
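As a rough illustration of that check, the sketch below finds periods in which a topic's end offset did not advance. It assumes the offsets have already been sampled by some monitoring tool as `(timestamp_seconds, end_offset)` pairs. The function name `find_stalls` and the sample values are hypothetical and are not taken from this cluster.

```python
from typing import List, Tuple


def find_stalls(samples: List[Tuple[int, int]], min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) timestamps where the end offset stayed flat
    for at least min_gap seconds. samples must be sorted by timestamp."""
    stalls = []
    stall_start = None
    for (t_prev, off_prev), (t_cur, off_cur) in zip(samples, samples[1:]):
        if off_cur == off_prev:
            # Offset did not move between two samples: a stall begins or continues.
            if stall_start is None:
                stall_start = t_prev
        else:
            # Offset moved again: close the stall if it lasted long enough.
            if stall_start is not None and t_prev - stall_start >= min_gap:
                stalls.append((stall_start, t_prev))
            stall_start = None
    # A stall that is still open at the last sample.
    if stall_start is not None and samples[-1][0] - stall_start >= min_gap:
        stalls.append((stall_start, samples[-1][0]))
    return stalls


# Hypothetical samples taken once a minute.
samples = [(0, 100), (60, 200), (120, 200), (180, 200), (240, 300)]
print(find_stalls(samples, min_gap=60))  # [(60, 180)]
```

A flat end offset only shows that nothing was produced to the topic. It does not by itself say whether the changefeed was stopped or merely lagging.
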

| username: WalterWj | Original post link

Grafana → CDC’s dashboard → TiKV’s Row → Three Initial scan panels

If a scan was the cause, these three panels should show activity around 14:35.

If a scan was running, replication is expected to stall while it runs.

| username: jaybing926 | Original post link

[Image not available]

| username: WalterWj | Original post link

Does it match the peak time?

| username: WalterWj | Original post link

Judging from the monitoring, the online upgrade behaved as expected.

  1. When upgrading from v5.4.3 to v7.1.5, CDC does not support rolling upgrades. Replication stops until all CDC nodes are upgraded, then it resumes.
  2. After the upgrade is completed, CDC needs time to perform incremental scans and catch up on the data it missed. That is why there is a short period of increased data writes after 14:30.

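The catch-up burst in point 2 can be estimated with simple arithmetic. This is a minimal sketch: the function name `catch_up_seconds` and all the rates are hypothetical, and it assumes constant upstream and replay rates, which a real cluster will not have.

```python
def catch_up_seconds(pause_seconds: float, write_rate: float, replay_rate: float) -> float:
    """Estimate how long it takes to drain the backlog built up during a pause.

    write_rate:  upstream change rate in rows/s (hypothetical)
    replay_rate: rate at which changes are emitted once replication resumes, in rows/s (hypothetical)
    """
    if replay_rate <= write_rate:
        raise ValueError("replay_rate must exceed write_rate, or the backlog never drains")
    backlog = pause_seconds * write_rate
    # New changes keep arriving while the backlog drains, so only the surplus rate reduces it.
    return backlog / (replay_rate - write_rate)


# 15-minute pause, 1,000 rows/s upstream, 4,000 rows/s replay (all hypothetical numbers).
print(catch_up_seconds(15 * 60, 1_000, 4_000))  # 300.0
```

With these made-up numbers, the downstream would see about five minutes of elevated writes before returning to the normal rate.
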
| username: jaybing926 | Original post link

Awesome, boss! :kissing_heart::kissing_heart::kissing_heart:

| username: WinterLiu | Original post link

Awesome!

| username: 呢莫不爱吃鱼 | Original post link

Learning from the expert~!