SOS: CDC Needs Attention

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: SOS CDC 望重视

| username: porpoiselxj

We upgraded from version 4.0 all the way to 7.1.1, and overall the core database functions are basically fine. We hit quite a few bugs but always managed to find workarounds, and with the features added in newer versions, management and day-to-day use have become smoother. The only real headache is CDC: along the way, CDC bugs have repeatedly forced us into passive upgrades. To summarize:

  1. Various stalls with no feedback: the logs are flooded with INFO entries but contain no error messages.
  2. The errors that do get displayed are often unrelated, and sometimes they only become visible after pausing the changefeed, making problems hard to pinpoint.
  3. Bulk refreshes of historical data (e.g., changing the precision of a decimal column used to trigger a full historical refresh; this has been fixed in newer versions).
  4. Cross-changefeed impact: when one changefeed gets stuck, the others are affected as well.
  5. After pausing for whatever reason, a changefeed cannot catch up, gets stuck, and affects other changefeeds that were running normally.
  6. CDC causing PD to restart repeatedly.

An excellent database should, beyond its core functions, provide stable incremental log extraction for business systems; Oracle's OGG and MySQL's binlog, for example, are very stable. We haven't tried TiDB Binlog because the official documentation clearly states it is no longer maintained and recommends CDC instead. Our business data lives on TiDB, but CDC is unstable and constantly gets stuck; often the only way to handle it is to skip data, which is completely unacceptable in a production environment. So even though TiDB has been in production for nearly two years, we still dare not use it extensively, and cluster-level hot backup cannot be achieved.

Today another issue arose. An abnormal node in the downstream Kafka cluster caused CDC to fail. After Kafka recovered (roughly an hour of interruption), resuming CDC did send incremental data to Kafka; spot checks of several tables showed the data had caught up to the latest state (we're not sure whether every table was up to date), but CDC's TSO refused to advance. Restarting the changefeed and the CDC processes didn't help; the TSO simply would not move. I thought creating a new changefeed might work, so I created one starting from the previous TSO, but the result was the same: Kafka kept receiving data, yet the TSO would not advance. Crucially, the changefeed status showed as normal and there were no error messages in the logs, leaving us at a loss…
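The recovery attempts described above correspond roughly to the following `cdc cli` operations. This is only a sketch: the server address, changefeed IDs, Kafka sink URI, and start TSO below are placeholders, not our actual values.

```shell
# Sketch of the recovery steps described above; the server address, changefeed
# IDs, Kafka URI, and start TSO are placeholders.

# Check where the changefeed's checkpoint TSO is stuck:
tiup cdc cli changefeed query --server=http://127.0.0.1:8300 --changefeed-id=kafka-feed

# Pause and resume the changefeed (this did not help in our case):
tiup cdc cli changefeed pause  --server=http://127.0.0.1:8300 --changefeed-id=kafka-feed
tiup cdc cli changefeed resume --server=http://127.0.0.1:8300 --changefeed-id=kafka-feed

# Recreate a changefeed starting from the previous checkpoint TSO so no data is skipped:
tiup cdc cli changefeed create --server=http://127.0.0.1:8300 \
  --changefeed-id=kafka-feed-new \
  --sink-uri="kafka://10.0.0.5:9092/cdc-topic?protocol=canal-json" \
  --start-ts=<previous-checkpoint-tso>
```

The `changefeed query` output includes the checkpoint TSO and checkpoint time, which is how we confirmed that the TSO was not advancing even though data kept arriving in Kafka.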

| username: Billmay表妹 | Original post link

Do you have corresponding issue links for the bugs you mentioned?

If these are bugs, they can only be resolved case by case; describing them in such general terms is too vague.

The R&D team can't take action on that.

It is recommended to follow the product defect template and report each issue individually, ideally with reproducible steps for the bugs you encounter.

Bug Feedback
Clearly and accurately describe the issue you found. Providing any steps to reproduce the problem will help the R&D team address the issue promptly.
[TiDB Version]
[Impact of the Bug]

[Possible Steps to Reproduce the Issue]

[Observed Unexpected Behavior]

[Expected Behavior]

[Related Components and Specific Versions]

[Other Background Information or Screenshots]

| username: porpoiselxj | Original post link

To be honest, all we can do is describe the various issues we are currently running into. As for how to reproduce them, we honestly can't say, and we don't have the means to try: we can't deliberately break Kafka in the production environment just to reproduce a bug, and the test environment differs so much from production that the issues are hard to reproduce there. A while ago we hit an issue that caused PD to keep restarting, and we posted about it:

| username: Billmay表妹 | Original post link

This has been fixed in the new version, but for the other issues you mentioned, there are no details to work with!

| username: wakaka | Original post link

How about setting up a similar test environment to reproduce it?

| username: Jellybean | Original post link

May I ask what QPS your CDC synchronization to the downstream is running at? If it can't keep up, there is usually a performance bottleneck that can be optimized accordingly.

Provide more information, and we can help you analyze it.

| username: dba远航 | Original post link

We also run into all kinds of bugs and issues when using domestic databases. You'll get used to it eventually; we got a late start, after all.

| username: 像风一样的男子 | Original post link

If you want to use TiDB, it's recommended to go with the enterprise edition; the vendor will back you up on all kinds of issues.

| username: porpoiselxj | Original post link

The downstream is Kafka directly. It's not a downstream problem; it's a problem with CDC's own incremental log extraction. Say I replicate 1,000 tables: you might find that about 600 of them show normal extraction progress, while the remaining tables get stuck for no apparent reason, and the whole CDC TSO cannot advance.

| username: Jellybean | Original post link

There is a TiCDC dashboard in Grafana. Check the key monitoring metrics there to see the running status of the internal flow operators. The panel roughly includes the following:
[TiCDC Grafana dashboard screenshot]

| username: porpoiselxj | Original post link

We upgraded TiDB to v7.1.3 and changed the CDC component from the previous mixed deployment with TiDB & PD to an independent deployment; it's basically okay now.
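For anyone making a similar change, separating TiCDC onto dedicated nodes can be done with a tiup scale-out followed by a scale-in. A minimal sketch, assuming placeholder hosts, directories, and cluster name:

```shell
# Minimal sketch of moving TiCDC to dedicated nodes with tiup
# (cluster name, hosts, and directories are placeholders).
cat > scale-out-cdc.yaml <<'EOF'
cdc_servers:
  - host: 10.0.0.11
    port: 8300
    data_dir: /data/cdc-data
  - host: 10.0.0.12
    port: 8300
    data_dir: /data/cdc-data
EOF

# Add the dedicated TiCDC nodes:
tiup cluster scale-out <cluster-name> scale-out-cdc.yaml

# Once the new capture nodes are Up, remove the CDC instances that were
# co-located with TiDB/PD:
tiup cluster scale-in <cluster-name> --node <old-mixed-host>:8300
```

Running TiCDC on its own nodes keeps it from competing with TiDB/PD for resources, which is presumably why the independent deployment behaves better.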

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.