Is there a way to downgrade? For example, from 6.5.3 to 6.5.1

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 请问有办法降级嘛?比如从6.5.3 降级到6.5.1

| username: johnwa-CD

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version] 6.5.3. Can it be downgraded? It has caused too many issues.
[Reproduction Path] What operations were performed that caused the issue
[Encountered Issues: Issue Symptoms and Impact]
[Resource Configuration]
[Attachments: Screenshots / Logs / Monitoring]

| username: xfworld | Original post link

Rolling back carries risks; it’s best to back up the data, then create a new cluster and import the data.

If it’s a test environment, you can try rolling back manually, but the success rate won’t be very high…
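The "back up, rebuild, import" route xfworld describes can be sketched with TiDB's standard tooling. This is a dry-run illustration, not a tested procedure: the PD address and output paths are placeholders, and note that BR generally restores into a cluster of the same or a newer version, so a logical export with Dumpling is usually the tool for moving data to an older cluster.

```shell
#!/bin/sh
# Dry-run sketch of the backup options before rebuilding a cluster.
# Placeholder values: PD address, ports, and output paths are examples.
run() { echo "would run: $*"; }   # swap for direct execution once verified

# Physical backup with BR (restore target must not be older than the backup source):
run tiup br backup full --pd '127.0.0.1:2379' --storage 'local:///backups/full'

# Logical export with Dumpling (suitable for importing into an older cluster):
run tiup dumpling -u root -P 4000 -h 127.0.0.1 -o /backups/dump --filetype sql
```

The `run` wrapper only prints the commands; remove it (or redefine it to execute) after checking the flags against your tiup version.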

| username: johnwa-CD | Original post link

In production it’s a real headache. The issues on 6.5.3 are quite serious. It’s really critical now.

| username: zhanggame1 | Original post link

How about upgrading to 7.1 and giving it a try?

| username: 人如其名 | Original post link

What is the specific issue? A single sentence isn’t enough to go on. If it’s a performance problem, take the most typical statement and see where it’s slow.

| username: redgame | Original post link

Equivalent to reinstalling the previous version of TiDB and restoring the data.

| username: 像风一样的男子 | Original post link

Are there any issues with version 6.5.3? Could you please elaborate?

| username: xfworld | Original post link

Version 6.5.3 fixed many bugs. What scenarios caused these issues?

| username: johnwa-CD | Original post link

After the cluster upgrade completed, two nodes kept losing leaders, leaving the cluster unstable, and QPS dropped to less than a fifth of the original. A large number of statements timed out; it was not a problem with any single statement. Initially, TiFlash reported errors and was unusable; with the support of forum members, the TiFlash node was reinstalled and normal TiFlash access was restored. Now the cluster is extremely unstable, and any SQL statement may hang.

| username: johnwa-CD | Original post link

This is production. I thought 6.5.3 was already a stable bug-fix release, but after upgrading the problems have been very serious, and we are being driven to the brink.

| username: johnwa-CD | Original post link

It is definitely not an issue with a specific SQL query. My cluster originally had a QPS of around 10,000, but now it can’t even handle 2,000.

| username: johnwa-CD | Original post link

You can refer to my TiFlash issue here:
Using TiFlash to execute SQL statements - TiDB Technical Issues - TiDB Q&A Community (

And now CDC is crashing in various ways, it’s completely unusable.

| username: xfworld | Original post link

There are two directions:

  1. Find a new environment (using the old cluster version), move the data back, and switch it.
  2. Organize the current issues, focusing on identifying the critical ones, and pass them up so everyone can help you review them.
  • The leader is gone? What exactly does that mean?
  • TiFlash in 6.x is much more stable than in 5.x.
  • The same goes for CDC; it has even been restructured and is much more stable than before…

If you expect a quicker resolution, you still need to organize the background and the key issues properly (even though you are in a hurry, sharpening the axe does not delay the chopping of firewood, while squeezing out details bit by bit will only increase the time cost).

| username: johnwa-CD | Original post link

  1. My cluster is relatively large and cannot be migrated in a short time: approximately 20 TB in total, with 6 TiKV nodes and 2 TiFlash nodes.
  2. The key issues:
     1) On one TiKV node, the leader count suddenly drops from 120K to 0, then slowly recovers to balance, but it stays stable for no more than five minutes before dropping again. This cycle repeats continuously.
     2) I upgraded from 6.1.5 to 6.5.3 to resolve some issues in 6.1 (specifically the low throughput of CDC; I saw that 6.5 improved it from 4,000/s to 35,000/s, which convinced me to upgrade).
     3) CDC is practically unusable, similar to what other forum members have encountered. See others’ feedback:
        TICDC distributing data to one table per topic often inexplicably freezes, with no new data being written - Suggestions / Product Defects - TiDB Q&A Community ( No one seems able to provide a clear explanation.

| username: johnwa-CD | Original post link

For example, in such a situation, can anyone help me see what’s going on…
I’ve been struggling with this for two or three days, and the entire cluster is on the verge of collapse, basically unusable.

| username: xfworld | Original post link

There is a known issue with CDC: at startup, and periodically thereafter, it collects status and metadata from Kafka for CDC performance and processing statistics.

However, this issue does not mention when it will be fixed…

There is a workaround, but it requires modifying the source code and building a patched binary to skip this step.

This issue is uncertain, and it might not be the one you encountered.

If the region leaders are not well balanced, you can follow the troubleshooting guide and check step by step.
Fundamentally it is a timeout issue: a follower decides the original leader is dead and initiates an election to choose a new leader.

You can refer to the following for specific operations:
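As a reference point for that timeout chain, these are the TiKV raftstore parameters that govern heartbeats and elections. The values shown are the documented defaults for recent TiKV versions, not tuning advice; verify them against your own cluster (e.g. via `tiup cluster edit-config` or `SHOW CONFIG`) before changing anything.

```toml
# TiKV raftstore timing parameters involved in leader elections
# (documented defaults shown; this is an illustrative fragment, not a recommendation)
[raftstore]
raft-base-tick-interval = "1s"       # length of one raft "tick"
raft-heartbeat-ticks = 2             # leader heartbeats every ~2s
raft-election-timeout-ticks = 10     # a follower starts an election after ~10s of silence
raft-store-max-leader-lease = "9s"   # leader lease; kept shorter than the election timeout
```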

| username: xfworld | Original post link

Are there any hotspot issues? Those can also lead to abnormal behavior…

It is recommended to first disable CDC and TiFlash, ensure the stability of TiKV, TiDB, and PD, and then enable them one by one…
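The "disable, stabilize, re-enable one by one" procedure can be sketched with tiup's role filter. This is a dry-run illustration with a placeholder cluster name; check the flags against your tiup version, and remember that stopping TiFlash will push its queries onto TiKV while it is down.

```shell
#!/bin/sh
# Dry-run sketch: stop the peripheral roles, watch the core, re-enable one at a time.
# CLUSTER is a placeholder name; tiup role names (cdc, tiflash) are the topology roles.
CLUSTER="${CLUSTER:-mycluster}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually execute

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# 1. Stop the peripheral roles.
run tiup cluster stop "$CLUSTER" -R cdc -y
run tiup cluster stop "$CLUSTER" -R tiflash -y

# 2. Observe TiKV/TiDB/PD for a while (leader counts, QPS) before re-enabling anything.
run tiup cluster display "$CLUSTER"

# 3. Bring roles back one at a time, checking stability between steps.
run tiup cluster start "$CLUSTER" -R tiflash -y
run tiup cluster start "$CLUSTER" -R cdc -y
```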

| username: johnwa-CD | Original post link

I currently have 6 TiKV nodes; four are stable, and 2 experience significant leader drops. I have configured those two not to hold leaders, so the cluster can barely run. CDC and TiFlash cannot be turned off, especially TiFlash, because business queries rely on TiFlash for acceleration; if TiFlash were turned off, a large number of big queries would fall on TiKV, with durations of over 5 minutes. There don’t seem to be any obvious hotspot issues. Are there any logs worth investigating? For example, I see something like this:

2023-06-23 19:04:30 (UTC+08:00)

[] ["get region failed"] [err="failed to load full sst meta from disk for [242, 148, 145, 51, 128, 141, 72, 6, 165, 188, 254, 164, 106, 120, 179, 243] and there isn't extra information provided: Engine Engine(Status { code: IoError, sub_code: None, sev: NoError, state: \"Corruption: file is too short (0 bytes) to be an sstable: /data/tidb-data/tikv-20160/import/f2949133-808d-4806-a5bc-fea46a78b3f3_4080615702_42367_145285_write.sst\" })"]

| username: johnwa-CD | Original post link

On the TiKV monitoring panel, the most obvious anomaly is a significant increase in Raft store CPU usage.

| username: zhanggame1 | Original post link

The log indicates that the file /data/tidb-data/tikv-20160/import/f2949133-808d-4806-a5bc-fea46a78b3f3_4080615702_42367_145285_write.sst is 0 bytes. Check whether that is actually the case, as it suggests some data stored on this TiKV node may have been lost.
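The check zhanggame1 suggests can be done mechanically: scan the TiKV import directory for zero-byte `.sst` files. A minimal sketch, assuming the directory layout from the log above; run it on each TiKV node, and do not delete anything without a backup.

```shell
#!/bin/sh
# Sketch: list zero-byte .sst files under a TiKV import directory, e.g.
#   scan_import_dir /data/tidb-data/tikv-20160/import
# (path taken from the log above). Inspect, don't delete, without a backup.
scan_import_dir() {
  # -size 0 matches empty files; ls -l shows size and mtime for each suspect.
  find "$1" -name '*.sst' -size 0 -exec ls -l {} \;
}

# Example against the current directory (harmless if no .sst files exist here):
scan_import_dir .
```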