Is there a way to downgrade? For example, from 6.5.3 to 6.5.1

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 请问有办法降级嘛?比如从6.5.3 降级到6.5.1

| username: johnwa-CD

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version] 6.5.3 Can it be downgraded? Too many issues.
[Reproduction Path] What operations were performed that caused the issue
[Encountered Issues: Issue Symptoms and Impact]
[Resource Configuration]
[Attachments: Screenshots / Logs / Monitoring]

| username: xfworld | Original post link

Rolling back carries risks; it’s best to back up the data, then create a new cluster and import the data.

If it’s a test environment, you can try rolling back manually, but the success rate won’t be very high…
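
If you go the backup-and-reimport route, a minimal sketch with BR via tiup might look like the following; the PD addresses and backup path are placeholders, and whether a backup taken on 6.5.3 can be restored into an older-version cluster should be verified against the BR compatibility notes first:

```bash
# Full backup of the current cluster (placeholder PD address and storage path).
tiup br backup full \
  --pd "10.0.1.10:2379" \
  --storage "local:///backup/full-20230623"

# Restore into a freshly deployed cluster running the target (older) version.
tiup br restore full \
  --pd "10.0.1.20:2379" \
  --storage "local:///backup/full-20230623"
```

A logical export with Dumpling plus an import with Lightning is slower, but it sidesteps most version-compatibility questions when moving back to an older release.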

| username: johnwa-CD | Original post link

In the production environment, it’s a bit of a headache. This 6.5.3 issue is quite serious. It’s really critical now.

| username: zhanggame1 | Original post link

How about upgrading to 7.1 and giving it a try?
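
For reference, if you do try the newer release rather than downgrading, the upgrade itself is the usual tiup flow; the cluster name here is a placeholder:

```bash
# Pre-flight check of an existing cluster before upgrading (placeholder cluster name).
tiup cluster check tidb-prod --cluster

# Rolling upgrade to the target version.
tiup cluster upgrade tidb-prod v7.1.0
```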

| username: 人如其名 | Original post link

What is the specific issue? A single sentence isn't enough to go on. If it's a performance problem, take the most typical statement and look at where it is slow.

| username: redgame | Original post link

Equivalent to reinstalling the previous version of TiDB and restoring the data.

| username: 像风一样的男子 | Original post link

Are there any issues with version 6.5.3? Could you please elaborate?

| username: xfworld | Original post link

Version 6.5.3 fixed many bugs. What scenarios caused these issues?

| username: johnwa-CD | Original post link

After the cluster upgrade was successful, two nodes kept losing leaders, causing the cluster to remain unstable, and the QPS dropped to less than 1/5 of the original. A large number of execution statements timed out, and it was not an issue with a single statement. Initially, TiFlash reported an error and was unusable. With the support of forum members, the TiFlash node was reinstalled, and normal TiFlash access was restored. Now the cluster is extremely unstable, and any SQL statement may get stuck.

| username: johnwa-CD | Original post link

Production. I thought 6.5.3 was a version where these things had been fixed, but after upgrading the problems have been very serious; it's driving me to the point of collapse.

| username: johnwa-CD | Original post link

It is definitely not an issue with a specific SQL query. My cluster originally had a QPS of around 10,000, but now it can’t even handle 2,000.

| username: johnwa-CD | Original post link

You can refer to my TiFlash issue here:
Using TiFlash to execute SQL statements - TiDB Technical Issues - TiDB Q&A Community (asktug.com)

And now CDC is crashing in all sorts of ways; it's completely unusable.

| username: xfworld | Original post link

There are two directions:

  1. Find a new environment (using the old cluster version), move the data back, and switch it.
  2. Organize the current issues, focusing on identifying the critical ones, and pass them up so everyone can help you review them.
  • Leader is gone? What does that mean?
  • TiFlash is much more stable in versions after 6.X compared to 5.X.
  • CDC is the same, it has even been restructured… much more stable than before…

If you expect a quicker resolution, you still need to organize the background and the key issues properly (even though you are in a hurry, time spent sharpening the axe doesn't delay the firewood chopping, whereas drip-feeding details will only increase the overall time cost).

| username: johnwa-CD | Original post link

  1. My cluster is relatively large and this cannot be done in a short time. It is approximately 20TB in total, with 6 TiKV nodes and 2 TiFlash nodes.

1). One of the TiKV nodes' leader count suddenly drops from 120K to 0, then slowly recovers to a balanced state, but it doesn't stay stable for more than five minutes before dropping again. This cycle repeats continuously.
2). I upgraded from 6.1.5 to 6.5.3 to resolve some issues in 6.1 (specifically the low throughput of CDC; I saw that 6.5 improved it from 4000/s to 35000/s, which convinced me to upgrade).
3). CDC is practically unusable, similar to the issues other forum members have encountered. See this feedback, for example:
TICDC distributing data to one table per TOPIC often inexplicably freezes, with no new data being written - Suggestions / Product Defects - TiDB Q&A Community (asktug.com). No one seems to be able to provide a clear explanation.
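
For the CDC symptom in point 3), it may help to capture the changefeed state before and after a freeze. A sketch with the cdc CLI, where the PD address and changefeed ID are placeholders:

```bash
# List all changefeeds and their current state (normal / stopped / error).
tiup cdc cli changefeed list --pd=http://10.0.1.10:2379

# Show the detailed status, checkpoint lag, and error history of one changefeed.
tiup cdc cli changefeed query --pd=http://10.0.1.10:2379 --changefeed-id=my-feed
```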

| username: johnwa-CD | Original post link

For example, in such a situation, can anyone help me see what’s going on…
I’ve been struggling with this for two or three days, and the entire cluster is on the verge of collapse, basically unusable.

| username: xfworld | Original post link

There is a known issue with CDC: at startup, and periodically afterwards, it collects status and data information from Kafka for CDC's performance and processing statistics.

However, this issue does not mention when it will be fixed…

There is a workaround, but it requires modifying the source code and building a patched version that skips this step.

This issue is uncertain, and it might not be the one you encountered.

If the region leader is not balanced enough, you can refer to the troubleshooting guide and check it step by step.
Basically, it is a timeout issue: the followers think the original leader is dead, so they initiate an election to select a new leader.

You can refer to the following for specific operations:
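
As a starting point (a sketch, not the full guide referenced above), leader movement and PD scheduling can be inspected with pd-ctl; the PD address is a placeholder:

```bash
# Per-store status, leader counts, and region counts (placeholder PD address).
tiup ctl:v6.5.3 pd -u http://10.0.1.10:2379 store

# Scheduling operators PD is currently issuing, e.g. leader transfers.
tiup ctl:v6.5.3 pd -u http://10.0.1.10:2379 operator show
```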

| username: xfworld | Original post link

Are there any hotspot issues? They can also lead to some abnormal situations…

It is recommended to first disable CDC and TiFlash, ensure the stability of TiKV, TiDB, and PD, and then enable them one by one…
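
As a rough sketch of that suggestion (placeholder PD address and changefeed ID): hot regions can be listed through pd-ctl, and changefeeds can be paused without removing them:

```bash
# List current read and write hot regions (placeholder PD address).
tiup ctl:v6.5.3 pd -u http://10.0.1.10:2379 hot read
tiup ctl:v6.5.3 pd -u http://10.0.1.10:2379 hot write

# Pause a changefeed temporarily instead of deleting it (placeholder changefeed ID).
tiup cdc cli changefeed pause --pd=http://10.0.1.10:2379 --changefeed-id=my-feed
```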

| username: johnwa-CD | Original post link

I currently have 6 TiKV nodes: four of them are stable, and two are experiencing the significant drops. I have configured those two not to hold leaders, so the cluster can barely keep running. CDC and TiFlash cannot be turned off, especially TiFlash, because business queries rely on TiFlash for acceleration; if TiFlash were turned off, a large number of big queries would fall on TiKV, with durations of over 5 minutes. There don't seem to be any obvious hotspot issues. Are there any logs that could be investigated? For example, I see something like this:

2023-06-23 19:04:30 (UTC+08:00)

TiKV 10.10.10.156:20160

[cleanup_sst.rs:119] ["get region failed"] [err="failed to load full sst meta from disk for [242, 148, 145, 51, 128, 141, 72, 6, 165, 188, 254, 164, 106, 120, 179, 243] and there isn't extra information provided: Engine Engine(Status { code: IoError, sub_code: None, sev: NoError, state: \"Corruption: file is too short (0 bytes) to be an sstable: /data/tidb-data/tikv-20160/import/f2949133-808d-4806-a5bc-fea46a78b3f3_4080615702_42367_145285_write.sst\" })"]
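
On the point above about keeping the two unstable stores from holding leaders: this is usually done with the evict-leader scheduler in pd-ctl. A sketch, assuming that is the mechanism used here (store ID and PD address are placeholders):

```bash
# Evict all Region leaders from a problematic store (placeholder store ID 4).
tiup ctl:v6.5.3 pd -u http://10.0.1.10:2379 scheduler add evict-leader-scheduler 4

# Remove the scheduler later so leaders can move back.
tiup ctl:v6.5.3 pd -u http://10.0.1.10:2379 scheduler remove evict-leader-scheduler-4
```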

| username: johnwa-CD | Original post link

The anomaly on the TiKV monitoring panels most likely related to the problem is a significant increase in Raft Store CPU usage.
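
One common check for that symptom, as a sketch rather than a confirmed fix: see whether the raftstore thread pool is saturated and, if so, enlarge it. The TiDB host, cluster name, and target value below are placeholders:

```bash
# Inspect the current raftstore thread pool size via the TiDB server.
mysql -h 10.0.1.11 -P 4000 -u root -e \
  "SHOW CONFIG WHERE type='tikv' AND name='raftstore.store-pool-size';"

# Enlarge the pool in the topology (e.g. raftstore.store-pool-size: 3 under the
# tikv server_configs), then roll the change out to the TiKV nodes.
tiup cluster edit-config tidb-prod
tiup cluster reload tidb-prod -R tikv
```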

| username: zhanggame1 | Original post link

The log indicates that the file /data/tidb-data/tikv-20160/import/f2949133-808d-4806-a5bc-fea46a78b3f3_4080615702_42367_145285_write.sst has a size of 0 bytes. You should check whether that is actually the case; it looks like some data stored on this TiKV node may have been lost.
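
A quick way to confirm this, as a sketch (the cluster name is a placeholder; the file path comes from the log above):

```bash
# Check the size of the SST file reported in the log, on the affected TiKV host.
ls -l /data/tidb-data/tikv-20160/import/f2949133-808d-4806-a5bc-fea46a78b3f3_4080615702_42367_145285_write.sst

# Or search every TiKV node for zero-byte SST files left in the import directory.
tiup cluster exec tidb-prod -R tikv \
  --command 'find /data/tidb-data/tikv-20160/import -name "*.sst" -size 0'
```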