Which LTS version would you recommend upgrading to for an important production cluster, 6.5.3 or 7.1.1?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 重要的生产集群拟升级,你会更建议升到哪个LTS版本,6.5.3 或 7.1.1 ?

| username: Jellybean

[TiDB Usage Environment] Production / Testing

[TiDB Version] The current production cluster is on 5.3.1, and we plan to upgrade to a newer stable maintenance release, initially targeting 6.5.3 or 7.1.1.

[Reproduction Path] Before upgrading, we will verify everything in a testing environment first, and then upgrade the production cluster in place. We are looking forward to the feature that speeds up index creation by up to 10x (a sketch of the related setting follows at the end of this post). All operations will prioritize a stable upgrade path. I would like to hear from anyone who has already run these versions in production: are there any pitfalls, and what should we watch out for? Let’s gather some ideas first.

[Encountered Issues: Problem Phenomenon and Impact] Because of the cluster’s scale and the complexity of the business, it is impractical to take a full backup or maintain a primary-secondary (DR) cluster, so we plan to take the relatively high-risk approach of upgrading the cluster in place.

[Resource Configuration] This is a large cluster with a total data volume of about 90 TB across 62 TiKV nodes. Each TiKV instance holds about 75,000 Regions and around 23,000 Leaders (the Hibernate Region feature is enabled on the cluster).
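For the index-creation speedup mentioned above, the relevant switch in 6.5.x / 7.1.x is the `tidb_ddl_enable_fast_reorg` system variable (on by default since v6.5). A minimal post-upgrade check, assuming the standard mysql client can reach the cluster; the host, port, and user below are placeholders:

```shell
# Verify that accelerated ADD INDEX (fast reorg) is enabled after the upgrade.
# <tidb-host>, port 4000, and the root user are placeholders for your environment.
mysql -h <tidb-host> -P 4000 -u root -p -e "SHOW VARIABLES LIKE 'tidb_ddl_enable_fast_reorg';"

# If it is OFF for some reason, it can be switched on globally:
mysql -h <tidb-host> -P 4000 -u root -p -e "SET GLOBAL tidb_ddl_enable_fast_reorg = ON;"
```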

| username: redgame | Original post link

Our experience is to take small steps: go with 6.5.3.

| username: zhanggame1 | Original post link

Buy the vendor’s official support service; otherwise the risk is too high.

| username: Kongdom | Original post link

6.5.3, I just upgraded, and adding indexes is indeed much faster. Upgraded from 5.1.0.
Note that you need to shut down for the upgrade, and make sure there are no DDL operations during the upgrade. In short, pay attention to the precautions mentioned in the upgrade documentation.
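On the “no DDL during the upgrade” point, a minimal sketch of a pre-upgrade check, assuming the mysql client can reach any TiDB node (connection details are placeholders):

```shell
# List recent DDL jobs; anything still in 'running' or 'queueing' state should
# finish or be cancelled before the upgrade starts.
mysql -h <tidb-host> -P 4000 -u root -p -e "ADMIN SHOW DDL JOBS;"
```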

| username: Jellybean | Original post link

Did you upgrade directly on the original cluster, or did you back up the cluster and then switch the application?

| username: Kongdom | Original post link

First, back up the data, then shut down the original cluster for the upgrade. The backup is just in case.
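For reference, a rough sketch of that backup-then-offline-upgrade flow with BR and TiUP; the cluster name, PD address, and storage path are placeholders, and the exact flags are worth double-checking against the docs for your TiUP version:

```shell
# 1. Take a full backup with BR, just in case.
tiup br backup full --pd "<pd-host>:2379" --storage "s3://backup-bucket/pre-upgrade"

# 2. Stop the cluster, upgrade it offline, then bring it back up.
tiup cluster stop <cluster-name>
tiup cluster upgrade <cluster-name> v6.5.3 --offline
tiup cluster start <cluster-name>
```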

| username: Jellybean | Original post link

It seems that there is no mandatory requirement for downtime upgrades.

Our cluster has been upgraded several times from 2.0.5 → 3.0.3 → 4.0.6 → 5.3.1, and each time we upgraded directly. Due to the large amount of data, it is difficult to back up the cluster. Did you encounter any issues with downtime upgrades?

| username: Kongdom | Original post link

If you have TiFlash, pay attention to the first point in the upgrade precautions.
This time I ran into the third one during the downtime upgrade: I assumed there would be no DDL running while the cluster was down, so I didn’t pay attention to it.

Of course, online upgrades are possible. In my environment, resources are relatively poor, so downtime upgrades are faster. The main reason is that the business allows for downtime.

| username: Jellybean | Original post link

Yes, if there is a maintenance window for downtime, this is the safest and most reliable option.

Our situation involves a complex HTAP cluster with numerous scenarios. We can tolerate brief fluctuations, but we are not allowed to have downtime for maintenance. It’s somewhat like changing a tire on a highway or repairing a circuit on a high-voltage line.

| username: Kongdom | Original post link

Then just keep an eye on DDL. Actually, the upgrade issue I ran into the other day wasn’t too serious: a DDL job caused one TiDB node to fail to start, and I resolved it by cancelling that DDL job from another node, after which everything worked fine.
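A minimal sketch of that recovery step, assuming the mysql client and a TiDB node that is still reachable (connection details and the job ID are placeholders taken from the `ADMIN SHOW DDL JOBS` output):

```shell
# From a TiDB node that is still serving, find the stuck DDL job and note its JOB_ID.
mysql -h <healthy-tidb-host> -P 4000 -u root -p -e "ADMIN SHOW DDL JOBS;"

# Cancel it by job ID so the failed node can start again.
mysql -h <healthy-tidb-host> -P 4000 -u root -p -e "ADMIN CANCEL DDL JOBS <job_id>;"
```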

| username: Jellybean | Original post link

Well, since you upgraded from 5.1.0 to 6.5.3 smoothly, upgrading from 5.3.1 shouldn’t have any major surprises either. Even if there are, you can handle them as they come.

| username: 裤衩儿飞上天 | Original post link

It’s time to consider a disaster recovery cluster. :thinking:

| username: Jellybean | Original post link

A few years ago, a disaster recovery cluster solution was proposed, but it was rejected.

This cluster has forty to fifty servers, which is quite large. Given the current environment, it’s difficult to secure that many resources. It was only possible before because we had ample resources at hand.

| username: ShawnYan | Original post link

Logically, 6.5.3 is more stable; after all, 7.x has introduced a series of new features.

| username: ShawnYan | Original post link

So features like smooth rolling upgrades are a great fit for extremely large clusters like yours.

| username: Jellybean | Original post link

Yes, as the expert mentioned, smooth upgrades are a crucial factor for us in using TiDB.

| username: 像风一样的男子 | Original post link

When you upgrade, do you wait for each TiKV to gradually evict its leaders before restarting and upgrading it, or do you use --force to skip the eviction and upgrade directly?

| username: 大飞哥online | Original post link

Let’s go with v6 first, since it has been stable for a while, and move to v7 later.

| username: Jellybean | Original post link

We usually wait for each TiKV to evict its leaders before restarting and upgrading it, so the TiKV upgrade phase can take quite a long time.

| username: 像风一样的男子 | Original post link

Last year, I tested upgrading from 4.0.9 to 5.4 and found that evicting the leader was very slow, taking several days. Later, in the production environment, I directly used --force to upgrade. Fortunately, there haven’t been any issues so far, but thinking back, the risk was very high.
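To make the trade-off in this thread concrete, a sketch of the two TiUP invocations being compared; the cluster name and target version are placeholders, and the `--transfer-timeout` flag is an assumption that should be confirmed with `tiup cluster upgrade --help` for your TiUP version:

```shell
# Default rolling upgrade: TiUP evicts leaders from each TiKV before restarting it.
# A larger transfer timeout (in seconds) gives big stores more time to drain.
tiup cluster upgrade <cluster-name> v6.5.3 --transfer-timeout 1800

# --force skips the leader eviction and restarts instances directly: much faster,
# but requests hitting the restarting store will see errors or jitter.
tiup cluster upgrade <cluster-name> v6.5.3 --force
```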