When scaling out TiKV with tiup, the old cluster version is picked up, causing the new node to fail to start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiup扩容tikv时获取到旧的集群版本导致新节点无法启动

| username: Jellybean

[TiDB Usage Environment]
Production Environment

[TiDB Version]
TiDB Cluster Version: Server version: 5.7.25-TiDB-v5.3.1

tiup Version:
    Local installed version: v1.11.1
    1.11.1 tiup
    Go Version: go1.19.2
    Git Ref: v1.11.1
    GitHash: b95172df211e4f9b643590f2dd8436ad60c72b38

[Reproduction Path]
Use tiup to scale out a new TiKV instance.
Command executed: tiup cluster scale-out online1 out_tikv_20230206.yaml
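
For reference, the scale-out topology file only contained the new TiKV node; it looked roughly like the sketch below (the host, ports, and directories here are placeholders, not the real values from our environment):

    # Sketch of the scale-out topology file; host, ports, and dirs are placeholders.
    cat > out_tikv_20230206.yaml <<'EOF'
    tikv_servers:
      - host: 10.0.1.5
        port: 20160
        status_port: 20180
        deploy_dir: /data/deploy/tikv-20160
        data_dir: /data/tikv-20160/data
    EOF

    # Scale out using the topology file above.
    tiup cluster scale-out online1 out_tikv_20230206.yaml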

[Encountered Problem: Phenomenon and Impact]
Phenomenon: The tiup execution failed, and the new node reported an error and could not start: "version should compatible with version 5.3.1, got 4.0.6".
Explanation: Our cluster was upgraded from 4.0.6 to 5.3.1 six months ago. This time the new node was deployed with the old version (v4.0.6), so the new TiKV failed to start.
Impact: Scale-out failed

[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]

  1. [2023/02/06 14:53:38.788 +08:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some("version should compatible with version 5.3.1,

  2. tiup cluster display online1 shows the cluster version is the old version
    Cluster name: online1
    Cluster version: v4.0.6

The dashboard shows 5.3.1

Logging into the machines and checking each instance also shows 5.3.1, so I suspect that the cluster metadata in tiup was not updated correctly during the last upgrade.

| username: tidb菜鸟一只 | Original post link

Are you sure it was successfully upgraded six months ago? Try running tiup cluster display to check.

| username: Jellybean | Original post link

The cluster version displayed by tiup cluster display online1 is an old version.
Cluster name: online1
Cluster version: v4.0.6

The dashboard shows version 5.3.1

Logging into the machines and checking each instance also shows version 5.3.1, so I suspect that the metadata in tiup was not updated.

| username: srstack | Original post link

You can use tiup cluster audit to check the audit log of the previous upgrade. It seems there was an anomaly in the previous upgrade.
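
Something like this (pick the audit ID that corresponds to the upgrade from the list):

    # List previous tiup cluster operations and their audit IDs.
    tiup cluster audit

    # Print the full execution log of a specific operation.
    tiup cluster audit <audit-id>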

| username: Jellybean | Original post link

Do you mean checking the log of the most recent audit ID?

| username: srstack | Original post link

It looks like you tried three times. Let’s first take a look at the log from the last attempt to see if there are any error messages.

| username: dba-kit | Original post link

You can check the version in .tiup/storage/cluster/clusters/tidbonline/meta.yaml.
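
For example (the cluster name in the path is whatever tiup uses for your cluster):

    # Check the version recorded in tiup's cluster metadata; <cluster-name> is a placeholder.
    grep '^tidb_version' ~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml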

| username: Jellybean | Original post link

It’s 4.0.6.

tidb_version: v4.0.6
last_ops_ver: |-
    v1.3.2 tiup
    Go Version: go1.13
    Git Branch: release-1.3
    GitHash: 2d88460

| username: dba-kit | Original post link

It is highly likely that the error was caused by tiup's metadata not being updated correctly at that time. tiup reads the version from its own meta.yaml file, while tidb_dashboard reads it from the actually running components, so the two are not reading from the same source. If you confirm that the online version is 5.3.1, you can modify tiup's meta.yaml and redeploy it.

| username: dba-kit | Original post link

Or maybe the tiup directory changed: after the upgrade finished successfully, could the .tiup directory have been overwritten by someone else?

| username: Jellybean | Original post link

The online version is confirmed to be normal. Do you mean manually editing the file .tiup/storage/cluster/clusters/tidbonline/meta.yaml directly with vim?

| username: Jellybean | Original post link

Here is the audit log for the final upgrade:
tiup_from_v4.0.6_to_v5.3.1_audit_fTQBhcVB75L_20220607.log (8.6 MB)

| username: dba-kit | Original post link

Yes, but we need to first confirm that the versions of all online components are indeed 5.3.1.
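
A few ways to cross-check, roughly (the hosts, ports, and paths below are placeholders):

    # Cluster version recorded in PD.
    tiup ctl:v5.3.1 pd -u http://<pd-host>:2379 config show cluster-version

    # Version reported by a running TiDB server's status API.
    curl http://<tidb-host>:10080/status

    # Version of the TiKV binary actually deployed on a store.
    /<deploy-dir>/bin/tikv-server --version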

| username: dba-kit | Original post link

Wasn't this upgrade interrupted?

| username: dba-kit | Original post link

It seems that first, you should scale in the newly added TiKV nodes, then replay the last upgrade. Theoretically, after a normal upgrade, the version in meta.yaml will automatically change to the latest value.
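
Roughly like this (the node address and audit ID are placeholders):

    # Remove the TiKV node that came up as 4.0.6.
    tiup cluster scale-in online1 -N <new-tikv-host>:20160

    # Replay the failed upgrade from the step where it stopped.
    tiup cluster replay <audit-id>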

| username: Jellybean | Original post link

This is a production environment; we can't casually rerun an upgrade, otherwise it will affect the business.

Moreover, I don't think this is the cause: the tiup one-click upgrade half a year ago did complete successfully in the end.

| username: Jellybean | Original post link

The screenshot you provided shows a startup timeout for node_exporter, which should not have any impact.

The other components of the cluster (tidb, tikv, and pd) have all been successfully upgraded to version 5.3.1, and checking their versions now also shows they are correct.

| username: dba-kit | Original post link

When replaying, it will proceed from the step where the error occurred, and the steps that have already been successfully executed will not be executed again. Additionally, if you are sure that all key components have been successfully upgraded, you can directly modify the meta.yaml file.
PS: The default timeout for --wait-timeout is 120 seconds, which is indeed a bit short. Upgrading TiKV will definitely fail with this setting, so I always change it to 86400 (24 hours).
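
For example, when rerunning an upgrade (just an illustration):

    # Allow slow steps (e.g. evicting TiKV leaders) up to 24 hours instead of the 120s default.
    tiup cluster upgrade online1 v5.3.1 --wait-timeout 86400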

| username: dba-kit | Original post link

Although the node_exporter startup timeout does not affect the cluster’s availability, as a step in the tiup cluster upgrade, a failure in execution also means the task fails, and tiup will not change the cluster version.

| username: Jellybean | Original post link

What you said makes sense. The node_exporter step failed, and tiup would presumably have updated this version information in a later step; since the execution never reached the end, the version was not updated.

Based on this, I can first scale in the 4.0.6 node, manually modify the version in .tiup/storage/cluster/clusters/tidbonline/meta.yaml, and then scale out again with the new 5.3.1 version.
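
As a rough sketch of that plan (the node address and the cluster name in the paths are placeholders, and I would back up meta.yaml first):

    # 1. Scale in the TiKV node that was deployed as 4.0.6.
    tiup cluster scale-in online1 -N <new-tikv-host>:20160

    # 2. Back up tiup's metadata, then fix the recorded version to match reality.
    cp ~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml \
       ~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml.bak
    sed -i 's/^tidb_version: v4.0.6/tidb_version: v5.3.1/' \
       ~/.tiup/storage/cluster/clusters/<cluster-name>/meta.yaml

    # 3. Confirm tiup now reports v5.3.1, then scale out again.
    tiup cluster display online1
    tiup cluster scale-out online1 out_tikv_20230206.yaml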