Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tiup扩容tikv时获取到旧的集群版本导致新节点无法启动 (tiup picked up the old cluster version when scaling out TiKV, so the new node could not start)
[TiDB Usage Environment]
Production Environment
[TiDB Version]
TiDB Cluster Version: Server version: 5.7.25-TiDB-v5.3.1
tiup Version: v1.11.1 (locally installed)
1.11.1 tiup
Go Version: go1.19.2
Git Ref: v1.11.1
GitHash: b95172df211e4f9b643590f2dd8436ad60c72b38
[Reproduction Path]
Used tiup to scale out a new TiKV instance.
Command executed: tiup cluster scale-out online1 out_tikv_20230206.yaml
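For reference, a scale-out topology file like out_tikv_20230206.yaml usually only describes the new node. The sketch below is hypothetical: the host, ports, and directories are placeholders, not values from the original post.

# hypothetical sketch: write a minimal scale-out topology and run the scale-out
cat > out_tikv_20230206.yaml <<'EOF'
tikv_servers:
  - host: 10.0.1.5            # placeholder IP of the new TiKV node
    port: 20160
    status_port: 20180
    deploy_dir: /data/deploy/tikv-20160
    data_dir: /data/tikv-20160
EOF
tiup cluster scale-out online1 out_tikv_20230206.yaml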
[Encountered Problem: Phenomenon and Impact]
Phenomenon: the tiup execution failed, and the new node reported an error and could not start: "version should compatible with version 5.3.1, got 4.0.6".
Explanation: our cluster was upgraded from 4.0.6 to 5.3.1 six months ago. This time the new node was scaled out with the old version, so TiKV failed to start.
Impact: the scale-out failed.
[Resource Configuration]
[Attachment: Screenshot/Log/Monitoring]
[2023/02/06 14:53:38.788 +08:00] [ERROR] [util.rs:347] ["request failed"] [err_code=KV-PD-gRPC] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some("version should compatible with version 5.3.1,
tiup cluster display online1 shows the old cluster version:
Cluster name: online1
Cluster version: v4.0.6
The dashboard shows 5.3.1, and logging into each machine and checking the instances also shows 5.3.1. I suspect the information in tiup was not updated correctly during the last upgrade.
Are you sure it was successfully upgraded six months ago? Try running tiup cluster display to check.
The cluster version displayed by tiup cluster display online1 is an old version:
Cluster name: online1
Cluster version: v4.0.6
The dashboard shows version 5.3.1. Logging into the machines and checking each instance also shows 5.3.1, so it is suspected that the information in tiup has not been updated.
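For anyone double-checking the same mismatch, here is a rough sketch of how the running versions can be compared with what tiup has recorded; the TiDB host and deploy directory are placeholders.

tiup cluster display online1      # the version tiup believes the cluster is on
# versions actually reported by the running components:
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT type, instance, version FROM information_schema.cluster_info;"
# or directly on a TiKV host:
<deploy_dir>/bin/tikv-server --version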
You can use tiup cluster audit to check the audit log of the previous upgrade; it looks like something went wrong during that upgrade. Could you check the log of the most recent audit ID?
It looks like you tried three times. Let’s first take a look at the log from the last attempt to see if there are any error messages.
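A sketch of the audit commands being suggested; the audit ID is whatever the listing shows for the old upgrade.

tiup cluster audit             # list past tiup cluster operations with their audit IDs
tiup cluster audit <audit-id>  # print the full execution log of that operation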
You can check the version in .tiup/storage/cluster/clusters/tidbonline/meta.yaml.
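For example, a quick way to read just the recorded version (path as given in this thread; it normally sits under the tiup user's home directory):

grep -m1 '^tidb_version' ~/.tiup/storage/cluster/clusters/tidbonline/meta.yaml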
It’s 4.0.6.
tidb_version: v4.0.6
last_ops_ver: |-
  v1.3.2 tiup
  Go Version: go1.13
  Git Branch: release-1.3
  GitHash: 2d88460
It is very likely that tiup's metadata was not updated correctly at that time. tiup reads the version from its own meta.yaml file, while TiDB Dashboard should be reading the value reported by the running components, so the two are not reading from the same source. If you confirm that the online version is 5.3.1, you can modify tiup's meta.yaml and then retry the scale-out.
Or maybe the tiup directory was changed, or the .tiup directory was overwritten by someone after the upgrade had run successfully?
The online version is confirmed to be normal. Are you referring to manually editing .tiup/storage/cluster/clusters/tidbonline/meta.yaml directly with vim?
Yes, but we need to first confirm that the versions of all online components are indeed 5.3.1.
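A minimal sketch of that manual edit, assuming the file really contains "tidb_version: v4.0.6" and keeping a backup first:

cd ~/.tiup/storage/cluster/clusters/tidbonline
cp meta.yaml meta.yaml.bak                                    # back up the metadata before touching it
sed -i 's/^tidb_version: v4\.0\.6$/tidb_version: v5.3.1/' meta.yaml
grep '^tidb_version' meta.yaml                                # should now print tidb_version: v5.3.1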
It seems you should first scale in the newly added TiKV nodes, then replay the last upgrade. In theory, after a normal upgrade the version in meta.yaml is automatically updated to the latest value.
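Sketched as commands, assuming the new node's address and the failed upgrade's audit ID are known (both are placeholders here):

tiup cluster scale-in online1 --node <new-tikv-ip>:20160   # remove the node that came up as 4.0.6
tiup cluster replay <audit-id>                             # re-run the recorded upgrade; already-finished steps are skipped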
This is a production environment; we can't upgrade casually, otherwise it will affect the business.
Besides, I don't think this is the cause: the tiup one-click upgrade half a year ago did succeed in the end.
The screenshot you provided shows a startup timeout for node_exporter, which should not have any impact.
The other components of the cluster (tidb, tikv, and pd) were all successfully upgraded to 5.3.1, and confirming the versions now also shows they are correct.
When replaying, it will proceed from the step where the error occurred, and the steps that have already been successfully executed will not be executed again. Additionally, if you are sure that all key components have been successfully upgraded, you can directly modify the meta.yaml file.
PS: The default timeout for --wait-timeout is 120 seconds, which is indeed a bit short. Upgrading TiKV will definitely fail with this setting, so I always change it to 86400 (24 hours).
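For example, the timeout can be raised on a future upgrade like this (the target version is a placeholder):

tiup cluster upgrade online1 <target-version> --wait-timeout 86400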
Although the node_exporter startup timeout does not affect the cluster's availability, it is still a step in tiup cluster upgrade: if that step fails, the whole task fails, and tiup will not update the cluster version.
What you said makes sense. The node_exporter step failed, and tiup probably updates this version information in a later step; since tiup did not run to the end, the version was never updated.
Based on this, I can first scale in the 4.0.6 node, manually fix the version in .tiup/storage/cluster/clusters/tidbonline/meta.yaml, and then scale out again with the new 5.3.1 version.