TiDB v5.0.5 Version Upgrade Fails After Stopping TiFlash and Performing Rolling Upgrade

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB v5.0.5 rolling upgrade fails after stopping TiFlash

| username: TiDBer_yyy

[TiDB Usage Environment] Test
[TiDB Version] 5.0.5
[Reference] Upgrade TiDB Using TiUP | PingCAP Docs
[Reproduction Path]

  • tiup and the cluster component have been updated
tiup update --self && tiup update cluster
download https://tiup-mirrors.pingcap.com/tiup-v1.12.2-linux-amd64.tar.gz 7.15 MiB / 7.15 MiB 100.00% 21.92 MiB/s
Updated successfully!
component cluster version v1.12.2 is already installed
Updated successfully!
  • Cluster status
Cluster type:       tidb
Cluster name:       tidb_upgrade_v5
Cluster version:    v5.0.5
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://192.168.22.72:3379/dashboard
ID                   Role     Host           Ports                            OS/Arch       Status   Data Dir                       Deploy Dir
--                   ----     ----           -----                            -------       ------   --------                       ----------
192.168.22.200:8300   cdc      192.168.22.200  8300                             linux/x86_64  Up       /data/tidb-data/cdc-8300       /data/tidb4-deploy/cdc-8300
192.168.22.72:3379    pd       192.168.22.72   3379/3380                        linux/x86_64  Up|L|UI  /data/tidb4-data/pd_3379       /data/tidb4-deploy/pd_3379
192.168.22.161:5000   tidb     192.168.22.161  5000/10180                       linux/x86_64  Up       -                              /data/tidb4-deploy/tidb_5000
192.168.22.200:9000   tiflash  192.168.22.200  9000/8123/3930/20170/20292/8234  linux/x86_64  Up       /data/tidb4-data/tiflash-9000  /data/tidb4-deploy/tiflash-9000
192.168.22.161:20260  tikv     192.168.22.161  20260/20280                      linux/x86_64  Up       /data/tidb4-data/tikv_20260    /data/tidb4-deploy/tikv_20260
192.168.22.200:20260  tikv     192.168.22.200  20260/20280                      linux/x86_64  Up       /data/tidb4-data/tikv_20260    /data/tidb4-deploy/tikv_20260
192.168.22.72:20260   tikv     192.168.22.72   20260/20280                      linux/x86_64  Up       /data/tidb4-data/tikv_20260    /data/tidb4-deploy/tikv_20260
  • Stop the TiFlash component
Stopping component tiflash
	Stopping instance 192.168.22.200
	Stop tiflash 192.168.22.200:9000 success
Stopping component node_exporter
Stopping component blackbox_exporter
Stopped cluster `tidb_upgrade_v5` successfully
  • The offline upgrade fails with Error: cluster is running and cannot be upgraded offline
tiup cluster upgrade tidb_upgrade_v5 v6.5.2 --offline
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.12.2/tiup-cluster upgrade tidb_upgrade_v5 v6.5.2 --offline
Before the upgrade, it is recommended to read the upgrade guide at https://docs.pingcap.com/tidb/stable/upgrade-tidb-using-tiup and finish the preparation steps.
This operation will upgrade tidb v5.0.5 cluster tidb_upgrade_v5 to v6.5.2.
Do you want to continue? [y/N]:(default=N) y
Upgrading cluster...

Error: cluster is running and cannot be upgraded offline

Additional Note 1

  • After TiFlash is stopped, queries routed to TiFlash fail with a TiFlash server timeout error
[ 20s ] thds: 1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
FATAL: mysql_drv_query() returned error 9012 (TiFlash server timeout) for query 'SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 2497 AND 2596'
FATAL: `thread_run' function failed: /usr/share/sysbench/oltp_common.lua:432: SQL error, errno = 9012, state = 'HY000': TiFlash server timeout
Error in my_thread_global_end(): 1 threads didn't exit

Additional Note 2

  • Adding --force lets the offline upgrade run to completion, but it is unclear what impact this has on TiDB and whether the upgrade actually takes effect
$ tiup cluster upgrade tidb_upgrade_v5 v6.5.2 --offline --force
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.12.2/tiup-cluster upgrade tidb_upgrade_v5 v6.5.2 --offline --force
...
...
+ [ Serial ] - InitConfig: cluster=tidb_upgrade_v5, user=tidb, host=172.16.22.200, path=/home/tidb/.tiup/storage/cluster/clusters/tidb_upgrade_v5/config-cache/cdc-8300.service, deploy_dir=/data/tidb4-deploy/cdc-8300, data_dir=[/data/tidb-data/cdc-8300], log_dir=/data/tidb4-deploy/cdc-8300/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/tidb_upgrade_v5/config-cache
+ [ Serial ] - UpgradeCluster
Upgraded cluster `tidb_upgrade_v5` successfully
| username: tidb狂热爱好者 | Original post link

First, delete the TiFlash node.

| username: TiDBer_yyy | Original post link

Wouldn’t that mean wiping all the data already replicated to TiFlash, which could disrupt the business?

| username: zhanggame1 | Original post link

From the documentation, it looks like you still need to remove TiFlash first.

| username: TiDBer_yyy | Original post link

That would directly disrupt the business, so whether to upgrade at all has to be evaluated.
Handled this way, the pressure on operations, DBAs, and the business is too great.

Isn’t the officially recommended method supposed to work?

| username: 等一分钟 | Original post link

I also have a TiFlash node and hit errors when upgrading directly, but there seem to be no issues after the upgrade; the versions of all nodes, including TiFlash, are correct.

| username: tidb狂热爱好者 | Original post link

You really didn’t read carefully. The person above has already posted the reasons.

| username: TiDBer_yyy | Original post link

Yes.
The official steps mention a “TiFlash downtime upgrade.” Does that actually mean deleting TiFlash directly?

If you want to upgrade TiFlash from a version earlier than 5.3 to version 5.3 or later, you must perform a TiFlash downtime upgrade. Refer to the following steps to upgrade TiFlash while ensuring other components run normally:
1. Stop the TiFlash instance: tiup cluster stop <cluster-name> -R tiflash
2. Use the --offline parameter to upgrade the cluster without restarting (only updating files): tiup cluster upgrade <cluster-name> <version> --offline, for example, tiup cluster upgrade <cluster-name> v6.3.0 --offline
3. Reload the entire cluster: tiup cluster reload <cluster-name>. At this point, TiFlash will also start normally without additional operations.
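Filled in with the cluster name and target version from this thread, the three documented steps can be sketched as a dry-run script: it only prints the tiup commands rather than executing them, so it can be reviewed before anything is run against a real cluster.

```shell
# Dry-run sketch of the documented TiFlash downtime upgrade.
# CLUSTER and VERSION are taken from this thread; adjust for your setup.

CLUSTER=tidb_upgrade_v5
VERSION=v6.5.2

tiflash_downtime_upgrade() {
    # 1. Stop only the TiFlash instances; other components keep serving.
    echo "tiup cluster stop $CLUSTER -R tiflash"
    # 2. Update the binaries on disk without restarting running services.
    echo "tiup cluster upgrade $CLUSTER $VERSION --offline"
    # 3. Reload the whole cluster; TiFlash comes back on the new version.
    echo "tiup cluster reload $CLUSTER"
}

tiflash_downtime_upgrade
```

This is essentially the sequence attempted in the original post; the thread shows step 2 being rejected by tiup v1.12.2 with "cluster is running" unless --force is added.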

| username: TiDBer_yyy | Original post link

What are your upgrade steps? Did you scale down TiFlash?

| username: 小王同学Plus | Original post link

According to the official website, it is recommended to scale down TiFlash first before performing the upgrade operation.

| username: TiDBer_yyy | Original post link

Thank you for the reply.
After consulting with the official team, the following solutions are available:

If it is really impossible to stop the TiFlash service, you can try to disable the MPP function of TiDB, then upgrade the cluster normally, and then re-enable the MPP function of TiDB:
Step 1: Disable TiDB's MPP function: run set @@global.tidb_allow_mpp=0 in TiDB, then perform a rolling restart of all TiDB nodes (the global variable only takes effect in new sessions; without a rolling restart, existing sessions will still generate MPP plans)
Step 2: Upgrade the cluster normally using tiup cluster upgrade
Step 3: Re-enable the MPP function of TiDB: Run set @@global.tidb_allow_mpp=1 in TiDB; and perform a rolling restart of all TiDB nodes
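The three steps above can be sketched as a dry-run script that prints the commands instead of running them. The mysql client invocation and connection details (the TiDB node 192.168.22.161:5000 from the topology above, user root) are assumptions; substitute your own access method.

```shell
# Dry-run sketch of the MPP workaround: disable MPP, upgrade normally,
# then re-enable MPP. Prints the commands instead of executing them.
# The mysql invocation and credentials are assumptions.

CLUSTER=tidb_upgrade_v5
VERSION=v6.5.2
TIDB="-h 192.168.22.161 -P 5000 -u root"

mpp_workaround() {
    # Step 1: disable MPP globally, then rolling-restart the TiDB nodes
    # so no surviving session can still generate MPP plans.
    echo "mysql $TIDB -e 'SET @@global.tidb_allow_mpp=0'"
    echo "tiup cluster reload $CLUSTER -R tidb"   # reload = rolling restart
    # Step 2: normal rolling upgrade of the whole cluster.
    echo "tiup cluster upgrade $CLUSTER $VERSION"
    # Step 3: re-enable MPP and rolling-restart the TiDB nodes again.
    echo "mysql $TIDB -e 'SET @@global.tidb_allow_mpp=1'"
    echo "tiup cluster reload $CLUSTER -R tidb"
}

mpp_workaround
```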

| username: redgame | Original post link

From the prompt, it seems the cluster needs to be taken offline.

| username: TiDBer_yyy | Original post link

Can this be understood as causing disruption to production operations?

| username: ljluestc | Original post link

To solve this problem, you need to ensure that the TiDB cluster is fully stopped before performing the offline upgrade. You stopped only the TiFlash component, after which queries failed with a TiFlash server timeout error.

Here are some suggestions to help resolve the issue and successfully perform the upgrade:

  1. Verify Cluster Status: Run tiup cluster display tidb_upgrade_v5 to check the current status of the TiDB cluster. Ensure that all components, including TiFlash, are stopped before attempting the upgrade.

  2. Stop the Cluster: Use the tiup cluster stop tidb_upgrade_v5 command to stop the entire TiDB cluster. This will ensure that all components are properly shut down.

  3. Retry Offline Upgrade: After the cluster is stopped, retry the offline upgrade command: tiup cluster upgrade tidb_upgrade_v5 v6.5.2 --offline. Since the cluster is no longer running, this command should now execute successfully.
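The full-stop path suggested above can be sketched the same way as a dry-run script. The final start step is not stated explicitly in the suggestion and is added here as an assumption; note that this path implies downtime for the entire cluster, not just TiFlash.

```shell
# Dry-run sketch of a full-stop offline upgrade: every component is down
# while the binaries are replaced, so plan for whole-cluster downtime.

CLUSTER=tidb_upgrade_v5
VERSION=v6.5.2

full_stop_upgrade() {
    echo "tiup cluster display $CLUSTER"    # 1. verify current status
    echo "tiup cluster stop $CLUSTER"       # 2. stop every component
    echo "tiup cluster upgrade $CLUSTER $VERSION --offline"  # 3. offline upgrade
    echo "tiup cluster start $CLUSTER"      # 4. bring the cluster back up (assumed step)
}

full_stop_upgrade
```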

If you still encounter issues or need further assistance, please provide more specific details about the problem you are facing, such as any error messages or logs, and we will be happy to assist you further.