TiFlash Abnormality

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiFlash异常

| username: FutureDB

【TiDB Usage Environment】Testing
【TiDB Version】V6.5.4
【Reproduction Path】Operations performed that led to the issue
【Encountered Issues: Phenomenon and Impact】
(1) In the cluster information on TiDB Dashboard, TiFlash shows that a restart occurred yesterday and that it is currently online, but querying the cluster information with TiUP shows no TiFlash node (see the check sketched after this list);
(2) Executing SQL involving TiFlash reports the following error:
SQL Error [1105] [HY000]: [FLASH:Table:SyncError] cannot find schema diff for version: 50088;
(3) Before the restart time shown on TiDB Dashboard, TiFlash was mainly processing operations that create partitions for partitioned tables.
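
For reference, the TiUP-side check would look something like the sketch below; the cluster name is a placeholder, not the real one from this cluster.

```shell
# Compare the TiUP view of the topology with what TiDB Dashboard shows.
tiup cluster display test-cluster

# A healthy TiFlash instance should appear as a "tiflash" role with status "Up";
# in this case no tiflash line was listed at all.
```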

| username: TiDBer_5cwU0ltE | Original post link

What does resource usage look like? If resources are insufficient, components are prone to restarting on their own.

| username: gary | Original post link

Check the Grafana–system-info monitoring before and after the restart.

| username: changpeng75 | Original post link

It seems that the cluster cannot correctly recognize the TiFlash node. You can try restarting the entire cluster and following the documentation below to handle it:

If that still doesn't work, you can only manually decommission the node and redeploy TiFlash. Refer to the documentation:
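
In rough terms, the two options look like this (the cluster name and node address below are placeholders; adjust them to your topology before running anything):

```shell
# Option 1: restart only the TiFlash component (or the whole cluster if needed).
tiup cluster restart test-cluster -R tiflash

# Option 2: decommission the TiFlash node and redeploy it.
# If this is the last TiFlash node, first set the TiFlash replica count of the
# affected tables back to 0, otherwise the scale-in cannot complete.
tiup cluster scale-in test-cluster --node 10.0.0.5:9000
tiup cluster scale-out test-cluster scale-out-tiflash.yaml
```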

| username: DBAER | Original post link

Try restarting it directly; the impact should be minimal since this is logic on the AP side.

| username: ffeenn | Original post link

Check the logs for any error messages. As for not being able to see the components with tiup display, it might be an issue with the tiup metadata. Check the information in ~/.tiup/manifests to see if there’s any problem with tiflash. The quickest way to recover is to follow the suggestion above: take it offline and then bring it back online.

| username: ffeenn | Original post link

The file should be in ~/.tiup/storage/cluster/clusters/{cluster_name}/meta.yaml.
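
To check whether TiUP still knows about the TiFlash node, something like this should work (the cluster name is a placeholder):

```shell
# If there is no tiflash_servers entry in the TiUP metadata,
# tiup cluster display will not list any TiFlash node.
grep -A 5 "tiflash_servers" ~/.tiup/storage/cluster/clusters/test-cluster/meta.yaml
```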

| username: TiDBer_aaO4sU46 | Original post link

I suggest trying to restart the TiFlash node again.

| username: FutureDB | Original post link

There is no significant workload; resources are still sufficient.

| username: FutureDB | Original post link

We scaled the node in and then scaled it out again.

| username: redgame | Original post link

Try restarting it first; if that doesn't work, then scale it in and back out.

| username: FutureDB | Original post link

We finally found the problem: TiFlash was missed during the previous upgrade, so the TiFlash node was still on V6.1.2. After scaling in the V6.1.2 TiFlash node and scaling out a V6.5.4 TiFlash node, TiFlash is now working normally.
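
For anyone checking for the same kind of version drift, the binaries on each node can be asked for their versions directly. The paths below are the TiUP default deploy directories and are assumptions; adjust them to your topology.

```shell
# TiFlash keeps its binary under <deploy_dir>/bin/tiflash/ in a TiUP deployment (default path assumed).
/tidb-deploy/tiflash-9000/bin/tiflash/tiflash version

# The other components print their versions with -V / --version.
/tidb-deploy/tidb-4000/bin/tidb-server -V
/tidb-deploy/tikv-20160/bin/tikv-server --version
/tidb-deploy/pd-2379/bin/pd-server -V
```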

| username: Soysauce520 | Original post link

The upgrade should be automatic. How did TiFlash get excluded from the upgrade?

| username: FutureDB | Original post link

We are performing an offline upgrade, not an online upgrade.

| username: TiDBer_rvITcue9 | Original post link

Learned something new.

| username: Soysauce520 | Original post link

An offline upgrade from 6.1 to 6.5 should proceed normally. Was the upgrade interrupted? Check whether the Prometheus and Grafana versions have also been upgraded.

| username: JaySon-Huang | Original post link

> We finally found the problem: TiFlash was missed during the previous upgrade, so the TiFlash node was still on V6.1.2. After scaling in the V6.1.2 TiFlash node and scaling out a V6.5.4 TiFlash node, TiFlash is now working normally.

It seems that TiDB in the cluster was upgraded to 6.5 while TiFlash stayed on 6.1. When the cluster hits scenarios involving the concurrent DDL execution framework introduced in v6.2.0, the 6.1 TiFlash cannot handle them, which causes errors.

TiDB v6.2.0 introduced a new concurrent DDL execution framework that allows DDL operations on different table objects to run concurrently, solving the problem of DDL operations on different tables blocking each other. It also supports concurrent execution for scenarios such as adding indexes and changing column types on different tables, significantly improving execution efficiency.
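
A minimal way to see that concurrency (not from the original thread; the table names and connection details are made up): submit DDL statements against two different tables at the same time and observe that neither waits for the other to finish.

```shell
# Two DDLs on different tables submitted concurrently; with the concurrent DDL
# framework in v6.2.0+ they no longer queue behind each other in a single DDL queue.
mysql -h 127.0.0.1 -P 4000 -u root -e "ALTER TABLE test.t1 ADD INDEX idx_a (a);" &
mysql -h 127.0.0.1 -P 4000 -u root -e "ALTER TABLE test.t2 ADD INDEX idx_b (b);" &
wait
```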