TiFlash fails to start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash无法启动

| username: CHENGX

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v7.1.1
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact]
[Attachment: Screenshot/Log/Monitoring]


While adding TiFlash replicas, all TiFlash nodes went offline. When I restarted the TiFlash nodes, the restart failed with the error shown above.

| username: 有猫万事足 | Original post link

The error message says that the data is already inconsistent.
If TiKV itself is healthy, this should be an inconsistency between TiFlash and TiKV.
You will probably need to remove all TiFlash replicas and re-add them.
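
Removing and re-adding a replica is done per table with `ALTER TABLE ... SET TIFLASH REPLICA`. A minimal sketch, assuming a hypothetical table `test.t1` (substitute each affected table):

```sql
-- Hypothetical table name; repeat for every table that has a TiFlash replica.
-- 1. Drop the TiFlash replica.
ALTER TABLE test.t1 SET TIFLASH REPLICA 0;

-- 2. Confirm the replica is gone (no row, or REPLICA_COUNT = 0) before re-adding.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM INFORMATION_SCHEMA.TIFLASH_REPLICA
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 't1';

-- 3. Re-add one replica.
ALTER TABLE test.t1 SET TIFLASH REPLICA 1;
```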

Did you adjust any parameters to speed up replica synchronization?

| username: CHENGX | Original post link

I did not do much. I just ran ALTER statements to add replicas for a dozen or so tables, and then TiFlash crashed and could not restart.

| username: 有猫万事足 | Original post link

That is a bit strange. Normally replica synchronization is very slow and puts very little load on the nodes, so it should not bring down several machines. In our experience TiFlash is nearly idle during synchronization, and we usually wish we could raise the throttling parameters several times over to speed it up.

| username: 有猫万事足 | Original post link

select * from INFORMATION_SCHEMA.TIFLASH_REPLICA where progress=1;

Check whether some tables have already finished synchronizing. Your situation looks as if some tables had already been synchronized to TiFlash, and then a query was routed to TiFlash and caused it to crash.
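
To see tables that are still synchronizing alongside the finished ones, the same system table can be listed without the filter. This is only a sketch of one way to read it:

```sql
-- PROGRESS runs from 0 to 1; AVAILABLE = 1 means the replica can serve queries.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM INFORMATION_SCHEMA.TIFLASH_REPLICA
ORDER BY PROGRESS DESC, TABLE_SCHEMA, TABLE_NAME;
```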

| username: CHENGX | Original post link

We have already scaled TiFlash in and back out.

| username: CHENGX | Original post link

However, after scaling in and out, adding the TiFlash replicas again caused it to crash :broken_heart:

| username: 有猫万事足 | Original post link

Run the SQL above. It looks as if some tables were queried right after they finished synchronizing. You can also check the slow query log for queries on those tables.

| username: CHENGX | Original post link

That did happen, but are queries not allowed while TiFlash replicas are being added? Would that prevent TiFlash from restarting? :flushed:

| username: 有猫万事足 | Original post link

You still need to check whether the execution plans of those SQL statements involve TiFlash.
It is not that queries cannot run while TiFlash replicas are being added. Rather, some queries produce very large results and consume a lot of resources, and they would cause problems on TiFlash even with all replicas fully created.
What I described earlier is one possibility, but it cannot be confirmed as the cause yet.
Ultimately, we need to look at the execution plans of the slow queries. If the "task" column of a plan involves TiFlash and the "estRows" column shows a large number, that is likely the cause.
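
As a sketch of that check, assuming a hypothetical query on a table `test.t1`: run `EXPLAIN` on each suspect statement and read the `task` and `estRows` columns.

```sql
-- Hypothetical query; substitute a statement taken from the slow query log.
-- In the output, a task value containing [tiflash] (for example cop[tiflash]
-- or mpp[tiflash]) means that operator runs on TiFlash.
EXPLAIN SELECT COUNT(*) FROM test.t1;

-- To rule TiFlash out for the current session while investigating,
-- restrict the read engines so the optimizer cannot choose it.
SET SESSION tidb_isolation_read_engines = 'tikv,tidb';
```

The session-level setting only affects the connection that runs it, so it is a low-risk way to compare plans with and without TiFlash.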

| username: CHENGX | Original post link

In fact, after checking the SQL execution records and execution plans, I found that TiFlash was not used. The data volume is not large either. What puzzles me is not that TiFlash crashed but that it cannot be restarted.

| username: ShawnYan | Original post link

What does this table look like? Does the data have any special characteristics?

| username: CHENGX | Original post link

The table structures do not seem to have anything special about them.

| username: 有猫万事足 | Original post link

Then it is not the situation I guessed.
TiFlash cannot restart and reports the error because of data inconsistency, which is likely related to a failure during synchronization.
That is how the two are connected.

| username: Kongdom | Original post link

:thinking: Have you upgraded after installing TiFlash?

| username: CHENGX | Original post link

Moreover, after redeploying TiFlash, adding a replica for the same table reproduces the problem.

| username: tidb菜鸟一只 | Original post link

Does TiFlash crash only when you add a replica to one specific table, or does it crash for any table? Try creating an empty table and adding a replica to it to see what happens.
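
A minimal sketch of that test, assuming a hypothetical probe table `test.tiflash_probe`:

```sql
-- Hypothetical empty table used only to test replica creation.
CREATE TABLE test.tiflash_probe (id BIGINT PRIMARY KEY);
ALTER TABLE test.tiflash_probe SET TIFLASH REPLICA 1;

-- Watch whether the replica becomes available and TiFlash stays up.
SELECT TABLE_SCHEMA, TABLE_NAME, AVAILABLE, PROGRESS
FROM INFORMATION_SCHEMA.TIFLASH_REPLICA
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'tiflash_probe';

-- Clean up afterwards.
DROP TABLE test.tiflash_probe;
```

If the empty table synchronizes without trouble, the problem is more likely tied to the data or schema of the specific tables than to TiFlash as a whole.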

| username: CHENGX | Original post link

Here is what actually happened: we upgraded from v6.5.2 to v7.1.1, added a TiFlash replica to a table after the upgrade, and then TiFlash crashed and could not be restarted.

| username: CHENGX | Original post link

It is one specific table.

| username: Kongdom | Original post link

Online upgrade? But this doesn’t seem to match your version.