TiFlash Startup Timeout When Restarting TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb重启时tifish启动超时

| username: TiDBer_j9d3wEwH

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.1.0
[Reproduction Path] Stopped and then started the cluster in the production environment. During startup, it hung at the TiFlash startup step and then failed with a timeout error.
[Encountered Problem: Symptoms and Impact]
Checked the logs on the TiFlash server:
[2024/02/25 13:35:49.867 +08:00] [ERROR] [Server.cpp:844] ["Bootstrap failed because sync schema error: DB::Exception: Wrong column name. Cannot find column 109512.90.1 to drop: table name: t_96616, table id: 96616\nWe will sleep for 3 seconds and try again."] [thread_id=1]
I checked my production TiDB database and found no table named t_96616. It may have been dropped from the database, so I suspect the issue is caused by inconsistent or corrupted metadata. I am not sure how to proceed from here. Please help, experts!!! Thank you so much.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: yeminhua

Are you restarting the entire cluster? I’ve only encountered this in v6.5.1. Try restarting the TiFlash node separately. Also, the monitoring components are started last, so it looks like the cluster never finished starting. Log in to Grafana to check the status of each node, and start any missing ones separately.

| username: TiDBer_j9d3wEwH

Yes, teacher, that matches exactly what you said. In the end, the cluster did not start completely: Grafana, Alertmanager, CDC, and Prometheus did not start either. At that time, I used tiup cluster reload tidb-test --role prometheus to start these nodes separately, and they all succeeded. However, even though everything showed green, the Dashboard page still loaded very slowly, and a few hours later users could not connect to the TiDB database.

| username: wangccsy

The cluster failed to start. The schema on the nodes is inconsistent.

| username: TiDBer_j9d3wEwH

Yes, yes, how can I solve this?

| username: WalterWj

Let’s see if a developer can take a look. If not, reset TiFlash: remove all the TiFlash replicas, scale in the TiFlash node and scale it back out, then add the TiFlash replicas again.

| username: TiDBer_j9d3wEwH

Thank you for your guidance, teacher. Additionally, will it affect my TiKV data? Because we have tens of terabytes of data in our TiDB database.

| username: WalterWj

Normally it won’t. However, TiFlash will be unavailable during the reset, and re-adding the replicas also consumes some resources.

| username: 托马斯滑板鞋

Are your TiFlash replicas set at the database level or the table level? You can try removing the replicas directly and then restarting TiFlash.

| username: TiDBer_j9d3wEwH

First of all, thank you for your help, teacher. I only have one TiFlash node in my entire TiDB cluster, and I don’t know whether the replicas are set at the database level or the table level. Also, may I ask, teacher: if I want to delete the TiFlash replica, where on the TiFlash server is the replica address configured?

| username: 托马斯滑板鞋

SELECT * FROM INFORMATION_SCHEMA.TIFLASH_REPLICA;
Set the replica count to 0 for everything listed there.
Table level:
ALTER TABLE table_name SET TIFLASH REPLICA 0;
Database level:
ALTER DATABASE db_name SET TIFLASH REPLICA 0;

After clearing the replicas, try restarting TiFlash.
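
If the query above returns many tables, writing each ALTER TABLE by hand gets tedious. Here is a minimal Python sketch that turns the result rows into table-level statements. It assumes you have already fetched (TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT) from INFORMATION_SCHEMA.TIFLASH_REPLICA with a client of your choice. The function names and the sample rows are made up for illustration, and the script only prints statements, it does not run anything against the cluster.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def clear_replica_statements(rows):
    """Build one ALTER TABLE ... SET TIFLASH REPLICA 0 per table with replicas.

    rows: iterable of (table_schema, table_name, replica_count) tuples,
    assumed to come from INFORMATION_SCHEMA.TIFLASH_REPLICA.
    """
    statements = []
    for schema, table, replica_count in rows:
        if replica_count > 0:  # skip tables that already have no replica
            statements.append(
                f"ALTER TABLE {quote_ident(schema)}.{quote_ident(table)} "
                "SET TIFLASH REPLICA 0;"
            )
    return statements


if __name__ == "__main__":
    # Hypothetical rows, not taken from the cluster in this thread.
    sample_rows = [("demo_db", "orders", 1), ("demo_db", "archive", 0)]
    for statement in clear_replica_statements(sample_rows):
        print(statement)
```

Review the printed statements before running them, since removing replicas makes TiFlash queries on those tables unavailable until the replicas are added back.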

| username: TiDBer_j9d3wEwH

Following up on my earlier reply about the Dashboard page loading very slowly: that was because two of the TiKV nodes had their firewalls enabled, so connections to them kept failing. It has nothing to do with this issue.

| username: TiDBer_j9d3wEwH

The issue has been resolved. First, check whether a record exists for the table ID from the error log with the query SELECT * FROM INFORMATION_SCHEMA.tables c WHERE c.tidb_table_id='96616';. Then, for the table it returns, execute ALTER TABLE ds_qth.gjzf_dw_zz SET TIFLASH REPLICA 0;. Restart TiDB, and the problem should be resolved. Thanks to everyone for their help and suggestions. Much appreciated.
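
For anyone hitting the same error, the first step above (going from the TiFlash log line to a lookup query) can be sketched in Python as follows. This is only an illustration: the regular expression assumes the log message contains "table id: <number>" as in the error quoted in the first post, and the helper names are made up. It builds the query text and leaves running it to you.

```python
import re

# Assumes the error message contains "table id: <digits>", as in this thread.
TABLE_ID_PATTERN = re.compile(r"table id: (\d+)")


def extract_table_id(log_line: str):
    """Return the table ID found in a TiFlash log line, or None if absent."""
    match = TABLE_ID_PATTERN.search(log_line)
    return int(match.group(1)) if match else None


def lookup_query(table_id: int) -> str:
    """Build a query that maps a TiDB table ID to its schema and table name."""
    return (
        "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
        f"WHERE TIDB_TABLE_ID = {int(table_id)};"
    )


if __name__ == "__main__":
    line = (
        "Cannot find column 109512.90.1 to drop: "
        "table name: t_96616, table id: 96616"
    )
    table_id = extract_table_id(line)
    if table_id is not None:
        print(lookup_query(table_id))
```

The table name in the log (t_96616) is built from the table ID, which is why searching for it by name finds nothing while the lookup by ID does.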
