How to Start and Troubleshoot TiFlash When It Is in Down State

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiFlash 处于Down状态,如何启动和排查原因

| username: 郑旭东石家庄

[TiDB Usage Environment]
Production Environment
[TiDB Version]
7.5.1
[Reproduction Path]
No operations performed
[Encountered Issue: Problem Phenomenon and Impact]

Received an alert indicating that TiFlash is offline. Upon inspection, the TiFlash process on the server is running, and its port is reachable via telnet.

Could the experts please advise on how to analyze the error cause and how to get it to start normally? Thank you.

[Resource Configuration]
Two TiFlash nodes are deployed. One of them is down.

[Attachments: Screenshots/Logs/Monitoring]




tiflash_error_0519.log (81.4 KB)

| username: 郑旭东石家庄 | Original post link

The default value of tidb_enable_clustered_index is INT_ONLY, which means that the clustered index is enabled only for tables with an integer primary key. If you want to enable the clustered index for all tables, you can set it to ON.

| username: tidb菜鸟一只 | Original post link

Does the file /flash/tidb-data/tiflash-9000/metadata/db_161619/t_124444.sql not exist? Can you check if this directory and file are in a normal state?

| username: 郑旭东石家庄 | Original post link

This directory exists and contains many SQL files, but the prompted file is not there. This error has also occurred on other dates. It is not an isolated case.
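
A check like this can be scripted so it is easy to repeat for each file named in the error log. The following is only a minimal sketch: `check_metadata_file` is a hypothetical helper written for illustration, and the path is the one quoted from the error message in this thread. It only reads the filesystem and does not touch TiFlash itself.

```python
from pathlib import Path


def check_metadata_file(path_str):
    """Report whether a metadata file exists and list the .sql files next to it.

    Hypothetical helper for illustration; it only reads the filesystem.
    """
    path = Path(path_str)
    parent = path.parent
    # Collect the .sql files that actually exist in the same directory.
    siblings = sorted(p.name for p in parent.glob("*.sql")) if parent.is_dir() else []
    return {
        "dir_exists": parent.is_dir(),
        "file_exists": path.is_file(),
        "sibling_sql_files": siblings,
    }


if __name__ == "__main__":
    # Path taken from the error message quoted in this thread.
    result = check_metadata_file(
        "/flash/tidb-data/tiflash-9000/metadata/db_161619/t_124444.sql"
    )
    print(result)
```

If `dir_exists` is true but `file_exists` is false, that matches the situation described here: the directory is present, but the file TiFlash asks for is not.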

| username: 郑旭东石家庄 | Original post link

I found anomalies in the system logs.

| username: DBAER | Original post link

Is it out of memory (OOM)? There are also system logs above.

| username: lemonade010 | Original post link

I still think it is related to this log. Check the system log from the 17th for errors. It looks like TiFlash was running until it went down on the 17th, and then failed to start on the 19th.

| username: 郑旭东石家庄 | Original post link

System log

| username: 郑旭东石家庄 | Original post link

The most common error in the log is similar to “/flash/tidb-data/tiflash-9000/metadata/db_161619/t_124444.sql file not found.” The full log is attached above.

| username: tidb菜鸟一只 | Original post link

It is strongly recommended not to co-locate TiFlash with other components on the same machine. TiFlash is very resource-intensive, so it can easily end up competing with them for resources.

| username: 郑旭东石家庄 | Original post link

That is not possible because our resources are limited. We can only use a mixed deployment.

| username: 郑旭东石家庄 | Original post link

Moreover, resource monitoring shows that, apart from memory usage at around 80%, CPU and disk resources are ample.

| username: 我是人间不清醒 | Original post link

  1. If you have high performance requirements, it is better not to deploy TiFlash on the same machine as other components.
  2. How many replicas are left after TiFlash went offline? It is recommended to remove the TiFlash replica first and then add it back:
    ALTER TABLE xxx SET TIFLASH REPLICA 0;
    ALTER TABLE xxx SET TIFLASH REPLICA 1;

| username: 郑旭东石家庄 | Original post link

Originally there were 2 replicas, now 1 is down, leaving 1 remaining.

| username: TiDBer_QYr0vohO | Original post link

Your system log shows a SIGSEGV (segmentation fault) signal, which might point to a memory problem that led to the process being killed.
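
To separate the two possibilities, it can help to scan the kernel log for both kinds of event. On Linux, the OOM killer usually logs lines containing "invoked oom-killer" or "Out of memory", while a SIGSEGV crash typically shows up as "segfault at". The sketch below assumes those substrings (the exact wording can vary between kernel versions); the function names and the sample lines are made up for illustration and are not taken from the logs in this thread.

```python
def classify_kernel_line(line):
    """Return 'oom', 'segfault', or None for one kernel log line.

    Assumption: matches common Linux kernel message substrings; the exact
    wording can differ between kernel versions.
    """
    lowered = line.lower()
    if "oom-killer" in lowered or "out of memory" in lowered:
        return "oom"
    if "segfault at" in lowered:
        return "segfault"
    return None


def summarize(lines, process_name="tiflash"):
    """Count OOM and segfault events on lines that mention the process."""
    counts = {"oom": 0, "segfault": 0}
    for line in lines:
        if process_name.lower() not in line.lower():
            continue
        kind = classify_kernel_line(line)
        if kind is not None:
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, not taken from the logs in this thread.
    sample = [
        "kernel: Out of memory: Killed process 4321 (tiflash) total-vm:...",
        "kernel: tiflash[4321]: segfault at 0 ip ... error 4 in tiflash",
        "kernel: usb 1-1: new high-speed USB device",
    ]
    print(summarize(sample))
```

In practice the input lines could come from `dmesg -T` or `journalctl -k` saved to a file. A non-zero `oom` count would support the memory theory, while only `segfault` entries would suggest a crash inside the process rather than an OOM kill.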

| username: 郑旭东石家庄 | Original post link

Currently, the available memory is 29G. How can I restart TiFlash?

| username: 郑旭东石家庄 | Original post link

How can I further determine whether the issue is caused by memory?

| username: 我是人间不清醒 | Original post link

Run SELECT * FROM information_schema.tiflash_replica to check your current number of replicas.

| username: TiDBer_QYr0vohO | Original post link

Is this a mixed deployment? So you can’t shut the machine down to add memory, right? Then add swap space and see whether TiFlash comes up. If it starts, then it is a memory issue.
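
Before and after adding swap, it may be useful to record how much memory and swap the machine really has. A minimal sketch, assuming a Linux host where `/proc/meminfo` is available; `parse_meminfo` and `memory_summary` are hypothetical helper names chosen for this example.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values in kB."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_summary(text):
    """Return available memory and swap in GiB, rounded to one decimal."""
    info = parse_meminfo(text)
    kib_per_gib = 1024 * 1024
    return {
        "mem_available_gib": round(info.get("MemAvailable", 0) / kib_per_gib, 1),
        "swap_total_gib": round(info.get("SwapTotal", 0) / kib_per_gib, 1),
        "swap_free_gib": round(info.get("SwapFree", 0) / kib_per_gib, 1),
    }


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    # Only available on Linux; skip quietly elsewhere.
    if meminfo.exists():
        print(memory_summary(meminfo.read_text()))
```

Comparing the summary before and after adding swap shows whether the swap is actually active when TiFlash is restarted.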