TiFlash is Down, Part 2!

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiFlash 处于Down状态, 第二弹 !!!

| username: 郑旭东石家庄

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path]
[Encountered Problem: Problem Phenomenon and Impact]
On the afternoon of June 1st, we suddenly found that TiFlash had crashed again. This is the second time; the first incident was covered in this post:
TiFlash 处于Down状态,如何启动和排查原因 - TiDB 的问答社区 ("TiFlash is in Down status: how to start it and investigate the cause" on the TiDB Q&A community).

The same problem occurred again.
Most of the replies last time suggested it was a resource issue or a mixed-deployment issue. Learning from that experience, we took the problematic node offline and no longer co-located TiFlash with TiKV. However, due to limited resources, we still placed TiFlash and the TiDB server on the same machine. Memory usage was under 60%, and CPU usage was under 10%.


Checking the operating system logs, we found the following information:

Here are the error logs around 17:00:
error.log (262.1 KB)

Observing the error logs, we found many warning messages before the failure:

Could it be that after importing files into TiFlash, we directly deleted the TiKV table, causing the crash?

This timestamp matches the time at which the system logs show the process was killed.

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]

| username: 郑旭东石家庄 | Original post link

We discovered a large number of core files. After opening one and tracing it with gdb, we found the following:
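For reference, a minimal sketch of pulling a backtrace out of a core file with gdb. Both paths below are hypothetical placeholders (the thread does not give the deploy directory or core file name); the command is only echoed as a dry run, so substitute your own paths before executing it.

```shell
# Hypothetical locations; replace with your actual TiFlash binary and core file.
TIFLASH_BIN="/tidb-deploy/tiflash-9000/bin/tiflash/tiflash"  # assumed deploy path
CORE_FILE="core.12345"                                       # assumed core file name

# Print (dry run) the non-interactive gdb invocation that dumps the
# backtrace and thread list from the core file:
echo "gdb --batch -ex 'bt full' -ex 'info threads' ${TIFLASH_BIN} ${CORE_FILE}"
```

Running the echoed command against the real binary and core file prints the stack of the crashing thread, which is what the poster used to conclude it was a memory error.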

| username: xfworld | Original post link

Could you walk through the entire sequence of operations? It looks very messy…

It would also help to describe in detail what happened before and after these operations…

| username: 郑旭东石家庄 | Original post link

This is the same cluster.


In May, the TiFlash instance on 192.168.55.24 crashed. I posted a thread at the time, and many people said it was resource contention caused by deploying TiKV and TiFlash on the same server; when I checked, memory usage had peaked at 87%. Following everyone's advice, I took the TiFlash on that server offline and scaled in. Then, on June 1st, the only remaining TiFlash, deployed on 192.168.55.25, also crashed. The developers reported that they could not connect to the database, and on investigation I found the same issue. No manual operations were involved; the most frequent operations were ordinary CRUD plus a few truncates, the same as before.
I then checked the resource usage and found that both CPU and memory usage were low.

I then checked the system logs and found a kill message at 17:16.


I then checked the TiFlash logs, which are in the error file attached above, and found a message indicating that a certain SQL file could not be found. Then I found a core file record under TiFlash, and upon checking, I found it was a memory error.


This is the entire process of the incident.

| username: tidb菜鸟一只 | Original post link

Your TiFlash is no longer competing for resources with TiKV, but has moved to the TiDB server. Isn’t it still competing for resources there?
If you have multiple TiDB instances, try scaling down the TiDB server on this machine first and let TiFlash exclusively use one machine to see what happens.
Additionally, it might not be the issue you mentioned:
Could it be that after importing the file into TiFlash, you directly deleted the table from TiKV, causing the crash?
You are currently facing a situation where TiFlash cannot find its corresponding table structure file. The table still exists on TiKV, but it is missing on TiFlash.
This is different from the situation where the table is deleted from TiKV but still exists on TiFlash.

| username: h5n1 | Original post link

Is this cluster upgraded to 7.5.1 or is it a fresh installation?

| username: 郑旭东石家庄 | Original post link

Freshly installed

| username: 郑旭东石家庄 | Original post link

There are two issues now:
The first is how to determine what operation caused this problem and how to avoid it in the future.
The second is that if it were a resource contention issue, resource usage would have to be very high for contention to occur. However, CPU and memory usage are both low, so contention seems unlikely.

| username: tidb狂热爱好者 | Original post link

TiDB is also a memory hog, and TiFlash should be deployed separately.

| username: 郑旭东石家庄 | Original post link

At the time of the issue, the memory usage was less than 60% :sweat_smile:
How do I explain this to the boss? If it had reached 80%, it would be easier to explain. Now with such low memory usage, it’s hard to explain.

| username: 托马斯滑板鞋 | Original post link

Check the system logs and the core dump files. It is highly likely that the process was killed by the OS (the OOM killer) due to insufficient memory, which is what produced the core dump.
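To check that quickly, you can grep the kernel log for OOM-killer activity around the crash time. The log line below is a fabricated sample for illustration only (the PID and hostname are made up); on the real host you would run the same filter against /var/log/messages or `dmesg -T` output.

```shell
# Fabricated sample of what an OOM-killer kernel message looks like:
sample_log='Jun  1 17:16:02 host kernel: Out of memory: Killed process 48213 (tiflash)'

# The same filter works on real logs, e.g.:
#   grep -iE 'out of memory|killed process' /var/log/messages
echo "$sample_log" | grep -iE 'out of memory|killed process'
```

If this filter matches nothing at the crash time, the process was probably not OOM-killed, and the core file backtrace becomes the more important lead.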

| username: 郑旭东石家庄 | Original post link

I checked the system inspection platform, and the memory usage is less than 60%.

| username: h5n1 | Original post link

select TABLE_SCHEMA, TABLE_NAME, TIDB_TABLE_ID from information_schema.tables where TIDB_TABLE_ID=146376; (note: the filter column is TIDB_TABLE_ID; information_schema.tables has no table_id column) Check what operations you have performed on this table.

pd-ctl region 7596696  Check the status of this region.

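If pd-ctl is not at hand, the same region status is served by PD's HTTP API at `/pd/api/v1/region/id/<id>`. The PD address below is an assumption (the thread does not give one), so the curl command is only echoed as a dry run.

```shell
PD_ADDR="127.0.0.1:2379"   # assumption: replace with your actual PD host:port
REGION_ID=7596696          # the region ID from the error log above

# Print (dry run) the REST equivalent of `pd-ctl region 7596696`:
echo "curl -s http://${PD_ADDR}/pd/api/v1/region/id/${REGION_ID}"
```

The JSON response includes the region's peers and their store IDs, which shows whether the TiFlash learner peer still exists for that region.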
| username: 托马斯滑板鞋 | Original post link

Is there a log at the corresponding time point in /var/log/messages?

| username: 郑旭东石家庄 | Original post link

This is the system log with the corresponding time.

| username: tidb狂热爱好者 | Original post link

A non-official deployment is not necessarily unreliable; the Alibaba Cloud instance you purchase is also a virtual machine. If the TiFlash table data is corrupted, take the TiFlash node offline and then bring it back online.
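The offline-then-online rebuild mentioned above usually looks like the sketch below. The cluster name and topology file are hypothetical placeholders, and the commands are echoed as a dry run rather than executed; adjust them to your environment before running for real.

```shell
CLUSTER="mycluster"          # hypothetical cluster name; use your own
NODE="192.168.55.25:9000"    # the crashed TiFlash node from this thread

# Step 1 (in a MySQL client, before scaling in): drop the TiFlash replicas
#   ALTER TABLE <db>.<table> SET TIFLASH REPLICA 0;

# Step 2: scale the broken TiFlash node out of the cluster (dry run):
echo "tiup cluster scale-in ${CLUSTER} --node ${NODE}"

# Step 3: once the node is fully removed (tombstone cleared), add it back
# with a topology file describing the TiFlash instance (dry run):
echo "tiup cluster scale-out ${CLUSTER} scale-out-tiflash.yaml"

# Step 4 (in a MySQL client): re-create the replicas so data re-syncs
#   ALTER TABLE <db>.<table> SET TIFLASH REPLICA 1;
```

Setting the replica count to 0 first matters: TiUP refuses to remove the last TiFlash node while tables still have TiFlash replicas configured.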

| username: 郑旭东石家庄 | Original post link

Local physical machine

| username: 郑旭东石家庄 | Original post link

[This post contained only an image; no text was available to translate.]

| username: 郑旭东石家庄 | Original post link

According to the developers’ feedback, it’s just normal CRUD operations, with at most a few truncate operations.

| username: 托马斯滑板鞋 | Original post link

What version of TiDB is this?