After a component in a TiDB cluster fails, the alert reads: "Alert details: Linux server is down or the network is unreachable, please handle it promptly!"

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb-cluster集群中的某个组件出现问题后,告警消息发出的告警为:告警详情:Linux服务器,已经宕机或网络不通,请及时处理!!

| username: vcdog

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.0
[Reproduction Path] When an issue occurs with a component in the tidb-cluster, the alert message is: Alert Details: Linux server has crashed or network is unreachable, please handle it promptly!!

  1. Investigation showed that when a tidb-server component fails and its process exits, the node_exporter process on the same node also stops. The remote Prometheus can then no longer reach this node, which triggers the "server down or network unreachable" alert.
  2. In reality, the server's operating system has not crashed; only the tidb-server component has failed.
  3. In this situation, what should we do to fix this inaccurate alert?
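
The distinction described in the list above can be sketched in a few lines of Python. This is only an illustration: it assumes each scrape target reports an `up` value of 1 or 0 (as Prometheus does), and that some independent check, such as a ping probe, tells whether the host itself answers. The function name `classify_host`, the job names, and the returned labels are hypothetical and not taken from this cluster's configuration.

```python
from typing import Dict, List


def classify_host(host_reachable: bool, up_by_job: Dict[str, int]) -> List[str]:
    """Return what is actually down on one host.

    host_reachable: result of an independent check (for example a ping probe).
    up_by_job: the `up` value (1 or 0) per scrape job on that host.
    """
    if not host_reachable:
        # Only in this case is "server down or network unreachable" accurate.
        return ["host_unreachable"]
    # The host answers, so name each target that failed to respond.
    findings = [f"{job}_down" for job, up in sorted(up_by_job.items()) if up == 0]
    return findings or ["ok"]


# The situation from this post: the host is alive, but both processes exited.
print(classify_host(True, {"node_exporter": 0, "tidb": 0}))
```

With such a split, a dead node_exporter on a live host would be reported as an exporter problem rather than as a crashed server.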

| username: tidb狂热爱好者 | Original post link

This is a malfunction.

| username: 像风一样的男子 | Original post link

I don’t understand. Are you in charge of the server? Is the application failure being blamed on a server failure?

| username: chenhanneu | Original post link

TiDB does not have this alert item by default, right?

| username: vcdog | Original post link

I manage a TiDB cluster, but the alert indicates that the server is down. It’s not about shifting the blame; I mainly want to correct this alert. How can I make the alert function properly so that if the TiDB-server component crashes, it alerts that the component has crashed, rather than indicating that the server is down?

| username: chenhanneu | Original post link

In addition to this alert, are there any alerts for components being down?

| username: tidb狂热爱好者 | Original post link

This is the fault itself. You need to find out what caused the TiDB component to exit, rather than just fixing the alert to cover it up. Even though the component exited and recovered on its own, it is still a fault. If such faults accumulate, the system will eventually crash.
A journey of a thousand miles begins with a single step.
A thousand-mile dike collapses because of a single ant hole.

| username: chnage | Original post link

The node_exporter has exited, and the monitoring thinks it has crashed. Check why the exporter has also crashed, and whether it was killed by OOM (Out of Memory).
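
A rough way to check for OOM kills is to scan the kernel log (for example the output of `dmesg`) for OOM-killer lines. The sketch below assumes the common "Out of memory: Killed process PID (name)" wording, which varies between kernel versions; the helper name `find_oom_victims` and the sample lines are made up for illustration and are not from the poster's server.

```python
import re
from typing import List, Tuple

# Matches "Killed process 1234 (name)" and the older "Kill process 1234 (name)".
_OOM_KILL = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_victims(log_lines: List[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM-killer line found."""
    victims = []
    for line in log_lines:
        match = _OOM_KILL.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims


# Illustrative log lines only.
sample = [
    "[100.1] Out of memory: Killed process 4242 (tidb-server) total-vm:1000kB",
    "[100.2] unrelated message",
    "[100.3] Out of memory: Killed process 4300 (node_exporter) total-vm:100kB",
]
print(find_oom_victims(sample))
```

If node_exporter shows up among the victims, the exporter was killed together with tidb-server; if it does not, something else stopped it.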

| username: vcdog | Original post link

There is no alert indicating a component is down, only an alert for a host being down.

| username: GreenGuan | Original post link

I guess you are directly using the exporter that ships with TiDB's monitoring. I suggest deploying a separate one yourself.

| username: vcdog | Original post link

Yes, this situation occurs when the memory of the tidb-server component is exhausted, and an OOM (Out of Memory) event triggers an alert message. Strictly speaking, it can be considered a failure, specifically a failure of the tidb-server. However, the alert message indicating that the server is down is somewhat misleading.

| username: vcdog | Original post link

I was also wondering whether node_exporter gets killed along with tidb-server when the OOM occurs. In theory, node_exporter shouldn't be killed.

| username: zhanggame1 | Original post link

"When a tidb-server component fails and the process exits, the node_exporter process will also be automatically shut down."

These two should not be related, right?

| username: chenhanneu | Original post link

This alert looks like one configured at the system level, to check whether node_exporter is running.
My guess: the system alert and the TiDB component alerts are sent to different places.
TiDB's alertmanager or Prometheus might have been OOM-killed, but the manually configured node_exporter check is a separate mechanism that is not affected, so its alert is still sent normally.
When node_exporter is down, TiDB by default sends [EMERGENCY] Node_exporter server is down.

| username: vcdog | Original post link

It is also possible that it is the issue you mentioned. I will ask the operations team to confirm it this afternoon.

| username: 像风一样的男子 | Original post link

Who sent this alert? Why is it showing TiDB_server_is_down on my end?
Node_exporter being down also triggers the Node_exporter_server_is_down alert.

| username: DBAER | Original post link

First, analyze the reason why node_exporter crashed. Later, you can add a daemon process for node_exporter so that it can be restarted automatically if it crashes.
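
The "daemon" idea can be illustrated with a tiny supervisor loop. This is a minimal sketch, not a replacement for a real service manager (a systemd unit with a restart policy is the usual choice); the function `supervise` and its parameters are hypothetical.

```python
import subprocess
import sys
import time
from typing import List


def supervise(cmd: List[str], max_restarts: int, delay_s: float = 1.0) -> int:
    """Run cmd and restart it each time it exits, up to max_restarts restarts.

    Returns the total number of times the command was started.
    """
    starts = 0
    while True:
        starts += 1
        exit_code = subprocess.call(cmd)  # blocks until the process exits
        print(f"start #{starts} exited with code {exit_code}", file=sys.stderr)
        if starts > max_restarts:
            return starts
        time.sleep(delay_s)  # short pause so a crash loop does not spin
```

For node_exporter, the call would look like `supervise(["/path/to/node_exporter"], max_restarts=5)`, where the path is a placeholder. Capping the restarts keeps a persistently failing process from hiding the underlying problem.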

| username: Soysauce520 | Original post link

It depends on the alert rules you have configured. If it is a component that is down, you can change the alert to name that specific component. The problem is in the alert's description text.
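
As a rough illustration of a component-specific description, the sketch below picks the message text from the alert name. The two alert names are the ones mentioned in this thread; the message wording, the `describe` function, and the example address are assumptions for illustration only.

```python
# Message templates keyed by alert name; the wording is hypothetical.
DESCRIPTIONS = {
    "TiDB_server_is_down": "tidb-server on {instance} is down",
    "Node_exporter_server_is_down": (
        "node_exporter on {instance} is down; the host itself may still be running"
    ),
}


def describe(alertname: str, instance: str) -> str:
    """Build a description that names the failed component, not the whole server."""
    template = DESCRIPTIONS.get(alertname, "alert {alertname} fired on {instance}")
    return template.format(alertname=alertname, instance=instance)


print(describe("TiDB_server_is_down", "10.0.0.1:4000"))
```

The same idea applies to whatever template the alerting tool uses: the text should follow the alert that fired instead of always saying the server is down.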

| username: WalterWj | Original post link

This alert content is not TiDB's native alert output. The native one looks like what the user above posted:

| username: 有猫万事足 | Original post link

Who did this localization?
Can you share it?

All the alert titles on my side are in English.