Is it normal for monitoring to show anomalies after a TiKV node goes down (status "Down") and recovers to "Up" an hour later?

According to the image, two services have not started and one node is offline.
In reality, all nodes in the cluster are up.

These two are Prometheus monitoring clients (exporters). They do not appear in the cluster display output. Log in to the affected node and restart the blackbox_exporter and node_exporter processes.
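Before restarting anything, it may help to confirm which exporter is actually unreachable. The following is a minimal sketch, not an official procedure: the function names are hypothetical, and the ports 9100 (node_exporter) and 9115 (blackbox_exporter) are assumed defaults that may differ in your deployment.

```shell
#!/usr/bin/env bash
# Probe an exporter's HTTP port and print "up" or "down".
# Any HTTP response counts as "up"; a refused or timed-out connection is "down".
probe_exporter() {
  local host="$1" port="$2"
  if curl -s -o /dev/null --noproxy '*' --max-time 2 "http://${host}:${port}/"; then
    echo "up"
  else
    echo "down"
  fi
}

# Report both exporters on one host.
# Ports are assumptions; override with NODE_EXPORTER_PORT / BLACKBOX_EXPORTER_PORT.
report_exporters() {
  local host="$1"
  echo "node_exporter: $(probe_exporter "$host" "${NODE_EXPORTER_PORT:-9100}")"
  echo "blackbox_exporter: $(probe_exporter "$host" "${BLACKBOX_EXPORTER_PORT:-9115}")"
}
```

Call it as `report_exporters <node_ip>`. If an exporter reports "down", one way to find the service to restart on that node is `systemctl list-units --type=service | grep exporter`, assuming the exporters are managed by systemd.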

It can be ignored.

I also assumed it could be ignored, since the TiKV node has already started and is up normally.

I checked again. Indeed, these two processes had not started… I hadn't noticed. I assumed only the TiKV process needed to be started. -_-||

I looked into the tombstone stores issue and found other articles saying that, according to the official documentation, monitoring might not update after a TiKV node crashes and then recovers. In that case you need to delete a certain file in the corresponding directory, although the issue can also simply be ignored.

pd-ctl -u http://pd_ip:2379 store remove-tombstone
or curl -X DELETE pd-addr:port/pd/api/v1/stores/remove-tombstone
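For anyone hesitant to run this directly, the curl variant can be wrapped so that it only prints the request unless explicitly confirmed. This is a rough sketch under stated assumptions: the function name `remove_tombstone` and the `CONFIRM` variable are hypothetical, and the `http://` scheme is assumed from the pd-ctl example above.

```shell
#!/usr/bin/env bash
# Print the remove-tombstone request by default; send it only when CONFIRM=yes.
# pd_addr is the host:port of a PD member, e.g. pd_ip:2379.
remove_tombstone() {
  local pd_addr="$1"
  local url="http://${pd_addr}/pd/api/v1/stores/remove-tombstone"
  if [ "${CONFIRM:-no}" = "yes" ]; then
    # Real request: same call as the curl command in the post.
    curl -X DELETE "$url"
  else
    # Dry run: show what would be sent without touching the cluster.
    echo "dry run: curl -X DELETE $url"
  fi
}
```

Running `remove_tombstone pd_ip:2379` only prints the command. Running `CONFIRM=yes remove_tombstone pd_ip:2379` sends it.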

Handle it.

Thank you. I’ve seen this approach, but without a testing environment, I’m hesitant to proceed directly. I’ll just leave it for now.

Take a look at the logs of the two nodes to see what was recorded at that time.

The two corresponding monitoring services did not start automatically after the restart, so they were started manually.

