Is it normal for monitoring to show anomalies after a TiKV node goes down (status "down") and then recovers to "up" an hour later?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv其中一个节点宕机之后状态为dwon,1小时之后恢复,状态恢复为up监控异常是否为正常现象

| username: 舞动梦灵

According to the monitoring screenshot, two services have not started and one node is offline.
In fact, all nodes in the cluster are up.

| username: 像风一样的男子

These two are Prometheus monitoring clients (exporters). They do not appear in the cluster display output. Go to the problematic node and restart the blackbox_exporter and node_exporter processes.
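
To confirm whether the two exporters are actually running on the recovered node, a quick port check can help. This is a minimal sketch, assuming the exporters listen on the common default ports 9100 (node_exporter) and 9115 (blackbox_exporter); the host address and the helper names are hypothetical, so adjust them to your topology.

```python
import socket

# Assumed default ports; adjust if your topology file overrides them.
EXPORTER_PORTS = {"node_exporter": 9100, "blackbox_exporter": 9115}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def exporter_status(host, ports=EXPORTER_PORTS):
    """Map each exporter name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    # "127.0.0.1" is a placeholder; replace it with the recovered TiKV host.
    for name, ok in exporter_status("127.0.0.1").items():
        print(f"{name}: {'listening' if ok else 'NOT listening'}")
```

If either exporter is reported as not listening, restart that process on the node as suggested above.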

| username: 大飞哥online

It can be ignored.

| username: 舞动梦灵

I was thinking it could be ignored, as the TiKV node has already started and is up normally.

| username: 舞动梦灵

I checked again. Indeed, these two processes had not started. I hadn’t noticed; I thought only the TiKV process needed to be started. -_-||

| username: 舞动梦灵

I also looked into the tombstone stores issue. Other articles mention that, according to the official documentation, monitoring might not respond after a TiKV node crashes and then recovers. In such cases, you need to delete a certain file in the corresponding directory, but this can be ignored.

| username: h5n1

Clear the tombstone stores with either of the following:

pd-ctl -u http://pd_ip:2379 store remove-tombstone
curl -X DELETE pd-addr:port/pd/api/v1/stores/remove-tombstone
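
Before clearing tombstone stores on a production cluster, it may help to confirm which stores are actually in the Tombstone state. This is a minimal sketch, assuming the store list JSON (for example, from `pd-ctl store`) contains a `stores` array whose entries carry `store.id`, `store.address`, and `store.state_name`. The sample data and the helper name are hypothetical. Depending on the PD version, tombstone stores may only be listed when a state filter is applied, so verify this against your own cluster.

```python
import json


def tombstone_stores(payload):
    """Return (id, address) for each store whose state_name is Tombstone."""
    result = []
    for item in payload.get("stores", []):
        store = item.get("store", {})
        if store.get("state_name") == "Tombstone":
            result.append((store.get("id"), store.get("address")))
    return result


# Hypothetical, trimmed sample; real output contains many more fields.
sample = json.loads("""
{
  "count": 2,
  "stores": [
    {"store": {"id": 1, "address": "10.0.0.1:20160", "state_name": "Up"}},
    {"store": {"id": 4, "address": "10.0.0.4:20160", "state_name": "Tombstone"}}
  ]
}
""")

for store_id, address in tombstone_stores(sample):
    print(f"tombstone store {store_id} at {address}")
```

This check is read-only and changes nothing in the cluster; only the commands above remove the tombstone records.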

| username: 舞动梦灵

Thank you. I’ve seen this approach, but without a testing environment, I’m hesitant to proceed directly. I’ll just leave it for now.

| username: zhanggame1

Take a look at the logs of the two nodes to see what was recorded at that time.

| username: 舞动梦灵

The two monitoring services did not start automatically after the restart. I started them manually.
