The dashboard reports that there are no instances alive on the host, so it is unable to retrieve host information

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: dashboard报该主机上没有实例存活,因此无法获取主机信息

| username: TiDBer_o0MXVuK4

[TiDB Usage Environment] Testing
[TiDB Version] 6.1.0
[Problem Encountered] After deploying the cluster using tiup, when entering dashboard → cluster information → disk, the disk information of one PD node does not appear, and an error is reported: “Host information cannot be obtained because there are no instances alive on this host.”
[Reproduction Path] No operations were performed; the issue appeared on the dashboard right after deployment.
[Problem Screenshot]


However, everything else is normal, as shown below:



[PD Log File]
pd.log (28.7 KB)
blackbox_exporter.log (764 bytes)
node_exporter.log (11.0 KB)
[Additional Notes]

  1. The cluster was just deployed: no data, no workload, and the machines are clean.
  2. There are three PD nodes, all on well-provisioned machines; only this one PD has the issue.
  3. Disk information can be seen in Grafana, as shown below:
  4. The blackbox_exporter and node_exporter of this node are also normal, as shown below:
| username: xfworld | Original post link

Try restarting the problematic PD node… restart it individually…

| username: TiDBer_o0MXVuK4 | Original post link

Are you suggesting restarting the problematic PD node? Should I restart the service or the machine?

| username: xfworld | Original post link

If it is not the leader node, you can restart the machine first, then restart the service.

tiup cluster restart <cluster-name> --node IP:PORT
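Before restarting, a quick way to confirm which PD is the leader is tiup cluster display (a minimal sketch; tidb-prod is the cluster name used in this thread):

tiup cluster display tidb-prod
# the PD leader's row shows "Up|L" in the Status column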

| username: TiDBer_o0MXVuK4 | Original post link

Restarting the service with tiup cluster restart tidb-prod -N 10.0.6.42:2379 didn’t fix it, so I’ll try restarting the machine.

| username: TiDBer_o0MXVuK4 | Original post link

After restarting the machine and then restarting the service, it still doesn’t work.
I can confirm that all services are running normally.

| username: xfworld | Original post link

Which node’s log is that from? Does it show anything abnormal?

| username: xfworld | Original post link

Check the communication between these nodes…
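For example, a minimal sketch of a connectivity check against the PD client port (2379, the port used elsewhere in this thread):

# from each cluster node, verify the problematic PD's members API is reachable
curl http://10.0.6.42:2379/pd/api/v1/members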

| username: TiDBer_o0MXVuK4 | Original post link

The logs should have nothing to do with this error:

  1. The earlier batch of errors occurred during the startup phase: this PD was the first to start and the other processes weren’t up yet, so it was simply waiting for the next check. There were no errors after that.
  2. The errors below appeared after the restart, when I accessed the UI again without being logged in and was automatically redirected to the login page. (If I log out to the login page first, restart the service, and then log in directly, this error does not appear. The IP ending in 185 is my local machine.)
| username: TiDBer_o0MXVuK4 | Original post link

At present it does not affect usage; it just looks off and feels like a bug.

| username: qizheng | Original post link

You can check the cluster_hardware view for disk-related information.
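For example, a minimal sketch via the MySQL client (<tidb-host> is a placeholder for any TiDB server in the cluster; 4000 is the default SQL port):

# query the disk rows for the problematic host from the system view
mysql -h <tidb-host> -P 4000 -u root -p -e \
  "SELECT * FROM information_schema.cluster_hardware
   WHERE device_type = 'disk' AND instance = '10.0.6.42:2379';"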

| username: TiDBer_o0MXVuK4 | Original post link

  1. I checked everything, and the disk information is normal.
  2. I also inspected the data returned to the browser with the developer tools, and the disk information there is normal too; the response for this host is shown below.
  3. After a while another node developed the same issue as well, so I’m confident the services themselves are running normally.
{
    "host": "10.0.6.42",
    "cpu_info": {
        "arch": "amd64",
        "logical_cores": 32,
        "physical_cores": 32
    },
    "cpu_usage": {
        "idle": 0.96,
        "system": 0
    },
    "memory_usage": {
        "used": 30730096640,
        "total": 101191753728
    },
    "partitions": {
        "/boot": {
            "path": "/boot",
            "fstype": "ext4",
            "free": 324261888,
            "total": 499337216
        },
        "/data": {
            "path": "/data",
            "fstype": "ext4",
            "free": 766832631808,
            "total": 951061880832
        },
        "/var/lib/docker/containers": {
            "path": "/var/lib/docker/containers",
            "fstype": "ext4",
            "free": 61470498816,
            "total": 88122449920
        }
    },
    "instances": {
        "10.0.6.42:2379": {
            "type": "pd",
            "partition_path_lower": ""
        }
    }
}

| username: Minorli-PingCAP | Original post link

Hello, there are no clear logs for this issue at the moment. Based on the description, it looks like a dashboard collection problem. It is recommended to run tiup cluster check and review the results to see whether any OS configuration is suboptimal, as sketched below.
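A minimal sketch of that check, using the cluster name from this thread (--cluster tells tiup to check a deployed cluster rather than a topology file):

tiup cluster check tidb-prod --cluster
# review the Result and Message columns for any failed OS checks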