Dashboard shows a different number of instances compared to tiup

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Dashboard显示实例数量和tiup不一致

| username: EricSong

[TiDB Usage Environment] Testing
[Reproduction Path] None
[Encountered Problem: Phenomenon and Impact]
The node hosts multiple roles; both Prometheus and PD are installed on it. It previously crashed because Prometheus had accumulated a large amount of cached data. After the restart, tiup cluster display tidb-lab no longer includes this node among the PD nodes, but the Dashboard still lists it in the PD node list.
I would like to know which status is accurate, and how to synchronize the state between the two.
[Attachment: Screenshot/Log/Monitoring]

10.247.168.18:2378   pd            10.247.168.18   2378/2380    linux/x86_64  Up       /tidb-data/pd-2378            /tidb-deploy/pd-2378
10.247.168.77:2378   pd            10.247.168.77   2378/2380    linux/x86_64  Up|L|UI  /tidb-data/pd-2378            /tidb-deploy/pd-2378
10.247.168.75:9090   prometheus    10.247.168.75   9090         linux/x86_64  Up       /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090

| username: Billmay表妹 | Original post link

Did this problem occur after an upgrade?

| username: Kongdom | Original post link

tiup should be treated as the source of truth; try restarting the cluster.
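
For reference, a full restart and re-check with tiup would look like this (cluster name taken from the original post):

tiup cluster restart tidb-lab    # restart all components of the cluster
tiup cluster display tidb-lab    # verify the topology afterwards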

| username: EricSong | Original post link

No, the cluster has not been upgraded. It just crashed and restarted after the disk space was full.

| username: EricSong | Original post link

I’ll look for other solutions first, since this issue appeared after restarting the node. I’m worried that a similar problem could occur in production, where we would not be able to resolve it by restarting the cluster.

| username: tidb菜鸟一只 | Original post link

SELECT * FROM INFORMATION_SCHEMA.CLUSTER_INFO;
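
If the full output is too wide to read, a narrower query can list just the PD instances (column names are from the standard INFORMATION_SCHEMA.CLUSTER_INFO schema; 'pd' is the component type to filter on):

SELECT TYPE, INSTANCE, VERSION, START_TIME
FROM INFORMATION_SCHEMA.CLUSTER_INFO
WHERE TYPE = 'pd';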

| username: srstack | Original post link

Is tiup missing a PD node? Have you performed any operations on this PD node with tiup before? tiup's topology information is stored locally on the tiup control machine, so if you have not operated on the node through tiup, tiup cluster display should not, in theory, be missing any information.
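
As a rough way to check (this is the default tiup data directory; adjust the path if TIUP_HOME is customized), the locally stored topology can be inspected directly on the control machine:

# default location of the cluster metadata kept by tiup on the control machine
cat ~/.tiup/storage/cluster/clusters/tidb-lab/meta.yaml

# then compare what tiup believes against what PD itself reports
tiup cluster display tidb-lab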

| username: EricSong | Original post link

This SQL shows that there are three PD nodes, which is consistent with the Dashboard.

| username: EricSong | Original post link

Yes, tiup display is missing a PD node. We scaled this node out a few months ago, but there should not have been any similar operations recently.

| username: tidb菜鸟一只 | Original post link

Is the PD process still running on that host? Check the online configuration with tiup cluster edit-config tidb-lab to see if the configuration for this PD is still there.
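
Concretely, that check might look like this (host IP taken from the earlier output):

ps -ef | grep pd-server            # is the PD process still alive on 10.247.168.75?
tiup cluster edit-config tidb-lab  # is this host still listed under pd_servers?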

| username: EricSong | Original post link

edit-config no longer contains this node. The remaining pd_servers configuration is:

pd_servers:
- host: 10.247.168.18
  ssh_port: 22
  name: pd-10.247.168.18-2378
  client_port: 2378
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2378
  data_dir: /tidb-data/pd-2378
  log_dir: /tidb-deploy/pd-2378/log
  arch: amd64
  os: linux
- host: 10.247.168.77
  ssh_port: 22
  name: pd-10.247.168.77-2378
  client_port: 2378
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2378
  data_dir: /tidb-data/pd-2378
  log_dir: /tidb-deploy/pd-2378/log
  arch: amd64
  os: linux
cdc_servers:

| username: tidb菜鸟一只 | Original post link

Is the PD process still running on that host? I think you can re-specify this PD node in tiup and try scaling it out again.
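
A minimal sketch of that scale-out, assuming the ports and directories from the original deployment (the file name scale-out.yaml is arbitrary; if the old PD data directory still holds state, it usually needs to be cleaned up, or the stale member removed from PD, before the node rejoins):

pd_servers:
- host: 10.247.168.75
  client_port: 2378
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2378
  data_dir: /tidb-data/pd-2378
  log_dir: /tidb-deploy/pd-2378/log

tiup cluster scale-out tidb-lab scale-out.yaml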

| username: EricSong | Original post link

The PD process on the machine is still running. I’ll try specifying the node again and scaling it out.
tidb 1430 1.9 2.0 20237948 334792 ? Ssl Jan04 174:45 bin/pd-server --name=pd-10.247.168.75-2378 --client-urls=http://0.0.0.0:2378 --advertise-client-urls=http://10.247.168.75:2378 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://10.247.168.75:2380 --data-dir=/tidb-data/pd-2378 --join=http://10.247.168.18:2378,http://10.247.168.77:2378 --config=conf/pd.toml --log-file=/tidb-deploy/pd-2378/log/pd.log

| username: EricSong | Original post link

An error occurred during the scale-out, and manually running systemctl enable node_exporter-9100.service on the host produces the same error.

2023-01-10T07:31:22.234Z ERROR CheckPoint {“host”: “10.247.168.75”, “port”: 22, “user”: “tidb”, “sudo”: true, “cmd”: “systemctl daemon-reload && systemctl enable node_exporter-9100.service”, “stdout”: “”, “stderr”: “Failed to execute operation: No such file or directory\n”, “error”: “executor.ssh.execute_failed: Failed to execute command over SSH for ‘tidb@10.247.168.75:22’ {ssh_stderr: Failed to execute operation: No such file or directory\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl enable node_exporter-9100.service"}, cause: Process exited with status 1”, “errorVerbose”: “executor.ssh.execute_failed: Failed to execute command over SSH for ‘tidb@10.247.168.75:22’ {ssh_stderr: Failed to execute operation: No such file or directory\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl enable node_exporter-9100.service"}, cause: Process exited with status 1\n at github.com/pingcap/tiup/pkg/cluster/executor.(*EasySSHExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/ssh.go:174\n at github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/checkpoint.go:85\n at github.com/pingcap/tiup/pkg/cluster/module.(*SystemdModule).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/module/systemd.go:98\n at github.com/pingcap/tiup/pkg/cluster/operation.systemctl()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:376\n at github.com/pingcap/tiup/pkg/cluster/operation.systemctlMonitor.func1()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:286\n at The Go Programming Language\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\n at runtime.goexit()\n\truntime/asm_amd64.s:1581”, “hash”: “ce8eb0a645cc3ead96a44d67b1ecd5034d112cf0”, “func”: “github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute”, “hit”: false}
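
The "No such file or directory" here suggests systemd cannot find the node_exporter-9100 unit itself. A quick check on that host (the path below is the usual location tiup writes unit files to, so treat it as an assumption):

ls -l /etc/systemd/system/node_exporter-9100.service   # does the unit file exist at all?
systemctl status node_exporter-9100.service            # what does systemd itself report?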

| username: srstack | Original post link

Manually remove the PD on the abnormal node, then use tiup to scale out again. There may have been a problem during the previous tiup scaling operation; you can pull up the audit log of that operation with tiup cluster audit and post it for review.
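
For reference, pulling up the audit record looks roughly like this (the audit ID is whatever the list shows for the failed scale-out):

tiup cluster audit              # list all recorded operations with their audit IDs
tiup cluster audit <audit-id>   # show the log of one specific operation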

| username: tidb菜鸟一只 | Original post link

It looks like this PD is failing to start properly. Log in with pd-ctl and check the health status, then use member to find the ID of the abnormal node. After deleting the member with ID 1319539429105371180, try scaling out again.
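
A sketch of those pd-ctl steps (substitute your actual cluster version for v6.1.0 and point -u at a healthy PD endpoint; the member ID comes from the member output):

tiup ctl:v6.1.0 pd -u http://10.247.168.18:2378 health    # check which members are healthy
tiup ctl:v6.1.0 pd -u http://10.247.168.18:2378 member    # list members and their IDs
tiup ctl:v6.1.0 pd -u http://10.247.168.18:2378 member delete id 1319539429105371180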

| username: EricSong | Original post link

Earlier I found that the scale-out failure was caused by the node_exporter service not being registered on the original machine. After creating the relevant link and registering the service, the scale-out succeeded, and the two views are now consistent. I suspect the earlier inconsistency was related to this node_exporter issue.
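
For anyone hitting the same thing, the repair described above might look roughly like this; where the unit file comes from is deployment-specific (copying it from a healthy node is just one option), so treat the source path as an assumption:

# make sure the unit file exists where systemd expects it, e.g. copied from a healthy node
scp healthy-node:/etc/systemd/system/node_exporter-9100.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable node_exporter-9100.service
systemctl start node_exporter-9100.service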

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.