Restarting a Node in the Cluster - Node Restart Error, Sometimes Fails to Start, Sometimes Succeeds

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 重启集群中的一个节点-重启节点报错,偶尔起不来,偶尔能起来

| username: 艾维iii

[TiDB Usage Environment] Production Environment
[TiDB Version] 7.1.1
[Reproduction Path] Restart any node in the cluster
[Encountered Problem: Problem Phenomenon and Impact]
Error: init config failed: 192.168.4.14:2379: transfer from /root/.tiup/storage/cluster/clusters/tidb-test/config-cache/pd-192.168.4.14-2379.service to /tmp/pd_5c639c96-93a9-4757-89a3-1195781e5ca7.service failed: failed to scp /root/.tiup/storage/cluster/clusters/tidb-test/config-cache/pd-192.168.4.14-2379.service to tidb@192.168.4.14:/tmp/pd_5c639c96-93a9-4757-89a3-1195781e5ca7.service: Process exited with status 1

| username: dba远航 | Original post link

It feels like an issue with SCP instability. Try doing a manual SCP to see if it works.
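The manual check suggested above might look like the following (a sketch; the paths and the 192.168.4.14 address come from the error message, and `tidb` is assumed to be the deploy user):

```shell
# Verify passwordless SSH from the tiup host to the failing node.
# BatchMode=yes makes ssh fail immediately instead of prompting for a password.
ssh -o BatchMode=yes tidb@192.168.4.14 'echo ssh ok'

# Reproduce the transfer tiup attempts: copy the cached service file to /tmp.
scp /root/.tiup/storage/cluster/clusters/tidb-test/config-cache/pd-192.168.4.14-2379.service \
    tidb@192.168.4.14:/tmp/

# A non-zero exit code here points at SSH/scp, not at TiDB itself.
echo "scp exit code: $?"
```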

| username: 艾维iii | Original post link

[The original reply was an image; its content was not preserved in the translation.]

| username: Jasper | Original post link

Does it always report the 192.168.4.14 node? Is there an issue with mutual trust?

| username: 艾维iii | Original post link

I was operating on machine 14, and even SCP had issues on it.

| username: TiDBer_jYQINSnf | Original post link

Can you still connect to this machine? It looks like there are quite a few errors.

| username: dba远航 | Original post link

Have you set up mutual trust (passwordless SSH)?

| username: 艾维iii | Original post link

Yes, it can be operated on machine 14.

| username: 艾维iii | Original post link

It looks like the cluster was deployed without mutual trust; a unified username and password was used instead.

| username: 艾维iii | Original post link

TiUP is on 14.

| username: Jasper | Original post link

The cluster maintains a set of public keys for mutual trust, located at .tiup/storage/cluster/clusters/<cluster_name>/ssh. However, if there is an issue with mutual trust, it should not start at all. Based on your description, it occasionally starts, so it doesn’t seem like a mutual trust issue.
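To rule this out, you can test SSH explicitly with the key TiUP stores for the cluster (a sketch; `tidb-test` is the cluster name from the error message, and the `ssh/id_rsa` file name inside that directory is an assumption):

```shell
# Test SSH to the node using the key tiup maintains for the cluster.
# -i points ssh at the cluster's private key; BatchMode disables password prompts.
ssh -i ~/.tiup/storage/cluster/clusters/tidb-test/ssh/id_rsa \
    -o BatchMode=yes tidb@192.168.4.14 'hostname'
```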

| username: 艾维iii | Original post link

It shows as failed, but sometimes the node can be brought up, and sometimes it can’t.

| username: 艾维iii | Original post link

I think the main reason is that the number of threads is too high, causing the CPU to be busy with thread switching.

| username: Jasper | Original post link

Check if the public key in id_rsa.pub exists in ~/.ssh/authorized_keys.
Also, what actions did you take that led to the current state?
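The check above can be done with something like the following (a sketch; the first command runs on the tiup host, the second on the target node as the deploy user):

```shell
# On the tiup host: print the public key being offered during login.
cat ~/.ssh/id_rsa.pub

# On the target node (192.168.4.14), as the deploy user:
# confirm that exact key is present in authorized_keys.
# -F matches the key as a fixed string, not a regular expression.
grep -F "$(cat /tmp/id_rsa.pub_from_tiup_host)" ~/.ssh/authorized_keys \
  && echo "key present" || echo "key missing"
```

The `/tmp/id_rsa.pub_from_tiup_host` path is illustrative; copy the public key over by any means and point `grep` at it.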

| username: Jasper | Original post link

You can try reconfiguring mutual trust and see if it works. You can check the following two links (the links were not preserved in the translation):

| username: 艾维iii | Original post link

It shouldn't be a mutual trust issue.

| username: TiDBer_小阿飞 | Original post link

This should still be caused by the issue of mutual trust, right? Try configuring mutual trust for all machines and then test it again.

| username: 艾维iii | Original post link

I see there is an existing tidb user, but I don't know its password. Can I just change the password directly?

| username: TiDBer_小阿飞 | Original post link

You can reset the password for either the tidb or the root user first and then configure mutual trust; it depends on which user was used to deploy the cluster. Generally, configuring it for root is sufficient, but given your current situation, it is recommended to establish mutual trust for both the root and tidb users.
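Setting up mutual trust for a user can be sketched as follows (assumes `ssh-copy-id` is available; run once per user, and the node list shown is illustrative):

```shell
# Run as the user that will own the trust (e.g. root or tidb) on the tiup host.
# Generate a key pair if one does not exist yet (-N '' = empty passphrase).
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -b 4096 -N '' -f ~/.ssh/id_rsa

# Copy the public key to each node; you will be prompted for the password once.
for host in 192.168.4.14 192.168.4.15; do   # replace with your actual node IPs
  ssh-copy-id "$USER@$host"
done

# Verify passwordless login afterwards; this should print "ok" with no prompt.
ssh -o BatchMode=yes "$USER@192.168.4.14" 'echo ok'
```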

| username: 艾维iii | Original post link

After configuring mutual trust, it doesn’t take effect, and SSH still requires a password. It’s frustrating.
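A common reason key-based login keeps asking for a password even after the key is copied is overly permissive file modes on the target's `~/.ssh`, which sshd silently rejects (a hedged diagnostic sketch, not from the thread; the log path varies by distro):

```shell
# On the target node: sshd ignores authorized_keys if permissions are too open.
# These are the modes sshd expects.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# Check the sshd log on the target for the real rejection reason
# (/var/log/secure is the RHEL/CentOS location; Debian/Ubuntu use /var/log/auth.log).
sudo tail -n 50 /var/log/secure

# Re-test from the tiup host with verbose output to see which auth methods run.
ssh -v tidb@192.168.4.14 'echo ok' 2>&1 | grep -i 'publickey\|password'
```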