Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: reload失败,是什么原因引起的 (What causes a reload failure?)

Reload failed, but according to the logs, it should have succeeded in the end. What could be the reason?
Display the cluster status?
Are the logs located at /usr/local/server/tikv-2000/log?
Yes, /usr/local/server/tikv-2000/log/tikv.log, the cluster status is all up.
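For reference, the cluster status can be checked with `tiup cluster display`; a minimal sketch, where `mycluster` is a placeholder for the actual cluster name:

```shell
# List every component with its host, port, and status.
# A healthy node shows "Up"; "mycluster" is a placeholder name.
tiup cluster display mycluster
```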
The reload should have succeeded; the message only indicates that leader eviction hasn’t fully completed.
Then there’s no problem. I’ve encountered this situation before where the startup is particularly slow and doesn’t succeed within the timeout period. The frontend will report a timeout error, but the backend is still continuing to start. As long as the final cluster status is up, there’s no issue.
On this machine, retries are consistently slow and end in failure. How should I determine the cause of the slowness?
You can set --transfer-timeout to a larger value, or use --force.
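A sketch of the two options, assuming a placeholder cluster name `mycluster` (`--transfer-timeout` is in seconds; 1200 here is an arbitrary example larger than the default):

```shell
# Give leader eviction more time before the reload is declared failed.
tiup cluster reload mycluster --transfer-timeout 1200

# Or skip waiting for leader eviction entirely. This is faster but may
# briefly disrupt requests served by the restarting stores.
tiup cluster reload mycluster --force
```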
Error: failed to get leader count 1.1.1.40: metric tikv_raftstore_region_count{type="leader"} not found
There are also such errors reported.
If a machine in the cluster goes down, is it very likely that the region leader will have issues, and PD won’t start? The logs are as follows:
[2023/12/25 14:47:12.122 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2023/12/25 14:47:46.643 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
Evicting the leader is too slow and times out. Can we add a timeout parameter?
Error: failed to restart: 1.1.1.1.58 tikv-2000.service,
This is the cause of the error. Check the log content of the corresponding node to see if there is any abnormal information.
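One way to do that check, assuming the log path mentioned earlier in this thread (adjust to your deployment):

```shell
# Look for recent ERROR/FATAL entries in the TiKV log on the failing node.
grep -E '\[ERROR\]|\[FATAL\]' /usr/local/server/tikv-2000/log/tikv.log | tail -n 20

# Check whether the systemd unit itself failed to start, and why.
systemctl status tikv-2000.service
journalctl -u tikv-2000.service --since "1 hour ago"
```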
It depends on the number of nodes and replicas. With fewer than half of the nodes down, I’ve only once encountered a situation where PD couldn’t start.
The reload timed out but ultimately succeeded: the error only indicates a timeout during the reload process, and the operation completed successfully in the end.