What causes reload failure?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: reload失败,是什么原因引起的 (Reload failed; what is the cause?)

| username: 月明星稀

Reload failed, but according to the logs, it should have succeeded in the end. What could be the reason?

| username: Kongdom | Original post link

Can you post the cluster status?

Are the logs located at /usr/local/server/tikv-2000/log?

| username: 月明星稀 | Original post link

Yes, /usr/local/server/tikv-2000/log/tikv.log. The cluster status is all up.

| username: Jasper | Original post link

  1. Reloading TiKV waits in a loop for the node's TiKV leaders to be transferred away before restarting it. The related parameter is transfer-timeout, which defaults to 2 minutes; if the transfer does not finish within that window, the corresponding TiKV is restarted directly.
  2. From the error you posted, it looks like the corresponding TiKV node failed to start when the restart was attempted. You can use tiup cluster display to check the status of that TiKV node (see the sketch below).
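A minimal check, assuming the cluster is named `tidb-test` (substitute your actual cluster name):

```shell
# Show every node's status; each TiKV node's Status column should
# read "Up" once the reload settles.
tiup cluster display tidb-test
```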

| username: 月明星稀 | Original post link

All are in up status.

| username: zhanggame1 | Original post link

The reload should have succeeded; the message just indicates that leader eviction hadn't fully completed.

| username: Kongdom | Original post link

Then there's no problem. I've encountered this situation before: the startup is particularly slow and doesn't finish within the timeout period. The frontend reports a timeout error, but the backend keeps starting up. As long as the final cluster status is up, there's no issue.

| username: 月明星稀 | Original post link

But the other machines haven’t reloaded yet, sigh.

| username: 月明星稀 | Original post link

Whenever the reload reaches this machine, the retries are slow and eventually fail. How can I determine what is causing the slowness?

| username: TIDB-Learner | Original post link

You can set --transfer-timeout to a larger value, or use --force.
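For example, a sketch with `tidb-test` as a placeholder cluster name (the timeout value is in seconds; pick one that fits your workload):

```shell
# Give leader eviction more time before tiup gives up waiting
# and restarts the node anyway.
tiup cluster reload tidb-test -R tikv --transfer-timeout 600

# Or skip waiting for leader eviction entirely; faster, but requests
# hitting the restarted node may error until leaders are re-elected.
tiup cluster reload tidb-test -R tikv --force
```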

| username: 月明星稀 | Original post link

Error: failed to get leader count 1.1.1.40: metric tikv_raftstore_region_count{type="leader"} not found

Errors like this are also being reported.
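To check whether that metric is actually being exported, you can query the TiKV status endpoint directly. A sketch, assuming the default status port 20180 (adjust to your deployment):

```shell
# The leader-count metric tiup looks for should appear in the
# Prometheus-format output of the TiKV status endpoint.
curl -s http://1.1.1.40:20180/metrics | grep tikv_raftstore_region_count
```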

| username: 月明星稀 | Original post link

If a machine in the cluster goes down, is it likely that the region leaders will have problems and PD won't start? The logs are as follows:

[2023/12/25 14:47:12.122 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2023/12/25 14:47:46.643 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
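To verify that PD is healthy and has a leader, you can query it with pd-ctl. A sketch; `<version>` and `<pd-ip>` are placeholders for your cluster's version and a PD address (2379 is the default client port):

```shell
# List PD members and show which one is the current leader.
tiup ctl:<version> pd -u http://<pd-ip>:2379 member

# Report the health of each PD member.
tiup ctl:<version> pd -u http://<pd-ip>:2379 health
```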

| username: 小龙虾爱大龙虾 | Original post link

Leader eviction is too slow and times out. Can you increase the timeout parameter?

| username: oceanzhang | Original post link

Did it succeed using the --force parameter?

| username: Jellybean | Original post link

Error: failed to restart: 1.1.1.1.58 tikv-2000.service

This is the direct cause of the failure. Check the log content on the corresponding node to see whether there is any abnormal information.
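A quick way to look for that, assuming the log path mentioned earlier in the thread:

```shell
# Pull the most recent errors and warnings from the TiKV log around
# the failed restart; adjust the path and count to your deployment.
grep -E '\[ERROR\]|\[WARN\]' /usr/local/server/tikv-2000/log/tikv.log | tail -n 50
```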

| username: andone | Original post link

It timed out; add the timeout parameter.

| username: tidb菜鸟一只 | Original post link

Look here: tiup cluster reload | PingCAP Docs

| username: Kongdom | Original post link

It depends on the number of nodes and replicas. With fewer than half of the nodes down, I've only ever encountered PD failing to start once.

| username: dba远航 | Original post link

The reload timed out, but the result was successful. It just means a timeout occurred during the reload process; the operation ultimately completed successfully.