What causes reload failure?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: reload失败,是什么原因引起的 (Reload failed; what is the cause?)

| username: 月明星稀

Reload failed, but according to the logs, it should have succeeded in the end. What could be the reason?

| username: Kongdom | Original post link

Can you post the cluster status?

Are the logs located at /usr/local/server/tikv-2000/log?

| username: 月明星稀 | Original post link

Yes, /usr/local/server/tikv-2000/log/tikv.log. The cluster status is all up.

| username: Jasper | Original post link

  1. Reloading TiKV waits in a loop for the node's TiKV leaders to be transferred away before restarting it. The related parameter is transfer-timeout, which defaults to 2 minutes; if the transfer does not finish within that window, the corresponding TiKV is restarted directly.
  2. From the error you posted, it looks like the corresponding TiKV node failed to start when the restart was attempted. You can use tiup cluster display to check the status of that TiKV node (see the sketch below).
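A minimal check, assuming the cluster is named `tidb-test` (substitute your actual cluster name):

```shell
# Show every node's status; each TiKV node's Status column should
# read "Up" once the reload settles.
tiup cluster display tidb-test
```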

| username: 月明星稀 | Original post link

All are in up status.

| username: zhanggame1 | Original post link

The reload should have succeeded; the message just indicates that leader eviction hadn't fully completed.

| username: Kongdom | Original post link

Then there's no problem. I've encountered this situation before: the startup is particularly slow and doesn't finish within the timeout period. The frontend reports a timeout error, but the backend keeps starting up. As long as the final cluster status is up, there's no issue.

| username: 月明星稀 | Original post link

But the other machines haven’t reloaded yet, sigh.

| username: 月明星稀 | Original post link

Whenever the reload reaches this machine, the retries are slow and eventually fail. How can I determine what is causing the slowness?

| username: TIDB-Learner | Original post link

You can set --transfer-timeout to a larger value, or use --force.
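For example, a sketch with `tidb-test` as a placeholder cluster name (the timeout value is in seconds; pick one that fits your workload):

```shell
# Give leader eviction more time before tiup gives up waiting
# and restarts the node anyway.
tiup cluster reload tidb-test -R tikv --transfer-timeout 600

# Or skip waiting for leader eviction entirely; faster, but requests
# hitting the restarted node may error until leaders are re-elected.
tiup cluster reload tidb-test -R tikv --force
```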

| username: 月明星稀 | Original post link

Error: failed to get leader count 1.1.1.40: metric tikv_raftstore_region_count{type="leader"} not found

Errors like this are also being reported.
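To check whether that metric is actually being exported, you can query the TiKV status endpoint directly. A sketch, assuming the default status port 20180 (adjust to your deployment):

```shell
# The leader-count metric tiup looks for should appear in the
# Prometheus-format output of the TiKV status endpoint.
curl -s http://1.1.1.40:20180/metrics | grep tikv_raftstore_region_count
```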

| username: 月明星稀 | Original post link

If a machine in the cluster goes down, is it likely that the region leaders will have problems and PD won't start? The logs are as follows:

[2023/12/25 14:47:12.122 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
[2023/12/25 14:47:46.643 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]
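To verify that PD is healthy and has a leader, you can query it with pd-ctl. A sketch; `<version>` and `<pd-ip>` are placeholders for your cluster's version and a PD address (2379 is the default client port):

```shell
# List PD members and show which one is the current leader.
tiup ctl:<version> pd -u http://<pd-ip>:2379 member

# Report the health of each PD member.
tiup ctl:<version> pd -u http://<pd-ip>:2379 health
```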

| username: 小龙虾爱大龙虾 | Original post link

Leader eviction is too slow and times out. Can you increase the timeout parameter?

| username: oceanzhang | Original post link

Did it succeed using the --force parameter?

| username: Jellybean | Original post link

Error: failed to restart: 1.1.1.1.58 tikv-2000.service

This is the direct cause of the failure. Check the log content on the corresponding node to see whether there is any abnormal information.
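A quick way to look for that, assuming the log path mentioned earlier in the thread:

```shell
# Pull the most recent errors and warnings from the TiKV log around
# the failed restart; adjust the path and count to your deployment.
grep -E '\[ERROR\]|\[WARN\]' /usr/local/server/tikv-2000/log/tikv.log | tail -n 50
```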

| username: andone | Original post link

It timed out; add the timeout parameter.

| username: tidb菜鸟一只 | Original post link

Look here: tiup cluster reload | PingCAP Docs

| username: Kongdom | Original post link

It depends on the number of nodes and replicas. With fewer than half of the nodes down, I've only ever encountered PD failing to start once.

| username: dba远航 | Original post link

The reload timed out, but the result was successful. It just means a timeout occurred during the reload process; the operation ultimately completed successfully.