Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 集群突然挂了一个tikv报错is out of range 启动不了 (a TiKV in the cluster suddenly crashed with an "is out of range" error and cannot start)
【TiDB Usage Environment】Production Environment
【TiDB Version】
【Reproduction Path】What operations were performed when the issue occurred
None
【Encountered Issue: Issue Phenomenon and Impact】
During cluster operation, one TiKV suddenly crashed
[2024/01/22 17:54:59.962 +08:00] [FATAL] [lib.rs:465] ["to_commit 3553817 is out of range [last_index 3553815], raft_id: 8209515, region_id: 130773"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:464:18\n
【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】
This is most likely a bug. Check on GitHub to see whether it has already been reported. Your cluster version is also quite old.
I'm a bit confused and don't quite follow the situation. Is this an environment where you pulled the code and built/deployed the package yourself?
Did the node suddenly drop while the cluster was running normally, or did the issue occur under other circumstances?
TiDB's official packaging is also done through Jenkins, hence the jenkins paths in the backtrace.
This is unrelated. This is a bug that hasn’t been fixed for 2 years. The issue the original poster mentioned should be a different bug.
v5.4.2 doesn't feel that old, and upgrading a production environment isn't something to do lightly. I've seen other posts with the same issue.
It's the official release; the production environment was installed and deployed with tiup.
A TiKV node suddenly went down and couldn’t be restarted while the cluster was running normally. The tikv_stderr.log on the node showed this error.
Just scale the node in and then scale it back out; no need to worry.
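If you go that route, a minimal sketch with tiup might look like the following; the cluster name, node address, and scale-out topology file are placeholders (20160 is just the default TiKV port):
tiup cluster scale-in <cluster-name> --node <tikv-ip>:20160
tiup cluster display <cluster-name>          # wait for the offlined node to finish and be cleaned up
tiup cluster scale-out <cluster-name> scale-out.yaml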
Region 130773 has an issue. Delete this peer to fix it.
pd-ctl>> operator add remove-peer <region_id> <store_id>
Then use tikv-ctl on that TiKV instance to mark the Region’s replica as tombstone to skip the health check during startup:
tikv-ctl --data-dir /path/to/tikv tombstone -p 127.0.0.1:2379 -r <region_id>
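For reference, a rough sequence for the example Region 130773 might look like the following; the PD address and store ID are placeholders to look up first, and tikv-ctl in this local mode needs to be run with the TiKV process stopped:
pd-ctl -u http://<pd-ip>:2379 region 130773        # see which stores hold peers of the bad Region
pd-ctl -u http://<pd-ip>:2379 operator add remove-peer 130773 <store_id>
tikv-ctl --data-dir /path/to/tikv tombstone -p <pd-ip>:2379 -r 130773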
This roughly means: this node's local Raft log only goes up to last_index 3553815, but it was asked to commit up to 3553817, which is beyond the local log, so it panicked.
The node is already down; can scale-in still be executed? Does it need to be forced?
I tried deleting, but the more I delete, the more errored Regions appear; deletion can't keep up with the rate they increase. Should I stop the nodes first? How do I stop them?
Are there errors in more than one Region? Then just take this node down directly: physically destroy it, then run store delete xxx in pd-ctl to delete this store from PD. After this store becomes Tombstone, delete the TiKV data directory on it and bring up a new instance. The premise is that your cluster has only one faulty TiKV; if there is more than one, you might lose multiple replicas, and the above operations will not apply.
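A rough sketch of that sequence, assuming a single faulty store (the PD address and store ID are placeholders):
pd-ctl -u http://<pd-ip>:2379 store                  # find the faulty store's ID
pd-ctl -u http://<pd-ip>:2379 store delete <store_id>
pd-ctl -u http://<pd-ip>:2379 store <store_id>       # wait until state_name becomes Tombstone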
It is possible to force scale-in.
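For a node that is already down, something like the forced variant below would be needed (same placeholders as the earlier tiup sketch):
tiup cluster scale-in <cluster-name> --node <tikv-ip>:20160 --force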
This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.