A TiKV Node Fails to Start, Error [FATAL] [lib.rs:491] ["attempt to overwrite compacted entries in..."]

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: [tikv一个节点无法起来,报错[FATAL] [lib.rs:491] "attempt to overwrite compacted entries in

| username: devopNeverStop

[TiDB Usage Environment] Production
[TiDB Version] v6.1.0
[Reproduction Path] What operations were performed that caused the issue

[Encountered Issue: Problem Phenomenon and Impact]
One TiKV node is in the Down state.
[Resource Configuration]
[Attachments: Screenshots / Logs / Monitoring]
[FATAL] [lib.rs:491] [“attempt to overwrite compacted entries in 227990773”] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:490:18\n 1: std::panicking::rust_panic_with_hook\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:702:17\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:588:13\n 3: std::sys_common::backtrace::_rust_end_short_backtrace\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:138:18\n 4: rust_begin_unwind\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:584:5\n 5: core::panicking::panic_fmt\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:143:14\n 6: raft_engine::memtable::MemTable::prepare_append\n 7: raft_engine::memtable::MemTable::append\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/memtable.rs:334:13\n raft_engine::memtable::MemTableAccessor::apply_append_writes\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/memtable.rs:965:21\n 8: <raft_engine::memtable::MemTableRecoverContext as raft_engine::file_pipe_log::pipe_builder::ReplayMachine>::replay\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/memtable.rs:1112:33\n raft_engine::file_pipe_log::pipe_builder::DualPipesBuilder::recover_queue::{{closure}}\n at /rust/git/checkouts/raft-engine-35ec7b0b2c07ddd2/0e066f8/src/file_pipe_log/pipe_builder.rs:265:33\n core::ops::function::impls::<impl core::ops::function::FnMut for &F>::call_mut\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:247:13\n core::ops::function::impls::<impl core::ops::function::FnOnce for &mut F>::call_once\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:280:13\n core::option::Option::map\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/option.rs:906:29\n <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:103:9\n rayon::iter::plumbing::Folder::consume_iter\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:178:21\n <rayon::iter::map::MapFolder<C,F> as rayon::iter::plumbing::Folder>::consume_iter\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/map.rs:248:21\n rayon::iter::plumbing::Producer::fold_with\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:110:9\n rayon::iter::plumbing::bridge_producer_consumer::helper\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:438:13\n 9: rayon::iter::plumbing::bridge_producer_consumer::helper::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:418:21\n rayon_core::join::join_context::call_a::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:124:17\n 
<core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:492:40\n std::panicking::try\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:456:19\n std::panic::catch_unwind\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n rayon_core::unwind::halt_unwinding\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/unwind.rs:17:5\n rayon_core::join::join_context::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:141:24\n 10: rayon_core::registry::in_worker\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:879:13\n rayon_core::join::join_context\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:132:5\n rayon::iter::plumbing::bridge_producer_consumer::helper\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:416:47\n 11: rayon::iter::plumbing::bridge_producer_consumer::helper::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-1.5.0/src/iter/plumbing/mod.rs:427:21\n rayon_core::join::join_context::call_b::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/join/mod.rs:129:25\n <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::call::{{closure}}\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/job.rs:113:21\n <core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:492:40\n std::panicking::try\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:456:19\n std::panic::catch_unwind\n at /rust/toolchains/nightly-2022-02-14-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n rayon_core::unwind::halt_unwinding\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/unwind.rs:17:5\n <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/job.rs:119:38\n 12: rayon_core::job::JobRef::execute\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/job.rs:59:9\n rayon_core::registry::WorkerThread::execute\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:753:9\n rayon_core::registry::WorkerThread::wait_until_cold\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:730:17\n 13: rayon_core::registry::WorkerThread::wait_until\n at /rust/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.0/src/registry.rs:704:13\n rayon_core::registry::main

| username: 考试没答案 | Original post link

What operation were you performing? Did the error appear during some operation, or did it suddenly happen on its own? Please describe the operation in detail.

| username: devopNeverStop | Original post link

It just happened suddenly on its own.

| username: 考试没答案 | Original post link

Have you tried restarting? What is the current status? Is the service running normally?

| username: devopNeverStop | Original post link

Restarted the service and the server, but it still didn’t come up.
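For reference, a single down TiKV instance is usually restarted through TiUP rather than by restarting its process or the server directly; a minimal sketch, with the cluster name and node address as placeholders:

tiup cluster restart <cluster-name> -N <tikv-ip>:20160   # restart only the affected TiKV instance
tiup cluster display <cluster-name>                      # check whether the store comes back Up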

| username: h5n1 | Original post link

This is suspected to be a bug. It will probably need to be handled by scaling in and scaling out. For now, keep the current state and wait for official confirmation.

| username: devopNeverStop | Original post link

Yes, the plan is to add a new node first, wait for rebalancing to complete, and then remove the faulty node.
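For anyone following the same route, a minimal sketch of that plan with TiUP (the cluster name, topology file, and node address are placeholders, not values from this thread):

tiup cluster scale-out <cluster-name> scale-out.yaml             # add the new TiKV described in scale-out.yaml
# wait for leader/region rebalancing to finish (watch Grafana or pd-ctl store counts)
tiup cluster scale-in <cluster-name> -N <faulty-tikv-ip>:20160   # then remove the faulty TiKV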

| username: WalterWj | Original post link

Doesn't this version have a known bug in the Raft Engine? :thinking: It is recommended to upgrade to the latest 6.1.x release.
Alternatively, set the Raft Engine to a single thread.
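A minimal sketch of that second suggestion, assuming "set the Raft Engine to a single thread" refers to Raft Engine's recovery threads (my reading; the reply does not name the exact option), with the cluster name as a placeholder:

tiup cluster edit-config <cluster-name>
# In the editor, add under server_configs -> tikv (assumption: recovery-threads is the intended setting):
#   server_configs:
#     tikv:
#       raft-engine.recovery-threads: 1
tiup cluster reload <cluster-name> -R tikv   # push the change to all TiKV instances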

| username: devopNeverStop | Original post link

We don't dare to upgrade for now; we need to get all three nodes back to normal first.

| username: Minorli-PingCAP | Original post link

Scale in and out first, then recover.

| username: devopNeverStop | Original post link

The new node has been added and rebalancing is almost done: the leaders have already moved over, but the region count is still a bit short.

| username: devopNeverStop | Original post link

The new TiKV node has been added to the cluster; now the problematic old TiKV needs to be removed from the cluster and then rejoined.

  1. Scale-in the original TiKV
    After it ran, the TiKV remained stuck in the Pending Offline state.
  2. Scale-in --force the original TiKV
    The TiKV is no longer visible in TiUP.
  3. Scale-out the original TiKV back into the cluster
    The log reports an error that a TiKV with the same IP but a different ID already exists, so the new TiKV cannot start.
  4. In pd-ctl, the original TiKV's store information is still visible (it still shows over 300 regions). Deleting the original TiKV's store ID with pd-ctl returns success, but the information remains.

The original TiKV's data and deployment directories are now empty, so tikv-ctl cannot be used to clear the region information for the original TiKV that PD still shows.
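For context on step 4, a sketch of the pd-ctl call involved (the PD address is the one shown later in this thread; the store ID is whatever pd-ctl store reports for the faulty node):

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store delete <store-id>   # marks the store Offline
# The store only disappears (turns Tombstone) after its region_count drops to 0,
# which is why the information still shows while 300+ regions reference it.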

| username: devopNeverStop | Original post link

What about changing this server's IP? It should then be able to scale out into the cluster normally. Are there any other solutions?

| username: h5n1 | Original post link

The scale-in steps were incorrect; unsafe recovery needs to be performed.
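In v6.1, Online Unsafe Recovery is driven through pd-ctl; a minimal sketch, assuming the failed store's ID has been looked up with pd-ctl store (the ID here is a placeholder):

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 unsafe remove-failed-stores <store-id>   # recover regions by dropping the failed store from their membership
tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 unsafe remove-failed-stores show         # check recovery progress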

| username: devopNeverStop | Original post link

Now that the TiKV node is no longer visible in tiup cluster display, can I still use unsafe recovery?

| username: h5n1 | Original post link

TiUP is only a management/display layer; the actual metadata lives in PD. You can check with pd-ctl store.
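For example, listing all stores (including ones TiUP no longer displays) looks like this; the state_name and region_count fields show whether the faulty store is still registered:

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store   # lists every store PD knows about, with state and region counts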

| username: devopNeverStop | Original post link

Finally, the original abnormal TiKV was re-added to the cluster through the following steps:

  1. Remove all region peers of the original TiKV in PD
# Store 8 is the faulty TiKV. List every region that still has a peer on store 8,
# extract the region IDs, and schedule a remove-peer operator for each of them.
for i in $(tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 region store 8 | grep -B 1 start_key | grep id | awk '{print $2}' | sed 's/,//')
do
   tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 operator add remove-peer $i 8
done
  2. Clear all remaining information of the original TiKV in PD (see the check sketched after this list)
tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store remove-tombstone
  3. Scale out and re-add the original TiKV node to the cluster
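One step is implicit between 1 and 2: store remove-tombstone only clears stores that are already in the Tombstone state, and a store only becomes Tombstone once its region_count reaches 0. A small check before step 2, reusing the same PD address and the store ID 8 from the loop above:

tiup ctl:v6.1.0 pd -u 192.168.7.188:2379 store 8   # should show region_count 0 and, eventually, the Tombstone state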

The new node is now balancing normally. Thanks to everyone for the enthusiastic support, especially @h5n1, whose first suggestion helped me solve the problem!

| username: mayjiang0203 | Original post link

This is a known issue (raft engine panic during recovery · Issue #13123 · tikv/tikv · GitHub) that has been fixed in v6.1.1. It is recommended to upgrade to the latest release in the 6.1.x series.
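For reference, an in-place upgrade within the same series is a single TiUP command; a sketch, with the cluster name as a placeholder and v6.1.1 as an example target (any later 6.1.x patch also carries the fix):

tiup update cluster                          # update the TiUP cluster component first
tiup cluster upgrade <cluster-name> v6.1.1   # rolling upgrade of the whole cluster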

| username: devopNeverStop | Original post link

Thank you.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.