Can a damaged TiKV replica be manually restored using replicas from other TiKV nodes?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 某个TiKV的副本出现损坏,可以手动通过其他TiKV节点的副本来恢复么?

| username: dba-kit

2024-02-27 11:18:57 (UTC+08:00) TiKV 172.18..:20160 [region_snapshot.rs:234] ["failed to get value of key in cf"] [error="Engine(Status { code: IoError, sub_code: None, sev: NoError, state: \"Corruption: block checksum mismatch: stored = 3987136279, computed = 3440300721, type = 1 in /home/tidb/tidb-data/tikv-20160/db/17957911.sst offset 14845750 size 33058\" })"] [cf=default] [region=1294803335] [key=748000000000001BFF0B5F728000000015FF8A4FDA0000000000FAF9EDF818402FFFF1]

Today a business query suddenly started failing with the error shown above. Using pd-ctl's scheduler add evict-leader-scheduler <store-id>, all Leaders on that store were evicted, and set global tidb_replica_read = 'leader'; was executed so that the business would read only from Leaders, which temporarily restored service.
Currently, a new TiKV node has been added to replace the node with the corrupted data. However, I would like to know:

  1. Is there any way to recover this corrupted replica?
  2. Besides executing admin check table on all tables, are there any other means to detect corrupted replicas?
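
For reference, the mitigation described above maps roughly to the commands below. This is only a sketch: the PD address, store ID, and table name are placeholders, and pd-ctl may equally be invoked through tiup ctl on your deployment.

# pd-ctl -u http://<pd-address>:2379 store
# pd-ctl -u http://<pd-address>:2379 scheduler add evict-leader-scheduler <store-id>

The first command lists the stores so you can find the ID of the faulty TiKV; the second evicts all Leaders from it. Then, from a SQL client, force reads onto Leaders and run the per-table check mentioned in question 2:

SET GLOBAL tidb_replica_read = 'leader';
ADMIN CHECK TABLE <db_name>.<table_name>;
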
| username: dba-kit | Original post link

This method has been tested and does not seem feasible: it appears that admin check table only checks the Leader replica of each Region. After evicting the Leaders from the faulty TiKV, admin check table succeeded as well.

| username: xfworld | Original post link

Add a new node and take the old node offline, don’t mess around…

| username: dba-kit | Original post link

That's how it's being handled right now; we don't dare to experiment in the production environment :joy:
But I'm wondering: is there any way to detect and fix this kind of issue proactively, in real time?

| username: dba-kit | Original post link

After switching the Leader, manually trigger a Region split. It seems like this should fix the issue, since both of the new Regions would be generated from the new Leader. However, this has not been verified.
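
If someone did want to experiment with this (untested, as stated above), a split can be requested through pd-ctl's operator command; the Region ID and PD address below are placeholders, and flag support depends on your PD version:

# pd-ctl -u http://<pd-address>:2379 operator add split-region <region-id> --policy=approximate

Whether the resulting Regions are actually rebuilt from the Leader's data, rather than each peer splitting its own local (possibly corrupted) copy, is exactly the part that has not been verified.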

| username: xfworld | Original post link

Multiple replicas across multiple nodes are designed precisely to avoid this kind of problem: production keeps running while the fault stays isolated, which is already a pretty good mechanism.

As for early detection? How can you detect hardware failure… It’s a bit difficult… :see_no_evil:

| username: dba-kit | Original post link

Actually, there’s another issue here. When TiDB detects a Region corruption, it doesn’t automatically repair the data from other replicas, nor does it automatically switch the Leader to another replica. It was the business that discovered the problem and reported it to me, and I had to manually evict all Leaders on the affected TiKV to fix it.

| username: xfworld | Original post link

Raise it as a feature request and see; that would essentially amount to a self-healing capability…

| username: tidb菜鸟一只 | Original post link

So you originally had Follower Read enabled, then a TiKV node failed and affected the business, and you urgently switched it to read from the Leader only?

| username: dba-kit | Original post link

Follower-Read was enabled, but the problematic replica was a Leader, and at that time, admin check table was reporting errors. After evicting the Leader, admin check table no longer reported errors, but the business still reported query errors. Only after remembering that Follower-Read was enabled and disabling it did the business return to normal.

| username: xingzhenxiang | Original post link

This idea is very interesting. Here’s a suggestion: this can also improve TiDB’s maintainability. Keep it up!

| username: TiDBer_jYQINSnf | Original post link

This error is reported by RocksDB. If you really want to fix it, work out which Regions the keys in this SST belong to, then migrate those Regions off this TiKV. It might be recoverable.

# ./tikv-ctl ldb dump --path=../db/17957911.sst --hex
Take the printed keys and look them up in PD; checking just the first and last keys is enough to see whether they belong to the same Region.
# region key xxxxxx
If they do not belong to the same Region, use region sibling to walk the neighboring Regions until the whole key range is covered.
Once you have that group of Regions, you can try removing their peers on this store with remove-peer.

However, it is not recommended to mess around like this. If it’s not a production environment, you can test it yourself for fun. If it is a production environment, it’s more reliable to rebuild. Also, don’t you guys use RAID for your disks? How can files get corrupted so easily?
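
To make the flow above concrete, here is a rough sketch of the commands involved (not verified end to end; the PD address, keys, Region IDs, and store ID are placeholders, the pd-ctl invocation form is an assumption, and the --format flag depends on the pd-ctl version):

# ./tikv-ctl ldb dump --path=/home/tidb/tidb-data/tikv-20160/db/17957911.sst --hex
# pd-ctl -u http://<pd-address>:2379 region key --format=hex <first-key-from-dump>
# pd-ctl -u http://<pd-address>:2379 region key --format=hex <last-key-from-dump>
# pd-ctl -u http://<pd-address>:2379 region sibling <region-id>
# pd-ctl -u http://<pd-address>:2379 operator add remove-peer <region-id> <store-id>

The first command dumps the keys in the corrupted SST, the next two find the Regions the first and last keys fall into, region sibling walks the neighbors until the whole key range is covered, and remove-peer asks PD to drop that Region's peer from the faulty store so it gets re-replicated from a healthy copy.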

| username: 像风一样的男子 | Original post link

Back in the day, I also tried to fix SST files like this. Even after the repair, there were still issues, so I ended up formatting the disk and reinstalling everything.

| username: TiDBer_jYQINSnf | Original post link

Your fix is impressive. It’s surprising that TiKV can directly check for corrupted SST files. I wasn’t aware of this before. :+1::+1::+1:

| username: dba-kit | Original post link

It’s a long story :sob:. Alibaba Cloud silently migrated the data on the local SSD disk in the background, and surprisingly, the memory and disk were not migrated together :neutral_face:.

| username: TiDBer_jYQINSnf | Original post link

Alibaba is playing tricks, online migration, awesome!
If something goes wrong, claim compensation :grin:

| username: dba-kit | Original post link

Alibaba Cloud has been giving me a hard time for the past month. I don't know what's going on with Zone D, but every change results in maintenance events being pushed to all instances in that zone. There's no way to cancel them, and we're forced to upgrade to the so-called “latest architecture.”

| username: TiDBer_jYQINSnf | Original post link

So your data really was corrupted; it's a good thing you're using TiKV. If this had been single-replica data somewhere else, it would have been truly lost.

| username: TiDBer_5Vo9nD1u | Original post link

Scaling out a new node and then scaling in the old one should be fine.

| username: Fly-bird | Original post link

Take this node offline, then deploy a new node, and the data will be synchronized.
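
For completeness, the replace-the-node approach recommended throughout the thread is just a normal TiUP scale-out followed by a scale-in; the cluster name, topology file, and node address below are placeholders:

# tiup cluster scale-out <cluster-name> scale-out.yaml
# tiup cluster scale-in <cluster-name> --node <faulty-tikv-ip>:20160

PD replicates the affected Regions onto the new store, and the old store goes through Offline and then Tombstone once its Regions have been moved away.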