Aliyun Production Database Encounters RocksDB Read Error

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 阿里云生产数据库出现rockdb读取出错

| username: 雨一直下

【TiDB Usage Environment】Production Environment
【TiDB Version】tidb v3.0.8
【Encountered Issue】One of the nodes reported the following error:

[2022/07/14 10:31:13.553 +08:00] [WARN] [endpoint.rs:454] [error-response] [err="[src/storage/kv/raftkv.rs:378]: RocksDb Corruption: block checksum mismatch: expected 2341394949, got 2266477260 in /data/tidb/deploy/data/db/11559286.sst offset 1232057 size 29030"]

This caused a batch of SQL statements to fail. The other nodes are working normally, but several SST files on this node show the same problem. It is hard to tell whether the cause is a bad disk sector or a TiDB bug: the other files on this node are fine, and the node itself appears healthy.
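
Since several SST files are affected, it may help to list exactly which files and blocks the log complains about. The following is a minimal Python sketch, not an official TiKV tool. The regular expression is inferred only from the single log line quoted above, so the field layout is an assumption and may differ in other versions. The name `find_corrupt_blocks` is made up for this example.

```python
import re
from collections import defaultdict

# Pattern inferred from the one log line quoted in the post (an assumption,
# not a documented log format).
MISMATCH_RE = re.compile(
    r"block checksum mismatch: expected (?P<expected>\d+), got (?P<got>\d+)"
    r"\s+in (?P<path>\S+\.sst) offset (?P<offset>\d+) size (?P<size>\d+)"
)


def find_corrupt_blocks(lines):
    """Group checksum-mismatch reports by SST file path.

    Returns a dict mapping each SST path to a sorted list of
    (offset, size) tuples, with repeated reports collapsed.
    """
    hits = defaultdict(set)
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            block = (int(match.group("offset")), int(match.group("size")))
            hits[match.group("path")].add(block)
    return {path: sorted(blocks) for path, blocks in hits.items()}


# The log line from the post, used as sample input.
sample = (
    '[2022/07/14 10:31:13.553 +08:00] [WARN] [endpoint.rs:454] [error-response] '
    '[err="[src/storage/kv/raftkv.rs:378]: RocksDb Corruption: block checksum '
    'mismatch: expected 2341394949, got 2266477260 in '
    '/data/tidb/deploy/data/db/11559286.sst offset 1232057 size 29030"]'
)

for path, blocks in find_corrupt_blocks([sample]).items():
    print(path, blocks)
```

If the same offsets keep failing in the same files, that points more toward persistent damage on disk. Errors that come and go would fit a transient read problem in the storage layer better.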

| username: Meditator | Original post link

  1. The storage underneath cloud hosts is complex and is itself distributed, so it can have consistency problems of its own.
  2. Check the logs of the corresponding TiKV instance for anomalies or panics.
  3. Check the OS logs, such as dmesg or the kernel log, for hardware warnings.
  4. You may need to scale the node in and back out so that its data is re-replicated.
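
The OS log check in step 3 can be sketched as a small Python filter. This is only an illustration: the pattern list below is an assumed, non-exhaustive set of strings that commonly accompany storage trouble in Linux kernel logs, and `suspicious_lines` is a hypothetical helper name.

```python
import re

# Assumed, non-exhaustive patterns; extend them for your filesystem and drivers.
SUSPECT_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"EXT4-fs error"),
    re.compile(r"XFS .*corrupt", re.IGNORECASE),
    re.compile(r"medium error", re.IGNORECASE),
]


def suspicious_lines(log_text):
    """Return the lines of log_text that match any suspect pattern."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS)
    ]
```

One way to use it is to save the output of `dmesg` to a file and pass the file's contents to `suspicious_lines`. An empty result does not rule out a storage fault, because errors in the cloud provider's storage layer may never reach the guest kernel log.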

| username: 雨一直下 | Original post link

  1. I have checked the TiKV logs, and there are no related errors.
  2. I have checked dmesg, and there are no issues. The last startup was more than 2 years ago.
  3. Unable to confirm.

The node has now recovered without any intervention, but I am worried that the problem will come back.

| username: Meditator | Original post link

Then point 1 is the most likely cause, but there is no evidence to prove it.

| username: cs58_dba | Original post link

We use Alibaba Cloud’s RDS database, which is said to run on distributed Ceph storage underneath. It looks like the underlying Ceph storage reported an error.

| username: Meditator | Original post link

Networking and storage are the core technologies of these leading cloud providers and the main competitive advantage of their cloud products. They rarely use open source solutions and do not disclose their technology publicly.

| username: cs58_dba | Original post link

I have seen on-site operations staff run commands like ceph -s, so it is probably custom development based on Ceph.

| username: system | Original post link

This topic was automatically closed 1 minute after the last reply. No new replies are allowed.