Database Suddenly Reports Error: Region is Unavailable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 数据库突然报错Region is unavailable

| username: 像风一样的男子

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4.3
[Reproduction Path]
During normal business operations, a delete operation suddenly encountered an error: Region is unavailable. The monitoring alarm TiDB_tikvclient_region_err_total was triggered. After a few minutes, it automatically recovered. I checked the regions using pd-ctl and found no issues. What could be the cause of this error?

| username: TiDBer_jYQINSnf

Did a TiKV node crash? Or more than one?

| username: 像风一样的男子

All nodes in the cluster are running normally; only queries against this particular range of data return the error.

| username: tidb菜鸟一只

This error does not necessarily mean there is a problem with the region itself. It can also occur when TiKV is busy, when TiKV runs out of memory (OOM), or when there are network issues between TiDB and TiKV. TiDB retries internally with backoff, so as long as the errors are not too frequent within a given period, database usage is not affected. If an alert fired, the number of such errors exceeded the alert threshold over that period. I recommend checking the aspects covered in this article:

Column - Summary of Troubleshooting Region is unavailable | TiDB Community
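
To illustrate what "backoff retries" means in general, here is a minimal Python sketch of retrying with capped exponential backoff and jitter. It is not TiDB's actual implementation; the `RegionUnavailable` exception, the function name, and all parameter values are made up for the example.

```python
import random
import time


class RegionUnavailable(Exception):
    """Hypothetical stand-in for a transient 'Region is unavailable' error."""


def retry_with_backoff(op, max_attempts=5, base_delay=0.01, max_delay=1.0,
                       sleep=time.sleep):
    """Call op() until it succeeds or the attempt budget is used up."""
    for attempt in range(max_attempts):
        try:
            return op()
        except RegionUnavailable:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the error to the caller
            # Delay doubles each attempt, is capped at max_delay, and is
            # randomized (jitter) so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The caller only sees the error once every attempt has failed, which is consistent with occasional errors being invisible to the application while a sustained burst surfaces as a failed statement.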

| username: TiDBer_jYQINSnf

Does the error occur for just one query, or continuously over a period of time? If it is just one query, check whether the replicas of the affected region are healthy. Since no nodes in your cluster have failed, in principle no replicas should be missing, so a busy TiKV is the more likely cause.

| username: 像风一样的男子

Yes, just that range of data. Deleting it directly fails with the error, leaving over 4000 rows undeleted. When I run a select * on it, the query runs for several minutes without returning any results.

| username: TiDBer_jYQINSnf

You can use this link to check whether any region has an incorrect number of replicas: PD Control User Guide | PingCAP Docs.

Check the monitoring to see if there is any particularly busy TiKV.

Also look at the link posted by tidb菜鸟一只; the cause will be one of the situations listed there.
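
As a rough illustration of what the replica-count check means, the Python sketch below scans region information saved as JSON and lists regions with fewer peers than expected. It assumes the JSON has a top-level `regions` list whose entries carry `id` and `peers` fields, and that the expected replica count is 3; both are assumptions to verify against your own pd-ctl output and configuration.

```python
import json


def regions_missing_replicas(pd_region_json, expected_replicas=3):
    """Return the IDs of regions that have fewer peers than expected."""
    data = json.loads(pd_region_json)
    short = []
    # Assumed shape: {"regions": [{"id": ..., "peers": [...]}, ...]}
    for region in data.get("regions", []):
        if len(region.get("peers", [])) < expected_replicas:
            short.append(region["id"])
    return short


# Made-up sample input, not real cluster output.
sample = json.dumps({
    "count": 2,
    "regions": [
        {"id": 101, "peers": [{"store_id": 1}, {"store_id": 2}, {"store_id": 3}]},
        {"id": 102, "peers": [{"store_id": 1}, {"store_id": 2}]},
    ],
})
print(regions_missing_replicas(sample))
```

An empty result would match what the original poster reported: no replica problems visible from PD's side.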

| username: 像风一样的男子

I followed the steps in the manual and the article above and found no issues. It might be due to the large amount of data being deleted.

| username: h5n1

Check whether the TiDB logs contain a region ID for the failing requests, then search the TiKV logs for that region ID to find related entries.
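
A minimal Python sketch of that search is below. It simply keeps log lines in which the region ID appears as a whole number, so it makes no assumption about the exact field name used in the logs. The sample lines are made up for illustration and are not real TiDB or TiKV output.

```python
import re


def lines_for_region(log_lines, region_id):
    """Return the log lines that mention region_id as a standalone number."""
    # The lookarounds stop 2435254 from matching inside 12435254 or 24352540.
    pattern = re.compile(r"(?<!\d)" + re.escape(str(region_id)) + r"(?!\d)")
    return [line for line in log_lines if pattern.search(line)]


# Hypothetical sample lines, for illustration only.
sample_lines = [
    "[2023/05/10 10:00:00.000 +08:00] [WARN] [region_id=2435254] retry",
    "[2023/05/10 10:00:01.000 +08:00] [WARN] [region_id=12435254] retry",
    "[2023/05/10 10:00:02.000 +08:00] [INFO] unrelated message",
]
for line in lines_for_region(sample_lines, 2435254):
    print(line)
```

Run the same filter over each TiKV node's log to see which store handled the region at the time of the error.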

| username: 像风一样的男子

The region ID in the logs is 2435254.

| username: h5n1

The TiKV logs for 10.100 seem to indicate that it is caused by region splitting.

| username: 像风一样的男子

From the logs, it appears that the issue is caused by locks and backoff in TiKV. This should be caused by the SQL in the business, right?

| username: h5n1

It looks like a block in one of the data files is corrupted. Is this error reported continuously?

| username: 像风一样的男子

After a period of time, it stops reporting.

| username: h5n1

Please share the subsequent logs as well.

| username: 像风一样的男子

tikv.log (8.3 MB)

| username: h5n1

After checking the TiKV logs, I also found the “block checksum mismatch” error, but I couldn’t find any information about region 2435254. Check whether the TiDB logs contain “Region is unavailable” errors and whether their timestamps match the error entries in the TiKV logs. TiKV has a command to check for bad SST files, but it requires stopping TiKV first:
tikv-ctl bad-ssts --db /XXXXX --pd XXXXX:2379
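
For the timestamp comparison, a rough Python sketch follows. It assumes both logs start each line with a timestamp of the form `[2023/05/10 10:00:01.000 +08:00]`; the 5-second window, the function names, and the sample lines are arbitrary choices for the example and are not taken from the attached log.

```python
import re
from datetime import datetime, timedelta

# Matches a leading "[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]" field.
TS_RE = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{2}:\d{2})\]"
)


def parse_ts(line):
    """Return the line's timestamp as an aware datetime, or None if absent."""
    m = TS_RE.match(line)
    if not m:
        return None
    return datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f %z")


def correlate(tidb_lines, tikv_lines, needle_tidb, needle_tikv, window_seconds=5):
    """Pair each matching TiDB line with TiKV lines logged within the window."""
    window = timedelta(seconds=window_seconds)
    tikv_hits = []
    for line in tikv_lines:
        ts = parse_ts(line)
        if needle_tikv in line and ts is not None:
            tikv_hits.append((ts, line))
    pairs = []
    for line in tidb_lines:
        ts = parse_ts(line)
        if needle_tidb not in line or ts is None:
            continue
        near = [l for t, l in tikv_hits if abs(t - ts) <= window]
        pairs.append((line, near))
    return pairs


# Hypothetical sample lines, for illustration only.
tidb_log = ['[2023/05/10 10:00:01.000 +08:00] [WARN] ["Region is unavailable"]']
tikv_log = [
    '[2023/05/10 10:00:03.500 +08:00] [ERROR] ["block checksum mismatch"]',
    '[2023/05/10 11:00:00.000 +08:00] [ERROR] ["block checksum mismatch"]',
]
pairs = correlate(tidb_log, tikv_log, "Region is unavailable", "block checksum mismatch")
```

If the TiDB errors consistently have a TiKV checksum error close by in time, that supports the corrupted-SST explanation; if they never line up, the two problems are probably unrelated.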

| username: 像风一样的男子

This morning I ran a select on that data and the error reproduced.

After a while, I manually deleted the data, and the error has not been triggered since.

| username: 像风一样的男子

Data is frequently inserted into and deleted from this table. Could that also be triggering frequent region splits and merges?

| username: h5n1

A region splits once it reaches a certain size, so that is expected. I suggest you first check each TiKV node for damaged SST files.