While writing data to the server, one node OOMed, causing three large tables to all return ERROR 9005 (HY000): Region is unavailable

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 服务器写入数据,其中一个节点oom后,导致其中三张大表全部都 ERROR 9005 (HY000) : Region is unavailable。

| username: Steve阿辉

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.2
[Reproduction Path] Operations that led to the issue: Recently, several TiKV nodes have been in the process of being scaled in and out, and the data had not yet finished rebalancing. While data was being written in this state, one of the TiKV nodes OOMed and then restarted automatically and came back up successfully.
[Encountered Problem: Symptoms and Impact] However, three large tables cannot be opened, and one table can be opened but reports “Region is unavailable.” Other tables in the database can be opened and queried normally.

Is this issue related to file locks or Region corruption? How can I investigate and analyze this? Since I have limited experience, please include the exact commands to run. Thank you.
[Resource Configuration]

[Attachments: Screenshots/Logs/Monitoring]

| username: 我是咖啡哥 | Original post link

Refer to this for troubleshooting.
Your configuration doesn’t seem like a production setup…
How large is the data volume? How big is the large table you mentioned?

1.1 Client reports Region is Unavailable error

  • 1.1.1 Region is Unavailable is generally reported because a Region has been unavailable for a period of time: you might encounter TiKV server is busy, requests to TiKV rejected with not leader or epoch not match, or requests to TiKV timing out. TiDB performs backoff retries internally and only returns the error to the client when the total backoff time exceeds a threshold (20s by default); if the backoff stays within the threshold, the client never sees the error.
  • 1.1.2 Multiple TiKV instances run out of memory (OOM) at the same time, leaving the Region without a Leader during the OOM period; see case-991. (A quick way to check for leaderless Regions is sketched after this list.)
  • 1.1.3 TiKV reports TiKV server is busy error, exceeding the backoff time, refer to 4.3 Client reports server is busy error. TiKV server is busy is part of the internal flow control mechanism and may not be counted in the backoff time in the future.
  • 1.1.4 Multiple TiKV instances fail to start, leaving the Region without a Leader. For example, when several TiKV instances are deployed on one physical host with incorrect label configuration, a failure of that host can leave the Region with no Leader; see case-228.
  • 1.1.5 A Follower's apply progress lags behind; after it becomes the Leader, it rejects incoming requests with epoch not match; see case-958 (TiKV needs to optimize this mechanism internally).
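
To narrow down which of these cases applies, here is a minimal sketch of how to check whether the Regions of one affected table currently have a Leader, assuming Python with pymysql and the INFORMATION_SCHEMA.TIKV_REGION_STATUS / TIKV_REGION_PEERS system tables; the connection settings and the mydb / big_table names are placeholders:

```python
# Minimal sketch: list Regions of one table that currently have no Leader peer.
# Assumptions: pymysql is installed, TiDB is reachable on 127.0.0.1:4000,
# and 'mydb' / 'big_table' are placeholders for the affected schema and table.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="")
try:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT s.REGION_ID
            FROM information_schema.tikv_region_status s
            LEFT JOIN information_schema.tikv_region_peers p
              ON s.REGION_ID = p.REGION_ID AND p.IS_LEADER = 1
            WHERE s.DB_NAME = %s AND s.TABLE_NAME = %s
              AND p.REGION_ID IS NULL
            """,
            ("mydb", "big_table"),
        )
        leaderless = [row[0] for row in cur.fetchall()]
        print("Regions without a Leader:", leaderless)
finally:
    conn.close()
```

If every Region of the table still has a Leader, the error is more likely case 1.1.1 or 1.1.3 (backoff exceeded) than a missing Leader.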
| username: Steve阿辉 | Original post link

Could you please help me identify the issue based on my situation, or ask some questions to help me pinpoint the problem? Copying and pasting so much information doesn’t help us locate the issue.

| username: 我是咖啡哥 | Original post link

Based on your description, it is most likely the following situation:

1.1.2 Multiple TiKV instances running out of memory (OOM) simultaneously, causing the Region to have no Leader during the OOM period.

First, check the TiKV logs to see whether the TiKV nodes are crashing frequently;
also, check the slow SQL queries to see whether they can be optimized.
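
A rough sketch of how those two checks could be scripted, assuming the same Python/pymysql setup and that the CLUSTER_LOG and SLOW_QUERY system tables are available in your deployment; the time window and connection settings are placeholders, and CLUSTER_LOG needs an explicit time range:

```python
# Rough sketch: look for TiKV restarts and the slowest recent statements.
# Assumptions: pymysql is installed, TiDB is reachable on 127.0.0.1:4000, and
# the time window below is a placeholder to adjust to the incident period.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="")
try:
    with conn.cursor() as cur:
        # TiKV prints a "Welcome" banner when it starts, so each hit here is
        # roughly one restart. CLUSTER_LOG requires an explicit time range.
        cur.execute(
            """
            SELECT time, instance, message
            FROM information_schema.cluster_log
            WHERE type = 'tikv'
              AND time > '2022-12-01 00:00:00'
              AND time < '2022-12-08 00:00:00'
              AND message LIKE '%Welcome%'
            """
        )
        for time_, instance, _message in cur.fetchall():
            print("tikv start:", time_, instance)

        # Slowest statements in the same window, to see what might be optimized.
        cur.execute(
            """
            SELECT time, query_time, LEFT(query, 120)
            FROM information_schema.slow_query
            WHERE time > '2022-12-01 00:00:00'
              AND time < '2022-12-08 00:00:00'
            ORDER BY query_time DESC
            LIMIT 10
            """
        )
        for row in cur.fetchall():
            print("slow:", row)
finally:
    conn.close()
```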

| username: Steve阿辉 | Original post link

We resolved the issue last night and verified that the cluster is working again. The main problem was that TiKV wrote over 300 GB of logs due to an error, which filled up the node's disk and prevented it from starting. On top of that there was the “Region is unavailable” issue. After clearing some cached information from PD, it miraculously started. This is the current situation.
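
For anyone who runs into the same thing, here is a small sketch of how store disk space can be watched through PD's HTTP API, so a node whose logs are flooding the disk gets noticed before it fills up; the PD address is a placeholder, and the JSON field names are as observed in recent PD versions, so verify them against your own cluster:

```python
# Small sketch: print per-store capacity/available space from PD's HTTP API.
# Assumptions: the requests library is installed, PD listens on the address
# below (placeholder), and the response layout matches recent PD versions.
import requests

PD_ADDR = "http://127.0.0.1:2379"  # placeholder

resp = requests.get(f"{PD_ADDR}/pd/api/v1/stores", timeout=5)
resp.raise_for_status()

for item in resp.json().get("stores", []):
    store = item.get("store", {})
    status = item.get("status", {})
    print(
        store.get("address"),
        "state:", store.get("state_name"),
        "capacity:", status.get("capacity"),
        "available:", status.get("available"),
    )
```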

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.