[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.2
[Reproduction Path] Operations that led to the issue: Recently, multiple TiKV nodes were being brought online and taken offline, and the data had not yet been rebalanced. While data was being written in this state, one TiKV node hit an OOM and then restarted automatically.
[Encountered Problem: Symptoms and Impact] However, three large tables cannot be opened, and one table can be opened but reports “Region is unavailable.” Other tables in the database can be opened and queried normally.
Is this issue related to file locks or Region corruption? How should I investigate and analyze it? I have limited experience, so please include the exact commands to run. Thank you.
[Resource Configuration]
Refer to this for troubleshooting.
Your configuration doesn’t seem like a production setup…
How large is the data volume? How big is the large table you mentioned?
1.1 Client reports Region is Unavailable error
1.1.1 Region is Unavailable is generally reported because a Region has been unavailable for some time: TiKV may return TiKV server is busy, reject requests with not leader or epoch not match, or time out. TiDB retries internally with backoff; if the total backoff time exceeds a threshold (20s by default), the error is returned to the client. If the retry succeeds within the threshold, the client never sees the error.
1.1.2 Multiple TiKV instances OOM at the same time, so the Region has no Leader while they are down; see case-991. The sketch after this list shows one way to check whether the affected Regions currently have a Leader.
1.1.3 TiKV reports the TiKV server is busy error and the backoff time is exceeded; refer to 4.3 Client reports server is busy error. TiKV server is busy is part of TiKV's internal flow control mechanism and may no longer be counted toward the backoff time in the future.
1.1.4 Multiple TiKV instances fail to start, leaving the Region without a Leader. For example, when several TiKV instances are deployed on one physical host with incorrect label configuration, a failure of that host can leave the Region with no Leader; see case-228.
1.1.5 A Follower whose apply lags behind becomes the Leader and rejects incoming requests with epoch not match; see case-958 (TiKV still needs to optimize this mechanism internally).
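As a rough sketch of how to check which of these cases applies, assuming a TiUP-managed v6.1.2 cluster; the PD/TiDB addresses and the table name `mydb.big_table` below are placeholders for your own values:

```bash
# 1. Check store states: is any TiKV store Down, Disconnected, or out of space?
tiup ctl:v6.1.2 pd -u http://<pd-host>:2379 store

# 2. Look for Regions with missing, down, or pending peers.
tiup ctl:v6.1.2 pd -u http://<pd-host>:2379 region check miss-peer
tiup ctl:v6.1.2 pd -u http://<pd-host>:2379 region check down-peer
tiup ctl:v6.1.2 pd -u http://<pd-host>:2379 region check pending-peer

# 3. For one affected table, list its Regions and their Leader stores.
#    A Region with no Leader, or a Leader on a down store, is what
#    produces "Region is unavailable".
mysql -h <tidb-host> -P 4000 -u root -p -e \
  "SHOW TABLE mydb.big_table REGIONS\G"

# 4. The same information via information_schema, joined with store status.
mysql -h <tidb-host> -P 4000 -u root -p -e \
  "SELECT r.REGION_ID, r.IS_LEADER, s.STORE_ID, s.STORE_STATE_NAME
     FROM information_schema.TIKV_REGION_PEERS r
     JOIN information_schema.TIKV_STORE_STATUS s ON r.STORE_ID = s.STORE_ID
    WHERE r.REGION_ID IN
          (SELECT REGION_ID FROM information_schema.TIKV_REGION_STATUS
            WHERE DB_NAME = 'mydb' AND TABLE_NAME = 'big_table');"
```

If some Regions of the table show no Leader, or Leaders only on stores that are not Up, that matches 1.1.2/1.1.4 above; if every Region has a healthy Leader, the backoff/busy paths in 1.1.1/1.1.3 are more likely.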
Could you please help me identify the issue based on my situation, or ask some questions to help me pinpoint the problem? Copying and pasting so much information doesn’t help us locate the issue.
We resolved the issue last night and verified that the cluster had recovered. The main problem was that an error caused TiKV to write more than 300 GB of logs, which filled up the node's disk and prevented it from starting; that was also behind the “Region is unavailable” errors. After clearing some cached information from PD, it started again, somewhat miraculously. This is the current situation.
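For anyone hitting the same thing, a minimal sketch of how to check and cap TiKV log growth. The deploy path, cluster name, and retention values are placeholders, and the `log.file.*` keys are assumed from TiKV's `[log.file]` configuration section, so verify them against the docs for your exact version:

```bash
# How much space are the TiKV logs using, and how full is the disk?
du -sh /tidb-deploy/tikv-20160/log
df -h /tidb-deploy

# What error was being written repeatedly? (first 20 ERROR lines)
grep -m 20 "\[ERROR\]" /tidb-deploy/tikv-20160/log/tikv.log

# Remove rotated log files older than 3 days, keeping the live tikv.log
# (adjust the name pattern to match how your version names rotated files).
find /tidb-deploy/tikv-20160/log -name "tikv*.log.*" -mtime +3 -delete

# Cap log retention going forward. With TiUP, add under server_configs -> tikv
# (assumed keys from TiKV's [log.file] section; double-check before applying):
#   log.file.max-days: 7
#   log.file.max-backups: 10
tiup cluster edit-config <cluster-name>
tiup cluster reload <cluster-name> -R tikv
```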