The screenshot above (from the documentation) says:
A TiKV node contains two RocksDB instances: one stores the Raft logs and is located at data/raft, and the other stores the actual data and is located at data/db. You can check the specific cause of a stall by grepping the RocksDB logs for "Stalling". The RocksDB log files are the ones whose names start with LOG, and LOG itself is the current log.
I understand what this sentence says, but it doesn't match my actual TiKV environment: the files there are different from what the documentation describes. How exactly do I check the specific cause of a stall by grepping the RocksDB logs for "Stalling"? Which file should I search?
The monitoring does show "server is busy", and I want to pin down exactly where the problem is occurring. The documentation says to check the specific cause of the stall by grepping the RocksDB logs for "Stalling". Which file do I run this against, and how?
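As a rough sketch of what the quoted documentation seems to intend (the paths below are placeholders for your deployment's actual data directory, and LOG* also matches rotated logs such as LOG.old.*):

```shell
# Placeholder paths; substitute the data directory of your own TiKV deployment.
# Each RocksDB instance keeps its info log next to its data:
#   <data-dir>/db/LOG    -> the kv instance (actual data)
#   <data-dir>/raft/LOG  -> the raft instance (Raft logs)
grep "Stalling" /path/to/tikv/data/db/LOG*
grep "Stalling" /path/to/tikv/data/raft/LOG*
```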
In addition to the client reporting the "server is busy" error, you also need to check whether any other related anomalies occurred around the time of the issue. The following, for example, can all trigger a "server is busy" error (a rough grep sketch follows this list):
Write stall occurs
Scheduler too busy occurs
Raftstore is busy occurs
Coprocessor backlog occurs
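As a rough illustration only, one way to scan the TiKV log for traces of the causes above; the exact wording of these messages differs between TiKV versions, so treat the keywords as a starting point rather than definitive strings:

```shell
# Placeholder log path; adjust to your deployment's log directory.
# Case-insensitive keyword scan for the possible causes listed above;
# the exact message text varies by TiKV version.
grep -iE "write stall|scheduler is busy|raftstore is busy|coprocessor" \
  /path/to/tikv/log/tikv.log*
```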
For specific methods of confirmation and solutions, you can refer to the official documentation:
This is likely caused by high load on the TiKV server at the time, which may come from high concurrent access, heavy data processing, or other system load. Check the TiKV server's resource usage (CPU, memory, and so on) during that period, and also look at the monitoring graphs to see whether there are any hotspot issues.
In newer versions, RocksDB's write stall flow control has been replaced by flow control at the scheduler layer. Refer to: TiKV 配置文件描述 | PingCAP 文档中心 (TiKV Configuration File | PingCAP Docs).
At the same time, the TiKV-Details panel has gained a dedicated Flow Control tab where you can view the related monitoring.
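A hedged sketch of how to check whether the scheduler-layer flow control is in effect; the [storage.flow-control] section name comes from the linked TiKV configuration doc, the file and connection parameters are placeholders, and if the section is absent from tikv.toml the defaults apply:

```shell
# Placeholder config path; adjust to your deployment.
# If nothing matches, the instance is running with the default flow-control settings.
grep -A 5 "\[storage.flow-control\]" /path/to/tikv/conf/tikv.toml

# Alternatively, query the live config through TiDB (SHOW CONFIG is available
# in recent versions); host, port, and user here are placeholders.
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SHOW CONFIG WHERE type='tikv' AND name LIKE 'storage.flow-control%';"
```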
Check the TiDB logs and search for the keyword "server is busy" to see the reason for the busy status. The root cause is usually high pressure on TiKV or a hotspot issue.
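For example, a simple check along those lines (the log path is a placeholder for your deployment's TiDB log directory):

```shell
# Placeholder path; adjust to the TiDB log directory of your deployment.
grep -i "server is busy" /path/to/tidb/log/tidb.log*
```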