Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiKV_scheduler_latch_wait_duration_seconds
[TiDB Usage Environment] Production Environment
[TiDB Version]
v4.0.13
[Reproduction Path] What operations were performed
No operations were performed; a few days ago, a large number of TiKV_scheduler_command_duration_seconds alerts (duration exceeding 1s) started appearing.
[Encountered Problem: Problem Phenomenon and Impact]
Currently, there is no impact on the business
[Resource Configuration]
See the screenshot below.
[Attachment: Screenshot/Log/Monitoring]
I looked into this metric increase and am considering adjusting the scheduler-concurrency parameter. Has anyone tuned it before?
Increasing scheduler-concurrency can speed up the execution of scheduling.
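Before changing anything, it may help to confirm the current value first. A minimal sketch, assuming TiDB v4.0+ where SHOW CONFIG is available; the item corresponds to scheduler-concurrency under [storage] in the TiKV configuration, and changing it is typically done by editing the config and doing a rolling restart (check whether your version supports changing it online):

```sql
-- Show the current scheduler-concurrency value on every TiKV instance.
-- Adjust the filter if the item name differs in your version.
SHOW CONFIG WHERE type = 'tikv' AND name LIKE '%scheduler-concurrency%';
```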
A high value in this monitoring chart indicates that Prewrite, the first phase of the distributed transaction's two-phase commit, is relatively slow.
This phase involves two tasks:
- MVCC multi-version check
- Lock conflict detection
Therefore, the troubleshooting approach is roughly as follows:
- First, confirm whether there are slow writes in the cluster. This can be analyzed from the overall cluster latency, gRPC latency, and the TiKV-Details monitoring panel.
- If there is a slow write issue, optimize the slow write situation.
- Check whether there are many lock conflicts in the business SQL. This can be confirmed from the backoff-related monitoring on the TiDB panel (a slow-query sketch follows this list).
- If there are, adjust business concurrency or optimize the SQL.
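As a complementary check, here is a minimal sketch, assuming TiDB v4.0+ where information_schema.cluster_slow_query is available; the column names follow the slow-log fields, and the thresholds are arbitrary assumptions to adjust for your workload:

```sql
-- Find recent statements that spent time in backoff (often lock conflicts)
-- or in the Prewrite phase. Verify the column names on your version with
-- DESC information_schema.cluster_slow_query.
SELECT time, query_time, prewrite_time, backoff_time, backoff_types, query
FROM information_schema.cluster_slow_query
WHERE time > NOW() - INTERVAL 1 DAY
  AND (backoff_time > 0 OR prewrite_time > 0.1)
ORDER BY backoff_time DESC
LIMIT 10;
```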
However, we haven’t noticed any particularly slow writes, and there hasn’t been much increase in slow logs. Additionally, the business side hasn’t reported any issues. They usually notify us if there’s even a slight slowdown.
The value of scheduler_latch_wait_duration is generally at the microsecond level, but your cluster screenshot shows it has reached the millisecond level.
Focus on the TiDB monitoring panel and check the corresponding KV Errors section, especially the KV Backoff-related monitoring.
It seems like there is no change.
The KV Backoff alert has been turned off.
I looked at the dashboard; this SQL was added recently. Is it related to the "FOR UPDATE" in it?
This one is also high on the 21st.
The picture clearly shows that there is a high level of lock contention. The business side should be aware of this. Confirm the specific SQL and then optimize the business access method, appropriately reduce concurrency, or optimize the SQL.
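If it helps, a hedged sketch for pinpointing the suspect statement from the statement summary tables (assuming TiDB v4.0+; column names such as AVG_LATENCY and AVG_BACKOFF_TIME may differ slightly across versions, and latencies are reported in nanoseconds):

```sql
-- Rank FOR UPDATE statement digests by average latency and backoff time
-- to confirm which SQL is driving the lock contention.
SELECT digest_text, exec_count, avg_latency, avg_backoff_time
FROM information_schema.cluster_statements_summary
WHERE UPPER(digest_text) LIKE '%FOR UPDATE%'
ORDER BY avg_latency DESC
LIMIT 10;
```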
FOR UPDATE generally tends to cause lock conflicts, right? Confirm with the developers why they are using it…
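For illustration only, a hypothetical two-session example (table t and column id are made-up names) of why concurrent SELECT ... FOR UPDATE on the same row serializes transactions and shows up as lock waits:

```sql
-- Session A (pessimistic transaction): locks the row until commit.
BEGIN PESSIMISTIC;
SELECT * FROM t WHERE id = 1 FOR UPDATE;

-- Session B: the same statement now blocks, waiting for A's lock,
-- until A commits/rolls back or the lock wait times out.
BEGIN PESSIMISTIC;
SELECT * FROM t WHERE id = 1 FOR UPDATE;   -- waits here

-- Session A releases the lock; only then does Session B proceed.
COMMIT;
```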
Yes, it was caused by this statement, averaging 2 seconds. After migrating this specific business to MySQL, the alerts disappeared.