TiKV_scheduler_latch_wait_duration_seconds

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV_scheduler_latch_wait_duration_seconds

| username: 路在何chu

[TiDB Usage Environment] Production Environment

[TiDB Version]
v4.0.13
[Reproduction Path] What operations were performed
No operations were performed. A few days ago, a large number of TiKV_scheduler_command_duration_seconds (1s threshold) alerts started appearing.
[Encountered Problem: Problem Phenomenon and Impact]
Currently, there is no impact on the business
[Resource Configuration]
See the screenshot below
[Attachment: Screenshot/Log/Monitoring]

| username: 路在何chu | Original post link

I looked into the increase in this metric and am considering adjusting the scheduler-concurrency parameter. Has anyone adjusted it before?

| username: dba远航 | Original post link

Increasing scheduler-concurrency can speed up scheduler execution.
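
For reference, scheduler-concurrency lives under the `[storage]` section of the TiKV configuration file. A minimal sketch (the default shown is the v4.x default; verify the current value for your version before changing it):

```toml
[storage]
# Number of latch slots in the scheduler. More slots mean less latch
# contention on hot keys (v4.x default: 2048000).
scheduler-concurrency = 2048000
# Worker threads that execute scheduler commands (default: 4).
scheduler-worker-pool-size = 4
```

Note that raising these values trades memory and CPU for reduced latch wait time, so change them incrementally and watch the scheduler monitoring panels.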

| username: Jellybean | Original post link

A high value for this metric indicates that Prewrite, the first phase of the distributed transaction's two-phase commit, is relatively slow.

This phase involves two tasks:

  • MVCC multi-version check
  • Lock conflict detection

Therefore, the troubleshooting approach is roughly as follows:

  1. First, confirm whether the cluster has a slow-write issue. This can be analyzed from overall cluster latency, gRPC latency, and the TiKV-Details monitoring panel.
    • If there is a slow-write issue, optimize it.
  2. Check whether the business SQL has many lock conflicts. This can be confirmed from the backoff-related monitoring on the TiDB panel.
    • If there are, reduce business concurrency or optimize the SQL.
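
One way to confirm step 2 is to query the slow-query tables for statements that backed off on lock conflicts. A hedged sketch (the `cluster_slow_query` table and the backoff columns are available from v4.0 on; column names follow the slow-query log format):

```sql
-- Find slow statements that spent time backing off on pessimistic/optimistic
-- lock conflicts (txnLock backoff indicates waiting on another transaction's lock).
SELECT query, backoff_types, backoff_time, process_time
FROM information_schema.cluster_slow_query
WHERE backoff_types LIKE '%txnLock%'
ORDER BY backoff_time DESC
LIMIT 10;
```

Statements that appear here repeatedly point to the tables and access patterns where the contention is concentrated.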
| username: 路在何chu | Original post link

However, we haven’t noticed any particularly slow writes, and there hasn’t been much increase in slow logs. Additionally, the business side hasn’t reported any issues. They usually notify us if there’s even a slight slowdown.

| username: Jellybean | Original post link

The value of scheduler_latch_wait_duration is generally at the microsecond level, but your cluster screenshot shows it has reached the millisecond level.

Focus on analyzing the TiDB monitoring panel and check the situation in the corresponding KV Errors, especially the monitoring related to KV Backoff.

| username: 路在何chu | Original post link

It seems like there is no change.

| username: 路在何chu | Original post link

The KV Backoff alert has been turned off.

| username: 路在何chu | Original post link

I looked at the dashboard, and recently this SQL was added. Is it related to this “for update”?

| username: 路在何chu | Original post link

This one is also high on the 21st.

| username: Jellybean | Original post link

The picture clearly shows that there is a high level of lock contention. The business side should be aware of this. Confirm the specific SQL and then optimize the business access method, appropriately reduce concurrency, or optimize the SQL.

| username: tidb菜鸟一只 | Original post link

FOR UPDATE generally tends to cause lock conflicts. Confirm with the developers why they are using it.
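
To illustrate the conflict pattern (the table and column names here are hypothetical): in pessimistic mode, a plain SELECT ... FOR UPDATE queues on the row lock, while the NOWAIT variant (supported since TiDB v4.0) returns an error immediately instead of waiting:

```sql
-- Blocks until the row lock is released (or the lock wait timeout fires),
-- which is what shows up as latch/lock wait time under contention.
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;

-- Fail fast instead of queueing, letting the application retry or back off.
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE NOWAIT;
```

Whether failing fast is acceptable depends on the business logic; the more fundamental fix is usually to shorten the transactions holding the lock or to spread writes off the hot rows.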

| username: 路在何chu | Original post link

Yes, it was caused by this statement, averaging 2 seconds. After migrating this specific business to MySQL, the alerts disappeared.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.