Abnormal Duration of Various Locks in TiKV, TiDB GC Unable to Execute Properly

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV各种锁持续时间异常,TIDB GC无法正常执行

| username: TiDBer_27OdodiJ

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.0
[Reproduction Path] None
[Encountered Issues: Problem Phenomenon and Impact]
Issues:

  1. TiKV keeps firing alerts that cannot be cleared: [TiKV scheduler latch wait duration seconds more than 1s] and [TiKV scheduler context total] (a query for the underlying metric is sketched below this list)
  2. The TiDB GC process cannot proceed normally
  3. Client logs report "TiKV server is busy"
    =========
    My own investigation so far:
    This cluster was put into use as a JuiceFS metadata store yesterday, and the only clients are JuiceFS clients. The issues appeared after it had been running for a while.
    I searched the official documentation on handling these alerts but could not find anything that matches the problems I am seeing.
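For reference, the [TiKV scheduler latch wait duration seconds more than 1s] alert is based on the tikv_scheduler_latch_wait_duration_seconds histogram, which measures how long write commands wait on in-memory latches in the TiKV scheduler, i.e. how badly writes are queuing behind other writes to the same keys. A minimal sketch for pulling that quantile straight from Prometheus; `<prometheus-host>` is a placeholder and the default port 9090 is assumed:

```bash
# Query the 99th-percentile scheduler latch wait per TiKV instance.
# <prometheus-host> is a placeholder; the alert fires when this value exceeds 1s.
curl -G 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(tikv_scheduler_latch_wait_duration_seconds_bucket[1m])) by (le, instance))'
```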

[Resource Configuration]
3 physical machines, each with 2 NVMe disks; 2 TiKV instances are deployed on each machine
[Attachments: Screenshots/Logs/Monitoring]
===============================Related Monitoring Screenshots====================

  1. Machine Performance Monitoring:
    Overall machine resource usage is not high

  2. gRPC Related Monitoring:
    Various lock wait durations are abnormally long

  3. CPU Monitoring of Each Component:
    The scheduler CPU usage of one TiKV instance is consistently higher than that of the other 5 TiKV instances

  4. GC Related Panel:
    The GC safe point is stuck at a very old timestamp
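For the stuck GC, the GC worker's bookkeeping can be read from the mysql.tidb table on any TiDB server; it shows which TiDB instance currently owns GC, when GC last ran, and the current safe point. A rough sketch, with the host and credentials as placeholders:

```bash
# Inspect TiDB GC bookkeeping; <tidb-host>, user and password are placeholders.
# tikv_gc_leader_desc   -> which TiDB instance currently runs the GC worker
# tikv_gc_last_run_time -> when GC last ran
# tikv_gc_safe_point    -> the safe point that appears stuck in the panel
mysql -h <tidb-host> -P 4000 -u root -p -e \
  "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME LIKE 'tikv_gc%';"
```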

==============================Related Log Screenshots=================
TiDB Logs:
The "server is busy" errors all reference the same region ID
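Since all the "server is busy" errors point at one region, it may be worth pulling that region's metadata from PD to see its key range, leader store, and peers, and checking whether it is also among the top write regions. A sketch using pd-ctl via tiup; `<pd-host>` and `<region-id>` are placeholders for the real values from the logs:

```bash
# Show the metadata (key range, leader, peers) of the region from the logs.
tiup ctl:v5.4.0 pd -u http://<pd-host>:2379 region <region-id>

# List the 10 regions with the highest write flow to see whether that region is a hotspot.
tiup ctl:v5.4.0 pd -u http://<pd-host>:2379 region topwrite 10
```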

GC Related Logs:

TiKV Scheduler CPU Abnormal Logs:
There are a large number of warning logs like the following:

Region Information Involved in Logs:

| username: JonnieLee | Original post link

Did it get better after a restart or did it resolve itself after a while? Did you perform any large data processing during this period?

| username: xfworld | Original post link

Have you checked the status of the existing cluster first?

| username: TiDBer_27OdodiJ | Original post link

Only restarted the TiDB service, but it had no effect.

| username: TiDBer_27OdodiJ | Original post link

The service status of all cluster components is normal.

| username: JonnieLee | Original post link

Are there a lot of reads and writes or rollbacks? Is this a production database or a test database? You might need to restart TiKV.

| username: xfworld | Original post link

Try checking the hotspot traffic through the dashboard.

A high number of data locks indicates conflicts at the business level.

For region [15025515], I suggest you check which table it is associated with and why it is causing such severe conflicts.


Does this version meet the business expectations?
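For the region-to-table check suggested above, one option (a sketch; host and credentials are placeholders) is to query information_schema from any TiDB server. In this cluster the region will probably not map to any table because the keys are written by JuiceFS directly, but START_KEY/END_KEY still show which key range it covers:

```bash
# Try to map region 15025515 to a table or index.
# For non-table (JuiceFS) keys, DB_NAME/TABLE_NAME come back empty,
# but START_KEY/END_KEY still show the hex-encoded key range.
mysql -h <tidb-host> -P 4000 -u root -p -e \
  "SELECT REGION_ID, DB_NAME, TABLE_NAME, IS_INDEX, START_KEY, END_KEY
     FROM information_schema.TIKV_REGION_STATUS
    WHERE REGION_ID = 15025515;"
```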

| username: TiDBer_27OdodiJ | Original post link

Production database

| username: TiDBer_27OdodiJ | Original post link

This cluster has no TiDB tables on top of it; the JuiceFS clients read and write TiKV directly.
Traffic hotspot map:
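Since JuiceFS writes to TiKV directly and no TiDB tables are involved, one more thing worth ruling out for the stuck GC safe point is a service GC safe point registered in PD (for example by a backup or CDC task) that is holding GC back. A hedged sketch, assuming your pd-ctl build has the service-gc-safepoint subcommand; `<pd-host>` is a placeholder:

```bash
# List the cluster GC safe point and all service GC safe points registered in PD.
# If any service safe point lags far behind, GC cannot advance past it.
tiup ctl:v5.4.0 pd -u http://<pd-host>:2379 service-gc-safepoint
```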