Cluster Write Slow - resolve_lock_lite

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 集群写入慢-resolve_lock_lite

| username: magongyong

【TiDB Usage Environment】Production Environment
【TiDB Version】6.5.5
【Reproduction Path】Operations performed that led to the issue
The two clusters are configured as primary and secondary of each other (bidirectional replication), synchronized with TiCDC, and run normal business traffic.

【Encountered Issue: Symptoms and Impact】
The TiKV performance of cluster t1 is good, while the TiKV performance of cluster t2 is only 50% of that of cluster t1.
Cluster t2 has been constantly alerting, and the write performance is very poor. The alerts are shown in the image below:

Upon checking the monitoring, the KV resolve lock duration reaches over 1 second.

Checking Prometheus, all the request types with values greater than 1 second are resolve_lock_lite.

Although this cluster's hardware is slightly weaker, it still has 12 TiKV instances deployed on 9 physical machines, all with NVMe disks, yet write performance is very poor.
How can we solve the slow-write issue, and how can we reduce resolve_lock_lite, the lightweight lock-resolving request?

【Resource Configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】
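
Since resolve_lock_lite is the lightweight request TiDB sends to clear a single leftover lock it runs into, a first check is whether cluster t2 has long-running or large transactions that keep leaving locks behind. A minimal diagnostic sketch, assuming the transaction tables documented for v6.5 (table and column names taken from the TiDB docs; adjust if they differ in your build):

```sql
-- Look for long-running transactions on t2; these commonly leave locks
-- that force other requests into resolve_lock / resolve_lock_lite.
SELECT session_id, start_time, state, current_sql_digest_text
FROM information_schema.cluster_tidb_trx
ORDER BY start_time
LIMIT 10;
```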

| username: h5n1 | Original post link

Check the monitoring of tikv-detail → thread CPU, scheduler-commit → latch wait duration, and disk-performance for this cluster, as well as the value of the scheduler-worker-pool-size parameter.
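
A quick way to read the parameter values without opening Grafana is SHOW CONFIG (available since v4.0). A sketch, assuming the standard TiKV configuration item names:

```sql
-- Inspect the TiKV scheduler settings mentioned above.
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.scheduler-worker-pool-size';
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.scheduler-concurrency';
```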

| username: tidb狂热爱好者 | Original post link

In practice, poor TiDB write performance is most often caused by the SQL itself.

| username: tidb狂热爱好者 | Original post link

Take a look at the top SQL and resolve the slowest one.
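
If Top SQL alone is not enough, the statement summary tables give the same ranking in SQL form. A sketch, assuming the cluster_statements_summary table available since v4.0 (column names per the docs):

```sql
-- Rank statement digests on the cluster by average latency.
SELECT digest_text, exec_count, avg_latency, sum_latency
FROM information_schema.cluster_statements_summary
ORDER BY avg_latency DESC
LIMIT 10;
```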

| username: 连连看db | Original post link

The default value of tidb_dml_batch_size is 2000.
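
For reference, the v6.5 docs list the default of this variable as 0, so it is worth verifying what t2 is actually using before tuning it. A sketch:

```sql
-- Check the current value first; on v6.x this variable is documented to take
-- effect only together with the deprecated batch-DML switches.
SHOW VARIABLES LIKE 'tidb_dml_batch_size';
-- Adjust for the current session only, as an experiment:
SET SESSION tidb_dml_batch_size = 2000;
```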

| username: magongyong | Original post link

(The reply contained only a screenshot, which is not available in this translation.)

| username: magongyong | Original post link

(The reply contained only a screenshot, which is not available in this translation.)

| username: magongyong | Original post link

The value of the scheduler-worker-pool-size parameter is 10.

| username: magongyong | Original post link

(The reply contained only a screenshot, which is not available in this translation.)

| username: magongyong | Original post link

(The reply contained only a screenshot, which is not available in this translation.)

| username: magongyong | Original post link

(The post contained only a screenshot, which is not available in this translation.)
| username: h5n1 | Original post link

The raftstore thread is very busy. Try setting raftstore.store-pool-size to 4 in the TiKV configuration first. How high is the overall CPU utilization of TiKV?
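
If the TiKV version allows changing this item online, the suggestion can be applied with SET CONFIG; otherwise change it with `tiup cluster edit-config` and reload the TiKV nodes. A sketch:

```sql
-- Apply the suggested value as an online config change (SET CONFIG exists
-- since v4.0; if this item is not modifiable online in your version, use
-- tiup cluster edit-config plus a TiKV reload instead).
SET CONFIG tikv `raftstore.store-pool-size` = 4;
-- Confirm the new value:
SHOW CONFIG WHERE type = 'tikv' AND name = 'raftstore.store-pool-size';
```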

| username: magongyong | Original post link

(The reply contained only a screenshot, which is not available in this translation.)

| username: h5n1 | Original post link

The scheduler latch wait in the earlier screenshot is very different from this one; are they not from the same cluster? You can try increasing scheduler-concurrency to 4096000.
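
The same pattern works for the latch setting, assuming the item can be changed online on this version (fall back to tiup edit-config and a reload if not):

```sql
-- Raise the scheduler latch slot count (storage.scheduler-concurrency,
-- default 524288) to the suggested 4096000.
SET CONFIG tikv `storage.scheduler-concurrency` = 4096000;
```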

| username: magongyong | Original post link

raftstore.store-pool-size is set to 12.

| username: 江湖故人 | Original post link

Can you check the metrics relationship diagram for TiKV CPU?
See "TiDB Dashboard Metrics Relationship Diagram" in the PingCAP Documentation Center.

| username: magongyong | Original post link

It is the same cluster; the following is a screenshot of the Scheduler - acquire_pessimistic_lock panel.

| username: magongyong | Original post link

Error occurred

| username: magongyong | Original post link

The overall CPU utilization did not go up, with occasional peaks on the TiFlash server, while TiKV is around 50%.

| username: h5n1 | Original post link

First, try adjusting the latch parameter (scheduler-concurrency).