How to Analyze TiKV_async_request_write_duration_seconds Alert

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV_async_request_write_duration_seconds告警如何分析

| username: Soysauce520

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5.2
[Encountered Problem: Problem Phenomenon and Impact] When running LOAD DATA at night, a TiKV_async_request_write_duration_seconds alert is triggered.
[Resource Configuration] All TiKV nodes use NVMe disks, and no bottleneck is observed in write IO.
Dear experts, how can we avoid such alerts? If we raise the threshold, slow writes during daytime transactions might go unnoticed. Can we adjust the Raft log parameters to solve this?

| username: buddyyuan | Original post link

Based on experience:
This type of issue is generally caused by a hotspot or a large transaction. First check the Scheduler worker CPU in the TiKV thread panel to see whether the load is balanced across nodes.
The second possibility is disk jitter, for which you need to check the disk IO monitoring.
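
To make the "are they balanced" check concrete, here is a minimal sketch. It assumes you have already read one Scheduler worker CPU value per TiKV instance off the monitoring panel. The function names and the 1.5x threshold are hypothetical choices for illustration, not anything TiKV defines.

```python
from statistics import mean


def cpu_imbalance(cpu_by_instance: dict[str, float]) -> float:
    """Return max/mean of per-instance CPU usage; 1.0 means perfectly balanced."""
    values = list(cpu_by_instance.values())
    avg = mean(values)
    if avg == 0:
        return 1.0  # no load at all, so nothing is unbalanced
    return max(values) / avg


def hot_instances(cpu_by_instance: dict[str, float], ratio: float = 1.5) -> list[str]:
    """List instances whose CPU exceeds `ratio` times the average (hypothetical threshold)."""
    avg = mean(cpu_by_instance.values())
    if avg == 0:
        return []
    return sorted(name for name, v in cpu_by_instance.items() if v > ratio * avg)
```

If one instance stands out, that points toward a hotspot. If all instances are equally busy, a large transaction or an overall bottleneck is the more likely cause.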

| username: Soysauce520 | Original post link

In the second case, the disk has been checked and there are no issues.
In the first case, if it is confirmed to be a large transaction, how can it be handled without splitting the transaction? The business is unwilling to split it and cannot be persuaded. Can it be resolved on the database side?

| username: TiDBer_jYQINSnf | Original post link

Check if the load is high on multiple nodes or just a single node. If multiple nodes are high, it means you’ve hit a bottleneck. If it’s just a single node, you might consider adjusting the regions.

Also, does the load require transactions? If transactions are not required, you can modify the batch and commit in smaller, quicker steps.
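
As a rough illustration of "smaller, quicker steps", the sketch below splits a large row set into fixed-size batches and commits after each one. It assumes a generic Python DB-API 2.0 connection. `insert_in_batches`, `chunked`, and the batch size of 500 are hypothetical names and values. Note that this gives up atomicity across the whole load, so it only fits when the load does not need to be a single transaction.

```python
from typing import Iterator, Sequence


def chunked(rows: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_in_batches(conn, sql: str, rows: Sequence, batch_size: int = 500) -> int:
    """Run `sql` for every row, committing once per batch.

    `conn` is any DB-API 2.0 connection; `sql` must use that driver's
    placeholder style. Returns the number of rows committed.
    """
    committed = 0
    cur = conn.cursor()
    try:
        for batch in chunked(rows, batch_size):
            cur.executemany(sql, batch)
            conn.commit()  # each batch becomes its own small transaction
            committed += len(batch)
    finally:
        cur.close()
    return committed
```

A smaller batch size means smaller transactions and shorter write bursts on TiKV, at the cost of more round trips.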

| username: Soysauce520 | Original post link

I understand that this works for LOAD DATA, but what about INSERT? There is a single INSERT with 10,000 rows of values. :disappointed_relieved:

| username: TiDBer_jYQINSnf | Original post link

If this is not changed, it will be difficult to handle. If it can be changed, there is BATCH, which can also perform non-transactional writes.

| username: TiDBer_aaO4sU46 | Original post link

Try increasing the value of the raft_log_gc_threshold parameter to reduce the frequency of Raft log cleanup.

| username: Kongdom | Original post link

Here is an optimization method for reference:

| username: Soysauce520 | Original post link

The batch scenario is too limited.

| username: Soysauce520 | Original post link

I understand that this is due to slow Raft log synchronization. Wouldn’t reducing the cleanup frequency make the synchronization even slower?

| username: Soysauce520 | Original post link

I have read it. I just wanted to ask whether anyone has changed the raftstore.store-pool-size parameter, and how it affects large transactions.

| username: Kongdom | Original post link

I haven’t changed it, but I saw someone in the column who has modified it.

| username: Soysauce520 | Original post link

Currently, I haven’t seen any options to adjust the size of the raft log (similar to the size of Oracle redo). You can only try adjusting the number of CPUs for the store.

| username: redgame | Original post link

Can the GC trigger threshold for Raft logs be adjusted?

| username: Soysauce520 | Original post link

May I ask, what is the specific name of this parameter?

| username: 有猫万事足 | Original post link

Based on your description, if neither IO nor network is the bottleneck, then you should indeed try increasing the number of CPUs. Under normal conditions, at least one of IO or network should be fully utilized before writes slow down.

| username: Soysauce520 | Original post link

Yes, this alert should be scenario-based. Large-scale data writes at night can be ignored, but during the day, when it’s OLTP, slow writes become an issue. Increasing the CPU is just one countermeasure. For HTAP databases, there should be two sets of alerts.

| username: 有猫万事足 | Original post link

I searched, and indeed, Alertmanager does not have a method for periodic silencing of alerts.
It can only be handled as follows.

Requirement: I need to mute certain alerts during a specific time period every day

For example: There is a remote backup at 4 AM, causing traffic alerts. I want to mute these alerts from 4:00 to 4:30 AM every day.

Approach:

There is no direct way to do this, so take an indirect approach:

(1) Set up a scheduled task that runs at 4 AM every day

(2) In the scheduled task, use the Alertmanager API to create the silence
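
The two steps above could be sketched as follows. This is a minimal example, assuming Alertmanager exposes its v2 HTTP API at `http://127.0.0.1:9093` (a placeholder address) and that the alert is matched by its `alertname` label. `build_silence` and `create_silence` are hypothetical helper names.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: adjust to the address of your own Alertmanager.
ALERTMANAGER_URL = "http://127.0.0.1:9093"


def build_silence(alertname: str, start: datetime, minutes: int,
                  created_by: str = "cron",
                  comment: str = "scheduled silence") -> dict:
    """Build the JSON body for a silence lasting `minutes` from `start`.

    `start` should be timezone-aware; times are sent as UTC timestamps.
    """
    end = start + timedelta(minutes=minutes)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": start.astimezone(timezone.utc).isoformat(),
        "endsAt": end.astimezone(timezone.utc).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


def create_silence(payload: dict, base_url: str = ALERTMANAGER_URL) -> str:
    """POST the silence to Alertmanager and return the new silence ID."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["silenceID"]

# Example call from the scheduled task (not executed here):
# create_silence(build_silence("TiKV_async_request_write_duration_seconds",
#                              datetime.now(timezone.utc), 30))
```

Running the example call from a cron entry at the start of the nightly load would silence the alert for 30 minutes. The silence expires on its own, so no cleanup step is needed.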
