How to Analyze and Solve the Issue of Slow Data Insertion in a Specific Table After a Period of Time?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 数据写入一段时间了,现在出现某个表写入慢的问题,如何分析解决?

| username: 宸凡_22

【TiDB Usage Environment】Production Environment / Testing / PoC
【TiDB Version】
【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issue: Issue Phenomenon and Impact】
【Resource Configuration】
【Attachments: Screenshots / Logs / Monitoring】

| username: 裤衩儿飞上天 | Original post link

One of your screenshots only shows that the write latency of a certain TiKV node is relatively high, and the other shows that the number of regions on a certain node is too large. Are these the same node? There are many possible causes of slowness; without enough information, everyone can only guess.
So, can you provide more information? For example, related monitoring, logs, relevant SQL, etc.

| username: 宸凡_22 | Original post link

There are 5 TiKV nodes in total, and they are all in the same situation.

| username: WalterWj | Original post link

Does the slowness follow a regular time pattern?

| username: 裤衩儿飞上天 | Original post link

If all TiKV nodes have these two alerts, what is the disk usage on each node? Are there many empty regions? Are the disks SSDs?

| username: 宸凡_22 | Original post link

(The original post contained only an image, which could not be translated.)

| username: 宸凡_22 | Original post link

The storage is SSD.

| username: 宸凡_22 | Original post link

(The original post contained only an image, which could not be translated.)

| username: 裤衩儿飞上天 | Original post link

Could you send the monitoring graphs for raft store CPU utilization and propose wait duration? Also, please share the TiKV logs.

| username: 裤衩儿飞上天 | Original post link

Additionally, check if the region scheduling is frequent. I see the disk usage is around 60%.
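
One way to see whether region scheduling is frequent is to query PD directly. A minimal sketch, assuming pd-ctl is run through tiup; the version tag and PD address below are placeholders, not values from this thread:

```shell
# Hypothetical PD endpoint; replace with your own cluster's PD address.
PD="http://127.0.0.1:2379"

# List the scheduling operators currently in flight
# (balance-leader, balance-region, merge-region, ...).
tiup ctl:v5.4.0 pd -u "$PD" operator show

# Show which schedulers are enabled.
tiup ctl:v5.4.0 pd -u "$PD" scheduler show

# Per-store stats: region/leader counts and scores for each TiKV store.
tiup ctl:v5.4.0 pd -u "$PD" store
```

If `operator show` keeps returning a long list of balance operators over time, scheduling is busy and worth correlating with the latency spikes in Grafana.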

| username: 宸凡_22 | Original post link

(The original post contained only an image, which could not be translated.)

| username: 宸凡_22 | Original post link

(The original post contained only an image, which could not be translated.)

| username: 裤衩儿飞上天 | Original post link

These monitoring panels:
- PD -> Region health
- PD -> Statistics - balance -> Store leader size / count / score
- PD -> Statistics - balance -> Store region size / count / score
- TiKV-Details -> Thread CPU -> Raft store CPU

Disk I/O monitoring:
- node_exporter -> Disk

| username: 宸凡_22 | Original post link

(The original post contained only an image, which could not be translated.)

| username: 裤衩儿飞上天 | Original post link

  1. Check the cluster's high-space-ratio and low-space-ratio settings. When disk usage reaches the default threshold (around 60% here), PD starts scheduling regions based on remaining space. Compare the monitoring data from the past few days to see whether there has been more scheduling than before; if so, you can raise these two parameters appropriately. In the long run, though, scaling out with more nodes is the real solution.
  2. Was there any operation around 10:30? The number of regions increased suddenly, and around 10:40 there was frequent leader scheduling; TiKV write latency was very high at that time.
  3. Please provide the TiKV logs.
  4. By the way, what type of SSD are your disks? They seem a bit slow.
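
For point 1, a sketch of how to inspect and raise the space-ratio thresholds with pd-ctl. The version tag and PD address are placeholders, and the values set below are illustrative, not recommendations from this thread:

```shell
# Hypothetical PD endpoint; replace with your own.
PD="http://127.0.0.1:2379"

# Inspect the current space-ratio settings in the scheduling config.
tiup ctl:v5.4.0 pd -u "$PD" config show | grep -E "space-ratio"

# Raise the thresholds so space-based balancing kicks in later.
# Pick values that fit your own capacity planning.
tiup ctl:v5.4.0 pd -u "$PD" config set high-space-ratio 0.75
tiup ctl:v5.4.0 pd -u "$PD" config set low-space-ratio 0.85
```
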

| username: 宸凡_22 | Original post link

You mean scale out by adding TiKV nodes?

| username: xfworld | Original post link

You can first merge the empty regions to reduce the overhead.

Then check the I/O stats of the dm-2 disk, and compare the multiple TiKV instances to see whether they are all roughly in the same range…

| username: 宸凡_22 | Original post link

How do I optimize the empty regions and reduce the overhead? Please give me some guidance.
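
A common way to shrink the number of empty regions is to make PD's region merge more aggressive via pd-ctl. A sketch, with a placeholder version tag and PD address; the parameter values are illustrative, and `enable-cross-table-merge` requires a reasonably recent TiDB version:

```shell
# Hypothetical PD endpoint; replace with your own.
PD="http://127.0.0.1:2379"

# How many empty regions are there now? (Also visible in PD -> Region health.)
tiup ctl:v5.4.0 pd -u "$PD" region check empty-region | head

# Allow small/empty regions to be merged more aggressively.
tiup ctl:v5.4.0 pd -u "$PD" config set max-merge-region-size 20
tiup ctl:v5.4.0 pd -u "$PD" config set max-merge-region-keys 200000
tiup ctl:v5.4.0 pd -u "$PD" config set merge-schedule-limit 16

# Let empty regions belonging to different tables merge with each other.
tiup ctl:v5.4.0 pd -u "$PD" config set enable-cross-table-merge true
```

After that, watch the empty-region count in the PD Region health panel; it should fall gradually as merge operators run.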

| username: 宸凡_22 | Original post link

TiKV logs
Link: 百度网盘 (Baidu Netdisk; the link no longer exists)
Extraction code: 3phz

| username: 宸凡_22 | Original post link

(The original post contained only an image, which could not be translated.)