TiDB Write Performance is Extremely Slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB写入超级慢 ("TiDB writes are extremely slow")

| username: TiDBer_WSck6fg0

To improve efficiency, please provide the following information. A clear problem description will help solve the issue faster:

【TiDB Usage Environment】
TiDB v5.4.0, TiKV configuration: 20 vCPU, 256GB memory, regular SATA mechanical hard drives

【Overview】 Scenario + Problem Overview
A cluster with 3 TiDB, 3 TiKV, and 2 TiFlash nodes. Write speed is around 200 records per second, and Disk Latency is 5~10ms. Another cluster with the same TiDB v5.4.0 and the same TiKV configuration (20 vCPU, 256GB memory, regular SATA mechanical hard drives) is 4~6 times faster than this one.

【Background】 Actions taken
Newly deployed cluster

【Phenomenon】 Business and database phenomena
Writes to the business table are extremely slow; the database's Disk IO Utilization is 58%~98%, averaging 84%.

【Problem】 Current issue encountered
How can this be solved? I have no clue where the problem lies. PS: I stopped all read/write operations on the entire cluster, but writing to a table without a primary key is still slow.

【Business Impact】
Write speed stays low and data latency is too long.

【TiDB Version】
TiDB v5.4.0

【Application Software and Version】

【Attachments】 Relevant logs and configuration information

Monitoring (https://metricstool.pingcap.com/)

| username: TiDBer_jYQINSnf | Original post link

Check the RocksDB compaction traffic through Grafana. Is compaction consuming a significant amount of I/O?
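
For example, a rough way to pull the compaction write flow without opening Grafana is to query Prometheus directly. This is a minimal sketch; the Prometheus address is a placeholder and the metric name may differ between TiKV versions (the same data is in the Compaction flow panel under TiKV-Details > RocksDB):

```bash
# Placeholder: point at the cluster's Prometheus (deployed on machine 222 in this thread).
PROM="http://127.0.0.1:9090"

# Per-TiKV-instance compaction write flow over the last minute, in bytes/s.
curl -s "$PROM/api/v1/query" --data-urlencode \
  'query=sum(rate(tikv_engine_compaction_flow_bytes{type="bytes_written"}[1m])) by (instance)'
```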

| username: Meditator | Original post link

Let’s check the load on machine 222. It seems that the high load on this machine is dragging down the throughput of the entire cluster. The deployment of Prometheus and Grafana on machine 222 is causing an even higher load.

| username: xfworld | Original post link

If you want fast writes, you should follow the official recommendation and use NVMe disks.

Ordinary SATA mechanical hard drives are too slow. If you use such disks, you will have to accept this processing speed…
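
For reference, one way to put a number on the gap is a short fio run against each cluster's TiKV data disk. fio and the paths below are my own suggestion, not something from the thread; compare the reported IOPS and latency between the two clusters:

```bash
# Placeholder directory on the TiKV data disk; needs a few GB of free space.
mkdir -p /data/fio-test

# 60 seconds of 16K random writes with O_DIRECT, 4 concurrent jobs.
fio --name=tikv-randwrite --directory=/data/fio-test \
    --rw=randwrite --bs=16k --size=2G --numjobs=4 \
    --ioengine=psync --direct=1 --runtime=60 --time_based \
    --group_reporting
```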

| username: banana_jian | Original post link

I see that your CDC node’s IO is a bit high. Are you still running CDC synchronization?

| username: TiDBer_WSck6fg0 | Original post link

CDC is currently stopped.

| username: TiDBer_WSck6fg0 | Original post link

The other cluster also uses mechanical disks, but it's not this slow. Our requirements are not very high, yet this one is really, really slow.

| username: TiDBer_WSck6fg0 | Original post link

Disk IO Utilization is now 89%. I checked the Disk Latency of other machines, which is 5~10ms. Could this be related? All the machines are like this.
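
A quick way to confirm this on each host is iostat from the sysstat package; watch the latency (await) and %util columns for the TiKV data disk while the write workload is running. A minimal sketch (interval and sample count are arbitrary):

```bash
# Extended device statistics, 1-second interval, 10 samples.
iostat -x 1 10
```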

| username: 长安是只喵 | Original post link

The IO is at 100%; the disk write speed probably can't keep up~

| username: xfworld | Original post link

Check the physical condition of the disk, and if possible, try replacing it.

| username: Meditator | Original post link

  1. Firstly, this is the barrel (weakest-link) effect: the high load on one node drags down the throughput of the whole cluster.
  2. Secondly, these are SATA disks with relatively low IOPS, so the IO is easily saturated, which makes them unsuitable for deploying TiDB.
  3. Lastly, the officially recommended configuration is NVMe SSDs.

If you want to prove the first point, you can stop all monitoring processes on 222.
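
If the cluster is managed by TiUP, stopping just the monitoring components could look like the sketch below; my-cluster is a placeholder name, and the alternative -N form takes the actual host:port node IDs of the 222 machine:

```bash
# Stop only the monitoring components cluster-wide.
tiup cluster stop my-cluster -R prometheus,grafana,alertmanager

# Alternative: stop specific nodes by host:port, e.g. the ones on machine 222.
# tiup cluster stop my-cluster -N <host>:9090,<host>:3000
```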

| username: TiDBer_WSck6fg0 | Original post link

Okay, I’ll give it a try. Thanks for the explanation.

| username: TiDBer_WSck6fg0 | Original post link

Another point: I stopped all read and write operations, so there was no read or write traffic at all. Then I created a table without a primary key and wrote to it on its own, and it was still extremely slow. How do you explain that?
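
For anyone trying to reproduce that isolated test, a minimal sketch might look like the following; the host, port, user, and database are placeholders, and the row count is just large enough to make the slowdown obvious:

```bash
# Placeholders: TiDB on 127.0.0.1:4000, user root without a password, database test.
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "CREATE TABLE IF NOT EXISTS test.t_no_pk (c1 BIGINT, c2 VARCHAR(64));"

# 1000 single-row inserts over one connection; at ~200 rows/s this takes ~5 seconds.
for i in $(seq 1 1000); do
  echo "INSERT INTO t_no_pk VALUES ($i, 'x');"
done > /tmp/t_no_pk_inserts.sql

time mysql -h 127.0.0.1 -P 4000 -u root test < /tmp/t_no_pk_inserts.sql
```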

| username: h5n1 | Original post link

Disk performance is poor, at around 150 IOPS with 100% utilization. How many TiKV disks are there? What is the RAID level and the RAID cache configuration?

| username: 长安是只喵 | Original post link

Could you enable the RAID card cache? I've seen similar operations in the community.

| username: TiDBer_jYQINSnf | Original post link

The key point isn't simply that this disk is slow, right? You have two clusters, both on mechanical disks, yet one is fast and the other is slow. Is that the case?

| username: TiDBer_WSck6fg0 | Original post link

Yes.

| username: TiDBer_WSck6fg0 | Original post link

There are 6 disks, the RAID level is 5, and I'm not sure how the cache is configured.
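
If it is a Broadcom/LSI controller managed with storcli or MegaCli, a minimal sketch for checking the write-cache policy is below (other vendors ship their own tools); look for WriteBack vs WriteThrough in the output:

```bash
# storcli: /c0 = first controller, vall = all virtual drives.
storcli /c0/vall show all | grep -i cache

# Older MegaCli syntax for the same information:
# MegaCli64 -LDInfo -Lall -aALL | grep -i "cache policy"
```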

| username: TiDBer_jYQINSnf | Original post link

My idea is this:
First, check TiKV's own write traffic, then see whether the IO on the machines with saturated disks all comes from TiKV writes. If some other process is writing heavily, that's your culprit.

From the monitoring, the IO is completely saturated. If no other processes are writing and the TiKV write traffic is not higher than on the other cluster, then the problem is the machine's disks. Stop the cluster and stress test with sysbench, and check whether there are any hardware differences between the two clusters.
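
A minimal sketch of that sysbench run, executed on the TiKV data disk with the cluster stopped; sizes, paths, and durations are placeholders, and the same commands should be run on the fast cluster for comparison:

```bash
# Placeholder working directory on the TiKV data disk.
mkdir -p /data/sysbench-test && cd /data/sysbench-test

# Create the test files.
sysbench fileio --file-total-size=32G --file-num=64 prepare

# 2 minutes of random writes with O_DIRECT; compare IOPS and latency across clusters.
sysbench fileio --file-total-size=32G --file-num=64 \
  --file-test-mode=rndwr --file-extra-flags=direct \
  --time=120 --report-interval=10 run

sysbench fileio --file-total-size=32G --file-num=64 cleanup
```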