YCSB Benchmark on TiKV Cluster: Disk IO Utilization at 80%, Low CPU and Memory Usage, Write Concurrency Not Improving

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV集群YCSB压测,磁盘IO Util利用率80%,CPU和内存等系统资源利用率很低,写入并发量上不去

| username: Jefrey

[Overview]
Currently, a TiKV cluster is set up using three high-configuration machines in the same data center, each with 64 cores, 128GB RAM, and 2TB NVMe SSDs. The cluster information is as follows:

Each disk has been benchmarked using the fio tool, with 64KB random writes averaging IOPS=21.4k, indicating no performance issues.
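For context, a fio command along these lines reproduces that kind of 64KB random-write test (the file path, size, queue depth, and runtime below are assumptions, not the exact parameters used):

fio --name=randwrite --filename=/data/fio-test --direct=1 --ioengine=libaio --rw=randwrite --bs=64k --iodepth=32 --numjobs=4 --size=10G --runtime=60 --time_based --group_reporting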

[TiKV Version]
v6.5.5

[Issue]

Using three high-configuration machines, each running the following YCSB benchmark command twice (total benchmark concurrency = 64 threads × 2 runs × 3 machines = 384), TiKV write QPS stays at about 28.5K ops/s regardless of the concurrency level, with per-disk IOPS only around 3k; the disk IOPS cannot be pushed any higher.

tiup bench ycsb run tikv -p tikv.pd="http://10.104.16.60:2379,http://10.104.16.200:2379,http://10.104.18.60:2379" -p tikv.type="raw" -p recordcount=10000000 -p operationcount=30000000 -p insertproportion=1 --threads 64

Meanwhile, system CPU, memory, and IO are largely idle; no system resource comes close to its bottleneck.

I have gone through the cluster optimization and tuning documentation, but it has not helped. The TiKV configuration parameters are mostly the defaults, and increasing the thread counts earlier had no effect.

Thread CPU monitoring data.

Seeking help: what is preventing the write concurrency from increasing, and how can it be resolved?

| username: Billmay表妹 | Original post link

You didn’t install TiDB? Only TiKV in RawKV mode?

| username: Jefrey | Original post link

Yes, we only need to use the TiKV service, combined with JuiceFS to create a unified storage platform.

| username: h5n1 | Original post link

What about the network latency between the TiKV nodes? Check the Blackbox Exporter metrics.
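If the Blackbox Exporter panels are not handy, a quick manual check between the TiKV hosts also gives a rough idea (IPs taken from the benchmark command in the original post):

ping -c 20 10.104.16.200   # run from 10.104.16.60, and repeat between the other node pairs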

| username: 像风一样的男子 | Original post link

You can try increasing the threads to 1000.

| username: 啦啦啦啦啦 | Original post link

It seems that a high IO Util for NVMe disks does not necessarily mean that the disk has reached its bottleneck. Try increasing the concurrent write threads to see if it can further improve QPS.
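To see what the disk is actually doing rather than relying on %util, watching the per-device stats directly may help (the device name nvme0n1 is an assumption):

iostat -x -d nvme0n1 1   # on NVMe, w/s, wkB/s, and w_await say more than %util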

| username: 小龙虾爱大龙虾 | Original post link

Check whether the response time meets expectations. If it does, increase the concurrency; if that does not help, add more load-generator machines to apply pressure.

| username: pingyu | Original post link

--threads 64 is too small. Try changing it to 600.

| username: zhanggame1 | Original post link

Increase the concurrency to 500 or more; a few dozen threads is nowhere near enough to apply real pressure.

| username: Jellybean | Original post link

First, check the performance of the TiKV cluster to see if there are any bottlenecks. Identifying the bottlenecks will make it easier to optimize.

| username: Jefrey | Original post link

After adjusting the benchmark client concurrency and retesting with 1500 threads (raising the concurrency to 2000 only increases latency without improving write QPS), the CPU usage on the benchmark client machines stays below 10%, so the clients themselves are not the bottleneck.

[root@dx-lt-yd-zhejiang-jinhua-5-10-104-4-29 ~]# tiup bench ycsb run tikv -p tikv.pd="http://10.104.16.60:2379,http://10.104.16.200:2379,http://10.104.18.60:2379" -p tikv.type="raw" -p recordcount=10000000 -p operationcount=30000000 -p insertproportion=1 -p readproportion=0 -p updateproportion=0 --threads 500

[root@dx-lt-yd-zhejiang-jinhua-5-10-104-4-30 ~]# tiup bench ycsb run tikv -p tikv.pd="http://10.104.16.60:2379,http://10.104.16.200:2379,http://10.104.18.60:2379" -p tikv.type="raw" -p recordcount=10000000 -p operationcount=30000000 -p insertproportion=1 -p readproportion=0 -p updateproportion=0 --threads 500

[root@dx-lt-yd-zhejiang-jinhua-5-10-104-4-31 ~]# tiup bench ycsb run tikv -p tikv.pd="http://10.104.16.60:2379,http://10.104.16.200:2379,http://10.104.18.60:2379" -p tikv.type="raw" -p recordcount=10000000 -p operationcount=30000000 -p insertproportion=1 -p readproportion=0 -p updateproportion=0 --threads 500

The benchmark results are basically the same as with the previous total concurrency of 64 × 2 × 3 = 384: QPS did not increase, indicating that the issue is not insufficient concurrency on the load-generation side.

IO Util around 60%

Thread CPU monitoring data.


After increasing the benchmark concurrency to 1500, the average latency rose to about 50 ms, whereas previously it was within 10 ms. Increasing concurrency only increased latency.
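A rough sanity check with these numbers: throughput ≈ concurrency / latency, so 1500 threads / 0.05 s ≈ 30,000 ops/s, which matches the observed ~28.5K QPS; the extra threads are only queuing, not adding throughput.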

| username: Jefrey | Original post link

After increasing the concurrency roughly 4x to 1500, the QPS did not change and the write latency increased.

| username: Jefrey | Original post link

From the current TiKV-Details Thread CPU monitoring panels, none of the internal thread pools appears to have reached its configured limit.

| username: 像风一样的男子 | Original post link

Could you log in to the dashboard and check if there is a severe hotspot write issue?
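Since there is no TiDB node in this deployment, the write hotspot distribution can also be pulled from pd-ctl; something along these lines should work (the exact tiup ctl invocation here is an assumption for this cluster):

tiup ctl:v6.5.5 pd -u http://10.104.16.60:2379 hot write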

| username: h5n1 | Original post link

Take a look at TiDB → query summary → connection idle duration

| username: Jefrey | Original post link

In my environment, there are only TiKV and PD service components.

| username: Jefrey | Original post link

Is it this monitoring graph?

| username: h5n1 | Original post link

Try increasing these two parameters in TiKV:

raftstore.raft-max-inflight-msgs: 1024
raftstore.store-max-batch-size: 2048

I have a feeling that there might be something wrong with your disk. Although SSDs shouldn’t be judged mainly by IO utilization, having 60% utilization at just 1.5k IOPS seems unusual.
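For reference, a minimal way to apply the two raftstore settings above with tiup (the cluster name tikv-test is a placeholder):

tiup cluster edit-config tikv-test    # add the two settings under server_configs.tikv
tiup cluster reload tikv-test -R tikv    # rolling restart of the TiKV nodes to apply them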

| username: Jefrey | Original post link

Following the suggestion, I adjusted the configuration parameters and restarted the TiKV cluster.

The benchmark data did not change significantly.