Performance Testing of a 96-Core 3-Node Bare TiKV Cluster with go-ycsb: TiKV-server CPU Not Fully Utilized

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 96核3台裸TIKV集群 go-ycsb性能压测,TIKV-server CPU拉不满

| username: TiDBer_AGzauEA1

[Test Environment] TiDB
[TiDB Version] 6.5.0
[Test Environment] TiDB
[Issue: Symptoms and Impact]



[Resource Configuration] Bare-metal cluster with 1 PD and 3 TiKV nodes

[Attachments: Screenshots/Logs/Monitoring]
The network is 25GE and the data disks are NVMe, so neither should be anywhere near its bottleneck.
How should I adjust to increase CPU usage?

What I observe at the moment is that setting readpool.unified.max-thread-count to 10 improves performance, but CPU usage remains low.
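For reference, here is a minimal sketch of how such a readpool change could be applied on a tiup-managed cluster; the cluster name tikv-bench is a placeholder, and a cluster deployed some other way would use its own mechanism:

```shell
# Sketch only: open the topology config, set the key under server_configs,
# then reload the TiKV role. readpool.unified.max-thread-count is a real
# TiKV config item; the cluster name is hypothetical.
tiup cluster edit-config tikv-bench
#   server_configs:
#     tikv:
#       readpool.unified.max-thread-count: 10
tiup cluster reload tikv-bench -R tikv
```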

| username: dba远航 | Original post link

Low CPU usage is a good thing; why force it higher? Increasing concurrency will raise it.
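For example, the workload from the original post could be re-run with a higher client thread count; the value 500 below is only an illustration:

```shell
# Same workload as in the original post, with a larger go-ycsb thread count
# (500 is an arbitrary example value).
./go-ycsb run tikv -P workloads/workloadc \
  -p tikv.pd="51.20.128.101:2379" -p tikv.type="raw" \
  -p recordcount=1000000 -p operationcount=10000000 \
  -p threadcount=500
```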

| username: 像风一样的男子 | Original post link

Stress testing is used to test the limits and weaknesses of a cluster. CPU, memory, network, and disk read/write can all potentially cause performance bottlenecks.

| username: 小于同学 | Original post link

Stress testing?

| username: changpeng75 | Original post link

How many NVMe drives does each node have?

| username: 小龙虾爱大龙虾 | Original post link

How many CPUs do you have? Check the CPU thread pool monitoring on the TiKV panel to see if it’s full.

| username: TiDBer_AGzauEA1 | Original post link

Increasing the concurrency (raising the client thread count) still doesn't improve performance. What I want now is to push the cluster to its limit and see the performance metrics under full load.

| username: TiDBer_AGzauEA1 | Original post link

Each node uses a single NVMe drive as the data disk for TiKV.

| username: TiDBer_AGzauEA1 | Original post link

The machine has two CPUs totaling 96 cores; this level of CPU usage is nowhere near enough for a stress test.

| username: 小龙虾爱大龙虾 | Original post link

Let’s take a look at the TiKV monitoring, focusing mainly on the thread CPU section.

| username: TiDBer_AGzauEA1 | Original post link

I can't upload images. The Unified read pool peaks at roughly 1300%. After setting readpool.unified.max-thread-count to 10, the peak stabilizes around 700%, and performance roughly triples, but overall CPU usage is still only about 20%.
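If screenshots can't be shared, per-thread CPU usage can also be sampled directly on a TiKV host; a rough sketch, assuming the process is named tikv-server and top is available:

```shell
# Show per-thread CPU of the tikv-server process; thread names such as
# unified-read-po-*, grpc-server-*, raftstore-* and apply-* indicate which
# pool is busy.
top -H -p "$(pgrep -x tikv-server)"
```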

| username: redgame | Original post link

Which version of the go-ycsb tool is this?

| username: changpeng75 | Original post link

That definitely won't be fully utilized. With this amount of data on a single NVMe drive, the basic operations of a TiKV node consume only limited CPU, unless computation pushdown is involved.

| username: changpeng75 | Original post link

96 cores or 96 threads? EPYC 9474F?

| username: TiDBer_aaO4sU46 | Original post link

What is this tool?

| username: TiDBer_AGzauEA1 | Original post link

You say the basic functions of a 96-core TiKV node consume limited CPU; specifically, which basic functions do you mean, and which module is the bottleneck? The observed IOPS is still far below the theoretical 8K random read/write IOPS of a single NVMe disk, isn't it?
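As a reference point, a rough 8K random-read baseline for the NVMe disk could be measured with fio; the file path, size, and queue depth below are placeholder values and should point at scratch space, not at live TiKV data:

```shell
# Hypothetical fio baseline for 8K random reads on the data disk.
fio --name=rand8k --filename=/data/fio.test --size=10G \
    --rw=randread --bs=8k --ioengine=libaio --direct=1 \
    --iodepth=64 --numjobs=4 --runtime=60 --time_based --group_reporting
```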

| username: TiDBer_AGzauEA1 | Original post link

1.0.1

| username: changpeng75 | Original post link

TiKV writes by flushing MemTables to disk, so the writes are sequential rather than random; you should therefore look at throughput (MB/s) rather than IOPS. Checking whether the disk is the bottleneck is actually quite simple: add two more disks and see whether CPU usage goes up.
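For example, throughput and IOPS on the data disk can be watched together during the run; nvme0n1 below is a placeholder device name:

```shell
# Extended iostat output in MB, sampled every 5 seconds for the given device.
iostat -xm 5 nvme0n1
```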

| username: TiDBer_5Vo9nD1u | Original post link

Running large, complex SQL.