Note:
This topic has been translated from a Chinese forum by GPT and may contain errors.
Original topic: 搭出来的Tidb集群读性能差,cpu、内存、磁盘利用率都非常低 (The TiDB cluster I set up has poor read performance; CPU, memory, and disk utilization are all very low)
Cluster topology:
Resource utilization:
Diagnostic results:
The test command is:
sysbench --config-file=config1 oltp_point_select --tables=5 --table-size=100000000 --threads=64 run
It feels like the resources are not being fully utilized. I don’t know which part of the process is going wrong.
What is the CPU usage on the sysbench server? The sysbench config can list multiple TiDB IPs directly, with no load balancer in between. If there is no bottleneck in sysbench, the load balancer, or the database, then increase the pressure: if 64 threads is not enough, raise it to 256.
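For reference, something like this would spread connections across both TiDB nodes directly (a sketch only; user, database, and runtime values are placeholders, and sysbench round-robins connections over the comma-separated mysql-host list):

# Hypothetical equivalent of the test above, pointing sysbench at both TiDB hosts
# from the topology instead of going through a load balancer.
sysbench oltp_point_select \
  --db-driver=mysql \
  --mysql-host=10.10.12.9,10.10.12.78 \
  --mysql-port=4000 \
  --mysql-user=root \
  --mysql-db=sbtest \
  --tables=5 --table-size=100000000 \
  --threads=256 --time=300 --report-interval=10 \
  run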
The CPU of the sysbench server is far from being fully utilized, and increasing the load has not brought any performance improvement.
Also, your cluster deployment is not very reasonable. Why are there only two servers for TiKV, and why is one of them running two instances?
Because I could only find two high-performance disks.
Haha, it looks like a hastily put together server.
Please send the overview screenshots of CPU and memory. Let’s first check the CPU, memory, and IO status of the server.
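For example, a quick way to eyeball host-level CPU, memory, and IO from a shell on each server (a sketch; iostat comes from the sysstat package):

# Overall CPU usage and load average
top
# Memory and swap in human-readable units
free -h
# Per-device IO utilization and await, 5 samples at 1-second intervals
iostat -x 1 5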
2 TiKV servers
TiKV + TiDB server
TiDB + PD server
It feels far from any bottleneck
The QPS of point_select measured in this way is only 30k+
That's hard to explain. Did you change any configuration?
Send the content of tiup cluster config-show to see if the configuration is as expected.
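As a sketch (tiup-cluster on the control machine is assumed, and <cluster-name> is a placeholder), the deployed topology and configuration can be pulled with:

# List managed clusters and show the topology/status of this one
tiup cluster list
tiup cluster display <cluster-name>
# Open the cluster configuration; quit the editor without saving to just view it
tiup cluster edit-config <cluster-name>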
global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /data/tidb-deploy
  data_dir: /data/tidb-data
  os: linux
  arch: arm64
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /data/tidb-deploy/monitor-9100
  data_dir: /data/tidb-data/monitor-9100
  log_dir: /data/tidb-deploy/monitor-9100/log
server_configs:
  tidb:
    log.level: error
    performance.max-procs: 32
    performance.run-auto-analyze: false
    performance.txn-total-size-limit: 10485760000
    prepared-plan-cache.enabled: true
    token-limit: 3001
  tikv:
    coprocessor.split-region-on-table: false
    log-level: error
    raftdb.max-background-jobs: 16
    raftstore.apply-max-batch-size: 1024
    raftstore.apply-pool-size: 3
    raftstore.hibernate-regions: true
    raftstore.raft-max-inflight-msgs: 1024
    raftstore.store-max-batch-size: 1024
    raftstore.store-pool-size: 8
    rocksdb.compaction-readahead-size: 2M
    rocksdb.defaultcf.max-write-buffer-number: 32
    rocksdb.writecf.max-write-buffer-number: 32
    server.grpc-concurrency: 5
    server.max-grpc-send-msg-len: 5242880
    storage.block-cache.capacity: 64G
    storage.reserve-space: 0MB
    storage.scheduler-worker-pool-size: 8
  pd:
    replication.location-labels:
    - zone
    - host
  tidb_dashboard: {}
  tiflash: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
  kvcdc: {}
  grafana: {}
tidb_servers:
- host: 10.10.12.9
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /data/tidb-deploy/tidb-4000
  log_dir: /data/tidb-deploy/tidb-4000/log
  numa_node: "0"
  arch: arm64
  os: linux
- host: 10.10.12.78
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /data/tidb-deploy/tidb-4000
  log_dir: /data/tidb-deploy/tidb-4000/log
  numa_node: "1"
  arch: arm64
  os: linux
tikv_servers:
- host: 10.10.12.78
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /data/tidb-deploy/tikv-20160
  data_dir: /data/tidb-data/tikv-20160
  log_dir: /data/tidb-deploy/tikv-20160/log
  numa_node: "0"
  config:
    server.labels:
      host: h1
      zone: z0
  arch: arm64
  os: linux
- host: 10.10.12.6
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /data/tidb-deploy/tikv-20160
  data_dir: /data/tidb-data/tikv-20160
  log_dir: /data/tidb-deploy/tikv-20160/log
  numa_node: "0"
  config:
    server.labels:
      host: h2
      zone: z0
  arch: arm64
  os: linux
- host: 10.10.12.6
  ssh_port: 22
  port: 20161
  status_port: 20181
  deploy_dir: /data/tidb-deploy/tikv-20161
  data_dir: /data/tidb-data/tikv-20161
  log_dir: /data/tidb-deploy/tikv-20161/log
  numa_node: "1"
  config:
    server.labels:
      host: h2
      zone: z1
  arch: arm64
  os: linux
tiflash_servers: []
pd_servers:
- host: 10.10.12.9
  ssh_port: 22
  name: pd-10.10.12.9-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data/tidb-deploy/pd-2379
  data_dir: /data/tidb-data/pd-2379
  log_dir: /data/tidb-deploy/pd-2379/log
  numa_node: "1"
  arch: arm64
  os: linux
monitoring_servers:
- host: 10.10.12.9
  ssh_port: 22
  port: 9090
  deploy_dir: /data/tidb-deploy/prometheus-9090
  data_dir: /data/tidb-data/prometheus-9090
  log_dir: /data/tidb-deploy/prometheus-9090/log
  external_alertmanagers: []
  arch: arm64
  os: linux
grafana_servers:
- host: 10.10.12.9
  ssh_port: 22
  port: 3000
  deploy_dir: /data/tidb-deploy/grafana-3000
  arch: arm64
  os: linux
  username: admin
  password: admin
  anonymous_enable: false
  root_url: ""
  domain: ""
alertmanager_servers:
- host: 10.10.12.9
  ssh_port: 22
  web_port: 9093
  cluster_port: 9094
  deploy_dir: /data/tidb-deploy/alertmanager-9093
  data_dir: /data/tidb-data/alertmanager-9093
  log_dir: /data/tidb-deploy/alertmanager-9093/log
  arch: arm64
  os: linux
Are you using the default configuration for TiKV?
Each server has 2 NUMA nodes with 40 cores per node. From what I remember of the documentation, some settings have to be changed because I put two TiKV instances on one server.
Yes, with 2 NUMA nodes of 40 vCPUs each, core binding means each bound component can use at most 40 vCPUs.
You only need to set the parameters required for deploying multiple TiKV instances on one host, i.e. limits on memory, CPU, and capacity; the other parameters can be removed.
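For a host carrying two TiKV instances, that mainly means capping each instance's memory, CPU, and disk usage, along the lines of (a sketch; the values are illustrative, not a sizing recommendation for this hardware):

server_configs:
  tikv:
    # Block cache sized against the memory available to ONE instance, not the whole host
    storage.block-cache.capacity: 64G
    # Threads available to ONE instance's unified read pool
    readpool.unified.max-thread-count: 32
    # Cap how much of the shared disk each instance may use
    raftstore.capacity: 1000GB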
Which parameters mainly drive CPU usage? There seem to be quite a few thread-count-related parameters.
Okay, I’ll give it a try, thanks!!
After all of that, the QPS still hasn't changed much.
It feels like several resources are still sitting idle, which is quite strange.
server_configs:
  tidb:
    log.level: error
    performance.run-auto-analyze: false
    performance.txn-total-size-limit: 10485760000
    prepared-plan-cache.enabled: true
    token-limit: 6000
  tikv:
    #coprocessor.split-region-on-table: false
    log-level: error
    raftstore.capacity: 1000GB
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: true
    readpool.unified.max-thread-count: 32
    storage.block-cache.capacity: 64G
    storage.block-cache.shared: true
  pd:
    replication.location-labels:
    - zone
    - host
The configuration file should be fine, right?
By the way, I also want to confirm whether the topology itself could be the cause, although I don't think it should have this big an impact. If nothing else works, I'll try adding another disk.
The server is not performing well.
Your machines are ARM, and ARM itself is relatively weak.
You need to bind the cores properly.
Run some stress tests and check numastat to see how much memory is being accessed across NUMA nodes.
Fix those issues, then tune the thread pools; there are quite a few ARM-specific optimizations. Roughly speaking, work that a single thread could handle before may now need several threads, otherwise that thread becomes the bottleneck.
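To check the binding and the cross-node memory traffic mentioned above, something like this works (a sketch; numactl and numastat ship in the numactl package, and the tikv-server pattern is an assumption about the process name):

# NUMA topology: nodes, CPUs per node, per-node memory
numactl --hardware
# Per-node counters; growing numa_miss / numa_foreign means cross-node access
numastat
# Per-process breakdown for the TiKV instances
numastat -p tikv-server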
That is indeed odd. In theory, as long as no obvious bottleneck has been hit, increasing concurrency should drive CPU usage up.