The TiDB cluster I built has poor read performance, and CPU, memory, and disk utilization are all very low

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 搭出来的Tidb集群读性能差,cpu、内存、磁盘利用率都非常低

| username: Leox

Cluster topology

Resource utilization:



Diagnostic results:

The test command is:
sysbench --config-file=config1 oltp_point_select --tables=5 --table-size=100000000 --threads=64 run

It feels like the resources are not being fully utilized. I don’t know which part of the process is going wrong. :sob:

| username: WalterWj | Original post link

What is the CPU usage of the sysbench server? The sysbench config can list multiple TiDB IPs directly, so no load balancer is needed. If there is no bottleneck in sysbench, in the load balancing layer, or in the database, then increase the load: if 64 threads are not enough, raise it to 256.
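For illustration only, a sketch of what config1 could look like with both TiDB endpoints listed; the credentials, database name, and run duration below are assumptions, not values from the thread. sysbench's MySQL driver accepts a comma-separated host list and spreads connections across it, which is what avoids the need for a load balancer here:

# hypothetical config1 (sysbench option file, one option per line)
mysql-host=10.10.12.9,10.10.12.78    # both TiDB servers from the topology, no load balancer
mysql-port=4000
mysql-user=root                      # assumed credentials
mysql-password=
mysql-db=sbtest                      # assumed test database
threads=256                          # raise from 64 if nothing is saturated
time=300
report-interval=10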

| username: Leox | Original post link

The CPU of the sysbench server is far from being fully utilized, and increasing the load has not brought any performance improvement.

| username: WalterWj | Original post link

Also, your cluster deployment is not very reasonable. Why are there only two servers for TiKV, and why does one of them run two instances?

| username: Leox | Original post link

Because I only found two high-performance hard drives. :joy:

| username: WalterWj | Original post link

Haha, it looks like a hastily put together server.

| username: WalterWj | Original post link

Please send the overview screenshots of CPU and memory. Let’s first check the CPU, memory, and IO status of the server.
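For reference, the same CPU, memory, and IO status can also be sampled directly on each server while the test runs; these are standard Linux tools, not commands taken from the thread:

top -b -n 1 | head -20        # snapshot of CPU usage and load average
free -g                       # memory usage in GiB
iostat -x 1 5                 # per-device IO utilization (from the sysstat package)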

| username: Leox | Original post link

2 TiKV servers

TiKV + TiDB servers

TiDB + PD servers

It still feels far from any bottleneck :dotted_line_face:

The QPS of point_select measured in this way is only 30k+

| username: WalterWj | Original post link

I can't quite tell what's going on. Did you change any configuration?
Send the output of tiup cluster show-config so we can check whether the configuration is as expected.
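For reference, a minimal way to dump the topology and effective configuration, assuming the cluster is named tidb-test (the actual cluster name is not given in the thread):

tiup cluster display tidb-test        # node list and status
tiup cluster show-config tidb-test    # effective topology and configuration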

| username: Leox | Original post link

global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /data/tidb-deploy
  data_dir: /data/tidb-data
  os: linux
  arch: arm64
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /data/tidb-deploy/monitor-9100
  data_dir: /data/tidb-data/monitor-9100
  log_dir: /data/tidb-deploy/monitor-9100/log
server_configs:
  tidb:
    log.level: error
    performance.max-procs: 32
    performance.run-auto-analyze: false
    performance.txn-total-size-limit: 10485760000
    prepared-plan-cache.enabled: true
    token-limit: 3001
  tikv:
    coprocessor.split-region-on-table: false
    log-level: error
    raftdb.max-background-jobs: 16
    raftstore.apply-max-batch-size: 1024
    raftstore.apply-pool-size: 3
    raftstore.hibernate-regions: true
    raftstore.raft-max-inflight-msgs: 1024
    raftstore.store-max-batch-size: 1024
    raftstore.store-pool-size: 8
    rocksdb.compaction-readahead-size: 2M
    rocksdb.defaultcf.max-write-buffer-number: 32
    rocksdb.writecf.max-write-buffer-number: 32
    server.grpc-concurrency: 5
    server.max-grpc-send-msg-len: 5242880
    storage.block-cache.capacity: 64G
    storage.reserve-space: 0MB
    storage.scheduler-worker-pool-size: 8
  pd:
    replication.location-labels:
    - zone
    - host
  tidb_dashboard: {}
  tiflash: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
  kvcdc: {}
  grafana: {}
tidb_servers:
- host: 10.10.12.9
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /data/tidb-deploy/tidb-4000
  log_dir: /data/tidb-deploy/tidb-4000/log
  numa_node: "0"
  arch: arm64
  os: linux
- host: 10.10.12.78
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /data/tidb-deploy/tidb-4000
  log_dir: /data/tidb-deploy/tidb-4000/log
  numa_node: "1"
  arch: arm64
  os: linux
tikv_servers:
- host: 10.10.12.78
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /data/tidb-deploy/tikv-20160
  data_dir: /data/tidb-data/tikv-20160
  log_dir: /data/tidb-deploy/tikv-20160/log
  numa_node: "0"
  config:
    server.labels:
      host: h1
      zone: z0
  arch: arm64
  os: linux
- host: 10.10.12.6
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /data/tidb-deploy/tikv-20160
  data_dir: /data/tidb-data/tikv-20160
  log_dir: /data/tidb-deploy/tikv-20160/log
  numa_node: "0"
  config:
    server.labels:
      host: h2
      zone: z0
  arch: arm64
  os: linux
- host: 10.10.12.6
  ssh_port: 22
  port: 20161
  status_port: 20181
  deploy_dir: /data/tidb-deploy/tikv-20161
  data_dir: /data/tidb-data/tikv-20161
  log_dir: /data/tidb-deploy/tikv-20161/log
  numa_node: "1"
  config:
    server.labels:
      host: h2
      zone: z1
  arch: arm64
  os: linux
tiflash_servers: []
pd_servers:
- host: 10.10.12.9
  ssh_port: 22
  name: pd-10.10.12.9-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /data/tidb-deploy/pd-2379
  data_dir: /data/tidb-data/pd-2379
  log_dir: /data/tidb-deploy/pd-2379/log
  numa_node: "1"
  arch: arm64
  os: linux
monitoring_servers:
- host: 10.10.12.9
  ssh_port: 22
  port: 9090
  deploy_dir: /data/tidb-deploy/prometheus-9090
  data_dir: /data/tidb-data/prometheus-9090
  log_dir: /data/tidb-deploy/prometheus-9090/log
  external_alertmanagers: []
  arch: arm64
  os: linux
grafana_servers:
- host: 10.10.12.9
  ssh_port: 22
  port: 3000
  deploy_dir: /data/tidb-deploy/grafana-3000
  arch: arm64
  os: linux
  username: admin
  password: admin
  anonymous_enable: false
  root_url: ""
  domain: ""
alertmanager_servers:
- host: 10.10.12.9
  ssh_port: 22
  web_port: 9093
  cluster_port: 9094
  deploy_dir: /data/tidb-deploy/alertmanager-9093
  data_dir: /data/tidb-data/alertmanager-9093
  log_dir: /data/tidb-deploy/alertmanager-9093/log
  arch: arm64
  os: linux

| username: WalterWj | Original post link

  1. Try not to change the TiKV configuration if you can avoid it.
  2. I noticed that TiDB is NUMA-bound, and each instance is bound to only one node. How many NUMA nodes does lscpu show on your servers, and how many CPUs does each server have?
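For reference, the NUMA layout in question can be checked with standard tools; the commands below are not from the thread, just the usual way to answer this:

lscpu | grep -i numa          # NUMA node count and per-node CPU ranges
numactl --hardware            # per-node CPU lists and memory sizes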

| username: Leox | Original post link

So TiKV should just use the default configuration? One server has 2 NUMA nodes, each with 40 cores. From what I remember of the documentation, some settings need to be changed because I put two TiKV instances on one server. :joy:

| username: WalterWj | Original post link

Yes, if it’s a 2-node setup with each node having 40 vCPUs, binding the cores means that the corresponding components can use up to 40 vCPUs.

You just need to configure the parameters required for deploying multiple TiKV instances, such as memory, CPU, and capacity. You can remove the other parameters.

| username: Leox | Original post link

Which parameters mainly need to be sized for CPU? There seem to be quite a few parameters related to thread counts.

| username: WalterWj | Original post link

This one: Hybrid Deployment Topology | PingCAP Archived Documentation
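For illustration, the CPU, memory, and capacity parameters that doc has you size per TiKV instance look roughly like the sketch below; the numbers are placeholders matching what appears later in this thread, not values prescribed by the doc:

server_configs:
  tikv:
    readpool.unified.max-thread-count: 32    # thread budget for the unified read pool; the doc gives a formula based on cores and instance count
    readpool.storage.use-unified-pool: true
    readpool.coprocessor.use-unified-pool: true
    storage.block-cache.capacity: 64G        # share of the node's memory, divided between co-located instances
    raftstore.capacity: 1000GB               # disk capacity divided between instances sharing the same disk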

| username: Leox | Original post link

Okay, I’ll give it a try, thanks!!

| username: Leox | Original post link

After all the operations, the QPS still hasn’t changed much.

It feels like several resources are still not being utilized, which is quite strange :disappointed_relieved:

server_configs:
  tidb:
    log.level: error
    performance.run-auto-analyze: false
    performance.txn-total-size-limit: 10485760000
    prepared-plan-cache.enabled: true
    token-limit: 6000
  tikv:
    #coprocessor.split-region-on-table: false
    log-level: error
    raftstore.capacity: 1000GB
    readpool.coprocessor.use-unified-pool: true
    readpool.storage.use-unified-pool: true
    readpool.unified.max-thread-count: 32
    storage.block-cache.capacity: 64G
    storage.block-cache.shared: true
  pd:
    replication.location-labels:
    - zone
    - host

The configuration file should be fine, right?

By the way, I also want to confirm whether the topology itself could be the cause, though I don't think it should have such a big impact :rofl: If nothing else works, I'll try adding another disk.

| username: Running | Original post link

The server is not performing well.

| username: TiDBer_jYQINSnf | Original post link

Your machine is ARM, and ARM cores are relatively weak :crazy_face:
You need to get the NUMA binding right.
Run the stress test and check numastat to see how much memory is being accessed across nodes.
Sort those issues out and tune the thread pools; there are quite a few optimizations specific to ARM machines. Basically, work that one thread could handle before may now need several threads, otherwise that thread becomes the bottleneck.
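A sketch of that cross-node memory check while the benchmark runs (tikv-server is TiKV's process name; adjust for whichever component you want to inspect):

numastat                      # system-wide per-node allocation hit/miss counters
numastat -p tikv-server       # per-node memory breakdown for TiKV processes
numactl --hardware            # confirm which CPUs and memory belong to each node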

| username: WalterWj | Original post link

That doesn't add up. In theory, if no obvious bottleneck is found, increasing concurrency should definitely push CPU usage up.