High IO Utilization on TiKV Nodes

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv节点io占用率高 (High IO utilization on TiKV nodes)

| username: TiDBer_iGcIvrah

【TiDB Usage Environment】Production Environment or Test Environment or POC: Production
【TiDB Version】5.3
【Encountered Issue】High IO usage on TiKV nodes
【Problem Phenomenon and Impact】
The production TiDB cluster currently has 5 TiKV nodes, some configured with 4 cores and 8GB of memory and some with 16 cores and 32GB. The data disks are AWS gp3 volumes (SSD) with a baseline of 3000 IOPS. All TiKV nodes sit at a constant 1500 IOPS, with read/write throughput around 10MB/s and latency around 1s. The issue is very similar to the one described in this blog: http://laddyq.com/article/36733.html. Following the blog, the disk mount parameters were changed to commit=60 and data=writeback, and the TiKV configuration parameter sync-log was set to false, but it still did not help.
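
For context, a sketch of how those mount options could be applied to the data disk; the device name below is an illustrative assumption, not the exact entry used:

# Illustrative /etc/fstab entry for the TiKV data disk with the options tried
# (/dev/nvme1n1 is a placeholder for the actual gp3 device)
/dev/nvme1n1  /tidb/data  ext4  defaults,noatime,nodelalloc,commit=60,data=writeback  0  2

Note that ext4 typically refuses to change the data= mode on a remount, so applying data=writeback usually requires a full umount and mount of the data disk.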

TiKV version
# tiup tikv:v5.3.0 --version
Starting component tikv: /root/.tiup/components/tikv/v5.3.0/tikv-server --version
TiKV
Release Version: 5.3.0
Edition: Community
Git Commit Hash: 6c1424706f3d5885faa668233f34c9f178302f36
Git Commit Branch: heads/refs/tags/v5.3.0
UTC Build Time: 2021-11-19 16:24:14
Rust Version: rustc 1.56.0-nightly (2faabf579 2021-07-27)
Enable Features: jemalloc mem-profiling portable sse protobuf-codec test-engines-rocksdb cloud-aws cloud-gcp
Profile: dist_release

TiUP Cluster Display Information

TiKV node host monitoring

TiDB real-time maximum QPS is only 300

TiKV node IO situation

TiUP Cluster Edit Config Information

global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /tidb/deploy
  data_dir: /tidb/data
  os: linux
  arch: amd64
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /tidb/deploy/monitor-9100
  data_dir: /tidb/data/monitor-9100
  log_dir: /tidb/deploy/monitor-9100/log
server_configs:
  tidb:
    log.enable-timestamp: true
    log.file.max-backups: 3
    log.file.max-days: 3
    log.level: info
    oom-action: log
  tikv:
    raftdb.allow-concurrent-memtable-write: true
    raftdb.max-background-jobs: 4
    raftstore.apply-pool-size: 3
    raftstore.store-pool-size: 3
    readpool.storage.normal-concurrency: 3
    readpool.unified.max-thread-count: 8
    readpool.unified.min-thread-count: 3
    rocksdb.max-background-jobs: 4
    server.grpc-concurrency: 2
    storage.scheduler-worker-pool-size: 10
  pd:
    log.file.max-backups: 3
    log.file.max-days: 3
    log.level: INFO
    metric.interval: 15s
    schedule.max-merge-region-keys: 200000
    schedule.max-merge-region-size: 20
    schedule.patrol-region-interval: 15ms
  tiflash: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
tidb_servers:
- host: 172.23.16.120
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /tidb/deploy/tidb-01
  log_dir: /tidb/deploy/tidb-01/log
  arch: amd64
  os: linux
tikv_servers:
- host: 172.23.25.132
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-03
  data_dir: /tidb/data/tikv-03
  log_dir: /tidb/deploy/tikv-03/log
  config:
    raftdb.allow-concurrent-memtable-write: true
    raftdb.max-background-jobs: 12
    raftstore.apply-pool-size: 9
    raftstore.store-pool-size: 9
    readpool.storage.normal-concurrency: 9
    readpool.unified.max-thread-count: 16
    readpool.unified.min-thread-count: 9
    rocksdb.max-background-jobs: 8
    server.grpc-concurrency: 8
  arch: amd64
  os: linux
- host: 172.23.8.223
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-04
  data_dir: /tidb/data/tikv-04
  log_dir: /tidb/deploy/tikv-04/log
  arch: amd64
  os: linux
- host: 172.23.12.99
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-05
  data_dir: /tidb/data/tikv-05
  log_dir: /tidb/deploy/tikv-05/log
  arch: amd64
  os: linux
- host: 172.23.17.81
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-07
  data_dir: /tidb/data/tikv-07
  log_dir: /tidb/log/tikv-07/
  config:
    raftdb.allow-concurrent-memtable-write: true
    raftdb.max-background-jobs: 12
    raftstore.apply-pool-size: 9
    raftstore.store-pool-size: 9
    raftstore.sync-log: false
    readpool.storage.normal-concurrency: 9
    readpool.unified.max-thread-count: 16
    readpool.unified.min-thread-count: 8
    rocksdb.defaultcf.max-write-buffer-number: 12
    rocksdb.defaultcf.write-buffer-size: 1024MB
    rocksdb.max-background-jobs: 12
    server.grpc-concurrency: 8
    storage.block-cache.capacity: 2GB
    storage.scheduler-worker-pool-size: 15
  arch: amd64
  os: linux
- host: 172.23.8.190
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-08
  data_dir: /tidb/data/tikv-08
  log_dir: /tidb/log/tikv-08
  config:
    raftstore.sync-log: false
  arch: amd64
  os: linux
tiflash_servers:
- host: 172.23.18.56
  ssh_port: 22
  tcp_port: 9000
  http_port: 8123
  flash_service_port: 3930
  flash_proxy_port: 20170
  flash_proxy_status_port: 20292
  metrics_port: 8234
  deploy_dir: /tidb-deploy/tiflash-9000
  data_dir: /tidb/data/tiflash-9000
  log_dir: /tidb-deploy/tiflash-9000/log
  arch: amd64
  os: linux
pd_servers:
- host: 172.23.16.120
  ssh_port: 22
  name: pd-03
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb/deploy/pd-03
  data_dir: /tidb/data/pd-03
  log_dir: /tidb/deploy/pd-03/log
  arch: amd64
  os: linux
- host: 172.23.18.56
  ssh_port: 22
  name: pd-04
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb/deploy/pd-04
  data_dir: /tidb/data/pd-04
  log_dir: /tidb/deploy/pd-04/log
  arch: amd64
  os: linux
TiDB Overview Monitoring
| username: ddhe9527 | Original post link

Check TiKV Details → RocksDB KV → Block Cache hit to see the Block Cache hit rate. Given that your resource configuration is not high, with either 8GB or 32GB of memory, the Block Cache won't be very large, which adds some physical I/O.
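
As a rough way to read the same numbers outside Grafana, you can pull the RocksDB cache counters straight from a TiKV status port. A minimal sketch, assuming the tikv_engine_cache_efficiency metric name and the status port 20180 from the topology above:

# Pull block cache hit/miss counters from one TiKV node (metric name is an assumption)
curl -s http://172.23.25.132:20180/metrics | grep tikv_engine_cache_efficiency
# hit rate ≈ block_cache_hit / (block_cache_hit + block_cache_miss)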

| username: TiDBer_iGcIvrah | Original post link

I upgraded the instance configuration to 8 cores and 64GB. The TiKV node configuration is as follows:

- host: 172.23.8.190
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb/deploy/tikv-08
  data_dir: /tidb/data/tikv-08
  log_dir: /tidb/log/tikv-08
  config:
    raftdb.max-background-jobs: 12
    raftstore.apply-pool-size: 9
    raftstore.store-pool-size: 9
    raftstore.sync-log: false
    readpool.storage.normal-concurrency: 9
    readpool.unified.max-thread-count: 16
    readpool.unified.min-thread-count: 9
    rocksdb.defaultcf.max-write-buffer-number: 12
    rocksdb.defaultcf.write-buffer-size: 512MB
    rocksdb.max-background-jobs: 8
    rocksdb.writecf.max-write-buffer-number: 12
    rocksdb.writecf.write-buffer-size: 512MB
    server.grpc-concurrency: 8
    storage.block-cache.capacity: 36GB
  arch: amd64
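
To roll out a per-node change like this, the usual TiUP flow is edit-config followed by a reload of the affected role; a minimal sketch, with the cluster name tidb-prod as a placeholder:

# Edit the topology, then reload only the TiKV nodes ('tidb-prod' is a placeholder)
tiup cluster edit-config tidb-prod
tiup cluster reload tidb-prod -R tikv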

In a single-threaded write scenario, the TiKV node shows the following:
Disk IOPS stays pinned at 1500, write latency holds at around 1s, and node memory usage is not high.
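
One way to confirm what the device itself is doing during such a test is extended iostat output on the data disk; a minimal sketch, where nvme1n1 is a placeholder for the actual gp3 device:

# Per-second IOPS, MB/s, and latency on the data disk (nvme1n1 is a placeholder)
iostat -dxm 1 nvme1n1
# w/s ≈ write IOPS; w_await is the average per-write latency in ms
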
Node monitoring:

TiKV block cache monitoring:

| username: CuteRay | Original post link

It seems to be a disk issue.
Quoting @h5n1:

Cloud disk performance is poor, and TiDB generates a relatively large volume of reads and writes: every operation and lock requires writing to the Raft log, and compaction adds a lot of read and write traffic on top.
IOPS is split into read and write parts, and the high IOPS figures claimed for cloud disks are mostly achieved by using cache to boost read IOPS. Disk performance also depends on bandwidth and fdatasync: before returning to the client, TiKV must sync written data to disk, ensuring it has been flushed from the buffer to the hardware, via the fdatasync system call.
The disk recommendations for TiKV are a write bandwidth of over 2GB/s and more than 20K fdatasync operations per second; in tests of high-concurrency 4KB direct writes, the P99.99 latency should be under 3ms. You can use the latest version of fio or the pg_test_fsync tool for testing. Add the -fdatasync=1 option, for example, for high concurrency with a 4k write and an fsync on each write:
fio -direct=0 -fdatasync=1 -iodepth=4 -thread -numjobs=4 -rw=write -ioengine=libaio -bs=4k -filename=./fio_test -size=20G -runtime=60 -group_reporting -name=write_test
Performance reference for fdatasync:
Reference 1: Non-NVMe SSD fdatasync/s is about 5~8K/s
Reference 2: Early NVMe fdatasync/s is about 20~50K/s
Reference 3: Current mature PCIe 3 NVMe is about 200~500K/s
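
For the fdatasync rate specifically, pg_test_fsync (shipped with PostgreSQL) gives a quick ops/sec figure to compare against the references above. A minimal sketch, assuming the test file is placed on the gp3 data volume:

# Measure sync-method throughput on the data disk (file path is a placeholder)
pg_test_fsync -f /tidb/data/fsync_test -s 5
# Compare the fdatasync line's ops/sec against the reference numbers above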