How to increase the replication speed on TiFlash?

TiDB version: 7.1.0

Problem:

I’m trying to create TiFlash replicas for my tables. The write speed to the TiFlash disks is about 30 MiB per second, which is slow when replicating several terabytes. How can I increase the TiFlash replication speed?
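
For reference, a sketch of how the replicas are created and how their progress can be checked; the table name test.orders and the replica count of 2 are placeholders, not part of the actual workload:

ALTER TABLE test.orders SET TIFLASH REPLICA 2;

-- PROGRESS goes from 0 to 1; AVAILABLE becomes 1 once the replica is usable.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica;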

Resource allocation:

3 hosts. Characteristics of each:

  • 2 × Intel Xeon 2.2 GHz CPUs, 40 cores / 80 threads
  • 768 GB RAM
  • 10 HDDs of 5.5 TB each
  • 2 network interface cards of 10 Gbit/s each

Cluster Configuration:
Host 1: 1 PD, 1 TiDB, 2 TiKV, 1 TiFlash (storage: 2 HDDs)
Host 2: 1 PD, 1 TiDB, 2 TiKV, 1 TiFlash (storage: 2 HDDs)
Host 3: 1 PD, 1 TiDB, 2 TiKV, 1 TiFlash (storage: 2 HDDs)

Configuration:

global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/data10/tidb-deploy"
  data_dir: "/data9/tidb-data"

monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115

server_configs:
  tidb:
    log.slow-threshold: 300
  tikv:
    readpool.unified.max-thread-count: 24
    readpool.storage.use-unified-pool: true
    readpool.coprocessor.use-unified-pool: true
    storage.scheduler-concurrency: 1024000
  pd:
    replication.location-labels: ["host"]
    replication.enable-placement-rules: true
    schedule.leader-schedule-limit: 4
    schedule.region-schedule-limit: 2048
    schedule.replica-schedule-limit: 64

pd_servers:
  - host: 10.116.173.70
    client_port: 2382
    peer_port: 2383
    deploy_dir: "/data1/tidb-deploy/pd-2382"
    log_dir: "/data1/tidb-deploy/pd-2382/log"
    data_dir: "/data2/tidb-data/pd-2382"
  - host: 10.116.173.71
    deploy_dir: "/data1/tidb-deploy/pd-2379"
    log_dir: "/data1/tidb-deploy/pd-2379/log"
    data_dir: "/data2/tidb-data/pd-2379"
  - host: 10.116.173.72
    deploy_dir: "/data1/tidb-deploy/pd-2379"
    log_dir: "/data1/tidb-deploy/pd-2379/log"
    data_dir: "/data2/tidb-data/pd-2379"

tidb_servers:
  - host: 10.116.173.70
    deploy_dir: "/data1/tidb-deploy/tidb-4000"
    log_dir: "/data1/tidb-deploy/tidb-4000/log"
  - host: 10.116.173.71
    deploy_dir: "/data1/tidb-deploy/tidb-4000"
    log_dir: "/data1/tidb-deploy/tidb-4000/log"
  - host: 10.116.173.72
    deploy_dir: "/data1/tidb-deploy/tidb-4000"
    log_dir: "/data1/tidb-deploy/tidb-4000/log"

tikv_servers:
  - host: 10.116.173.70
    port: 20160
    status_port: 20180
    deploy_dir: "/data3/tidb-deploy/tikv-20160"
    log_dir: "/data3/tidb-deploy/tikv-20160/log"
    data_dir: "/data4/tidb-data/tikv-20160"
    config:
      server.labels: { host: "tikv1" }
  - host: 10.116.173.70
    port: 20161
    status_port: 20181
    deploy_dir: "/data5/tidb-deploy/tikv-20161"
    log_dir: "/data5/tidb-deploy/tikv-20161/log"
    data_dir: "/data6/tidb-data/tikv-20161"
    config:
      server.labels: { host: "tikv1" }
  - host: 10.116.173.71
    port: 20160
    status_port: 20180
    deploy_dir: "/data3/tidb-deploy/tikv-20160"
    log_dir: "/data3/tidb-deploy/tikv-20160/log"
    data_dir: "/data4/tidb-data/tikv-20160"
    config:
      server.labels: { host: "tikv2" }
  - host: 10.116.173.71
    port: 20161
    status_port: 20181
    deploy_dir: "/data5/tidb-deploy/tikv-20161"
    log_dir: "/data5/tidb-deploy/tikv-20161/log"
    data_dir: "/data7/tidb-data/tikv-20161"
    config:
      server.labels: { host: "tikv2" }
  - host: 10.116.173.72
    port: 20160
    status_port: 20180
    deploy_dir: "/data3/tidb-deploy/tikv-20160"
    log_dir: "/data3/tidb-deploy/tikv-20160/log"
    data_dir: "/data4/tidb-data/tikv-20160"
    config:
      server.labels: { host: "tikv3" }
  - host: 10.116.173.72
    port: 20161
    status_port: 20181
    deploy_dir: "/data5/tidb-deploy/tikv-20161"
    log_dir: "/data5/tidb-deploy/tikv-20161/log"
    data_dir: "/data6/tidb-data/tikv-20161"
    config:
      server.labels: { host: "tikv3" }

tiflash_servers:
  - host: 10.116.173.70
    config:
      storage.main.dir: ["/data8/tidb-data/tiflash-9000", "/data9/tidb-data/tiflash-9000"]
      storage.main.capacity: [3298534883328, 3298534883328]
  - host: 10.116.173.71
    config:
      storage.main.dir: ["/data8/tidb-data/tiflash-9000", "/data9/tidb-data/tiflash-9000"]
      storage.main.capacity: [3298534883328, 3298534883328]
  - host: 10.116.173.72
    config:
      storage.main.dir: ["/data8/tidb-data/tiflash-9000", "/data9/tidb-data/tiflash-9000"]
      storage.main.capacity: [3298534883328, 3298534883328]

monitoring_servers:
  - host: 10.116.173.69
    port: 9091

grafana_servers:
  - host: 10.116.173.69

alertmanager_servers:
  - host: 10.116.173.69

According to the official documentation, before TiFlash replicas are added, each TiKV instance performs a full table scan and sends the scanned data to TiFlash as a "snapshot" to create the replicas. By default, TiFlash replicas are added slowly, using fewer resources, to minimize the impact on the online service. If your TiKV and TiFlash nodes have spare CPU and disk IO resources, you can accelerate TiFlash replication by temporarily increasing the snapshot write speed limit on each TiKV and TiFlash instance.
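
Before changing anything, you can check the current values with SHOW CONFIG. A sketch; the TiFlash item name is the one used in the linked documentation and is an assumption for this deployment:

SHOW CONFIG WHERE type = 'tikv' AND name = 'server.snap-io-max-bytes-per-sec';
SHOW CONFIG WHERE type = 'tiflash' AND name = 'raftstore-proxy.server.snap-io-max-bytes-per-sec';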

Two limits are involved. The snapshot write speed limit on each TiKV and TiFlash instance can be raised temporarily with the Dynamic Config SQL statement; the default value for both configurations is 100MiB, i.e. each instance uses no more than 100MiB/s of disk bandwidth for writing snapshots. Separately, PD throttles how quickly new TiFlash replicas are scheduled: the default new replica speed limit is 30, meaning that roughly 30 Regions add TiFlash replicas per minute on each TiFlash instance.
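
A sketch of raising the snapshot write speed limit with SET CONFIG; the configuration item names are taken from the PingCAP documentation linked at the end, so verify them against the docs for your exact version:

-- Takes effect immediately on all TiKV and TiFlash instances; no restart required.
-- 300MiB triples the 100MiB default.
SET CONFIG tikv `server.snap-io-max-bytes-per-sec` = '300MiB';
SET CONFIG tiflash `raftstore-proxy.server.snap-io-max-bytes-per-sec` = '300MiB';

These changes alone will not make replication visibly faster, because the speed is still capped globally by PD's new replica speed limit. To double the scheduling speed for all TiFlash instances, execute the following PD Control command: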

tiup ctl:v7.1.0 pd -u http://<PD_ADDRESS>:2379 store limit all engine tiflash 60 add-peer

Within a few minutes, you will observe a significant increase in CPU and disk IO resource usage of the TiFlash nodes, and TiFlash should create replicas faster. At the same time, the TiKV nodes’ CPU and disk IO resource usage increases as well. If the TiKV and TiFlash nodes still have spare resources at this point and the latency of your online service does not increase significantly, you can further ease the limit, for example, triple the original speed:

tiup ctl:v7.1.0 pd -u http://<PD_ADDRESS>:2379 store limit all engine tiflash 90 add-peer

After the TiFlash replication is complete, you should revert to the default configuration to reduce the impact on online services. You can execute the following PD Control command to restore the default new replica speed limit:

tiup ctl:v7.1.0 pd -u http://<PD_ADDRESS>:2379 store limit all engine tiflash 30 add-peer
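
If you raised the snapshot write speed limit with the SET CONFIG statements above, restore it to its documented default of 100MiB as well (same configuration item names as before):

SET CONFIG tikv `server.snap-io-max-bytes-per-sec` = '100MiB';
SET CONFIG tiflash `raftstore-proxy.server.snap-io-max-bytes-per-sec` = '100MiB';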

Please note that the above steps are not applicable to TiDB Cloud.

Reference: Use TiFlash | PingCAP Docs