Cluster size help

Hi, the learner-peer-region-count is about 141K. Is that OK?

I also have another question: given the data volume, storage size, and region count, how many nodes should I have for each component?
Can I scale in the TiKV nodes and consolidate CPU cores and memory (32 cores, 64 GB memory per node) so that I have fewer but more performant nodes?
How many TiDB nodes should I have?

pd-ctl config

Also, here is my PD config:

{
  "replication": {
    "enable-placement-rules": "true",
    "enable-placement-rules-cache": "false",
    "isolation-level": "",
    "location-labels": "",
    "max-replicas": 3,
    "strictly-match-label": "false"
  },
  "schedule": {
    "enable-cross-table-merge": "true",
    "enable-diagnostic": "true",
    "enable-heartbeat-breakdown-metrics": "true",
    "enable-joint-consensus": "true",
    "enable-tikv-split-region": "true",
    "enable-witness": "false",
    "high-space-ratio": 0.7,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 4,
    "hot-regions-reserved-days": 7,
    "hot-regions-write-interval": "10m0s",
    "leader-schedule-limit": 8,
    "leader-schedule-policy": "count",
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-movable-hot-peer-size": 512,
    "max-pending-peer-count": 64,
    "max-snapshot-count": 64,
    "max-store-down-time": "30m0s",
    "max-store-preparing-time": "48h0m0s",
    "merge-schedule-limit": 8,
    "patrol-region-interval": "10ms",
    "region-schedule-limit": 2048,
    "region-score-formula-version": "v2",
    "replica-schedule-limit": 64,
    "slow-store-evicting-affected-store-ratio-threshold": 0.3,
    "split-merge-interval": "1h0m0s",
    "store-limit-version": "v1",
    "switch-witness-interval": "1h0m0s",
    "tolerant-size-ratio": 0,
    "witness-schedule-limit": 4
  }
}
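
For reference, a dump like this can be produced with pd-ctl through TiUP, for example (assuming a TiUP deployment; the component version tag and PD address below are placeholders):

tiup ctl:v8.1.0 pd -u http://<pd-host>:2379 config show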

Application environment:

On virtual machines; resources are described in the resource allocation section below.

region count: about 400,000
storage size: 45 TB

TiDB version:

8.1.0

Problem:

in monito

Resource allocation:

27 TiKV nodes (16 cores, 32 GB memory)
3 TiFlash nodes (32 cores, 100 GB memory)
3 TiDB nodes (16 cores, 32 GB memory)
3 PD nodes (8 cores, 16 GB memory)
1 TiProxy node
All disks are enterprise SSDs.

Attachment:



Please don’t worry about the learner-peer numbers when TiFlash is used in the cluster: in TiFlash, the columnar replicas are asynchronously replicated through the Raft learner mechanism, so a large learner-peer-region-count is expected. As a performance best practice for TiDB, 3-4 TB of storage per TiKV node is the recommended target.
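
As a rough sanity check under that guideline, here is a minimal sketch (it assumes the 45 TB figure is the total on-disk TiKV data, already including the three Raft replicas; adjust it if 45 TB is the logical data size):

# Rough TiKV node-count estimate under the 3-4 TB per node guideline.
# Assumption: 45 TB is the total on-disk TiKV data across all stores.
TOTAL_TB=45
PER_NODE_TB=3                                            # lower end of the guideline
echo $(( (TOTAL_TB + PER_NODE_TB - 1) / PER_NODE_TB ))   # => 15 nodes at ~3 TB each
# At ~4 TB each this drops to ~12 nodes, versus the current 27 nodes at ~1.7 TB each,
# so consolidating onto fewer 32-core / 64 GB nodes looks plausible on capacity alone.

Note that this only covers disk capacity; CPU, memory, and hot-region pressure on the remaining nodes still need to be verified before scaling in.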

Could you collect more detailed information for this cluster with PingCAP Clinic so we can review the cluster's performance and give advice? See the PingCAP Clinic Overview.

To provide detailed diagnostic information for your TiDB cluster using the PingCAP Clinic, you can utilize the Clinic Server, which is a cloud-based service provided by PingCAP. This service securely stores uploaded diagnostic data collected by PingCAP Clinic and facilitates online diagnosis by technical support staff, improving troubleshooting efficiency.

Using PingCAP Clinic for Cluster Diagnostics

  1. Diagnostic Data Collection: The Clinic collects diagnostic data from your TiDB cluster, including information from TiDB, TiKV, PD, TiFlash, TiCDC, Prometheus monitoring, system variables, and node system information (a minimal collection-and-upload sketch follows this list).

  2. Clinic Server: The collected diagnostic data is uploaded to the Clinic Server, which provides a diagnostic service in the SaaS model. It allows for storing, viewing, and analyzing diagnostic data to generate health reports and insights into the cluster’s health and stability.

  3. Health Report: The Clinic Server provides a Health Report feature that analyzes the uploaded diagnostic data and offers insights into the health and stability of your TiDB cluster.

  4. Technical Support: The Clinic Server provides online diagnosis capabilities for technical support staff to troubleshoot cluster problems.
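
A minimal collection-and-upload flow looks roughly like this (a sketch, assuming a TiUP-managed cluster; the cluster name, token, and path below are placeholders):

# Install the Clinic diagnostic component and collect data from the cluster
tiup install diag
tiup diag collect <cluster-name>            # prints the output directory when finished

# Configure the Clinic access token once, then upload the collected data set
tiup diag config clinic.token <your-clinic-token>
tiup diag upload <path-to-collected-data>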

Impact of Learner-Peer-Region-Count on Cluster Performance

The learner-peer-region-count of 141K, which includes TiFlash learner peers, can impact the cluster’s performance and configuration in several ways:

  • Asynchronous Replication: TiFlash uses learner peers for asynchronous replication, which helps in offloading analytical workloads without impacting transactional performance. This setup can improve read performance for analytical queries while maintaining transactional consistency.

  • Storage and Resource Management: With a high number of learner peers, it’s crucial to ensure that the storage and compute resources are adequately provisioned. The best practice of having 3-4TB of storage per TiKV node helps in managing the data efficiently and ensures that the nodes are not overwhelmed.

  • Cluster Configuration: The cluster configuration, including placement rules and store limits, should be tuned to handle the high number of learner peers. This includes making sure the TiFlash nodes are properly configured for multi-disk deployment and that I/O rate limits are set appropriately to manage the data flow (a few read-only inspection commands are sketched after this list).
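
To see where the cluster currently stands on these points, a few read-only checks can help (a sketch; the host, port, and tiup component version tag are placeholders):

# Per-store leader/region counts and space usage (compare against the 3-4 TB guideline)
tiup ctl:v8.1.0 pd -u http://<pd-host>:2379 store

# Current placement rules and per-store scheduling limits
tiup ctl:v8.1.0 pd -u http://<pd-host>:2379 config placement-rules show
tiup ctl:v8.1.0 pd -u http://<pd-host>:2379 store limit

# TiFlash replica status per table; PROGRESS/AVAILABLE reflect the learner replication
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS FROM information_schema.tiflash_replica;"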

By leveraging the PingCAP Clinic and understanding the impact of learner peers, you can ensure that your TiDB cluster is well-optimized for both transactional and analytical workloads.

References:
  • PingCAP Clinic Overview
  • Diagnostic Data Collection
  • Clinic Server
  • Health Report
  • Technical Support
  • TiFlash Overview
  • TiDB Best Practices
  • TiFlash Configuration