The official TiDB installation and deployment documentation only mentions that the numactl package needs to be installed on each node; it does not explain how to perform NUMA binding to improve performance. Is NUMA binding configured automatically by the instances of each TiDB role based on the host hardware? If not, how should the configuration be optimized for the hardware setup?
NUMA binding is definitely something you need to configure manually, both the core binding and the NUMA binding strategy, in order to avoid cross-node memory access within the same CPU. A more fine-grained NUMA node binding strategy also gives better isolation of compute resources, reducing interference between multiple instances deployed on the same server.
In my opinion, whether something counts as resource waste depends on what fits within a reasonable budget. TiFlash is performance-hungry and needs a dedicated high-spec physical server, while some of the monitoring components can be co-deployed on existing physical servers.
Currently, the monitoring components sit on the PD nodes, while TiDB Server and TiKV each occupy a separate physical server per instance. If the TiDB Server host has 2 NUMA nodes and 64G of memory, wouldn't binding the TiDB instance to one NUMA node use only half of the memory, leaving 32G idle? Or do I have a misunderstanding of NUMA binding?
The implication is that when deploying TiDB Server (taking TiDB Server as an example) on a server with 2 NUMA nodes, the best practice for fully utilizing memory and CPU and getting the highest possible performance is to deploy 2 TiDB Server instances and bind each one to its own NUMA node. The server's memory capacity then needs to be twice what was originally planned for a single instance.
Is this understanding correct?
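For concreteness, here is a minimal sketch of what such a two-instances-per-host layout might look like in a TiUP topology file, assuming a host with 2 NUMA nodes; the IP address, ports, and directories are made up for illustration:

```yaml
# Sketch: two TiDB instances on one 2-NUMA-node host, each bound to its own node.
# Host IP, ports, and deploy directories are illustrative only.
tidb_servers:
  - host: 10.0.1.10
    port: 4000
    status_port: 10080
    deploy_dir: /data/deploy/tidb-4000
    numa_node: "0"   # bind this instance to NUMA node 0
  - host: 10.0.1.10
    port: 4001
    status_port: 10081
    deploy_dir: /data/deploy/tidb-4001
    numa_node: "1"   # bind this instance to NUMA node 1
```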
Currently, TiDB (64GB memory) / PD (64GB memory) / TiKV (128GB memory) each occupy a dedicated physical machine, 3 TiDB / 3 PD / 3 TiKV in total, making 9 servers. Monitoring and HA are deployed independently. The workload is mainly OLAP. In this scenario, can NUMA core binding improve performance? TiDB currently hits OOM 1-2 times per month, which has been traced to large transactions and problematic SQL queries.
In our production environment, after enabling NUMA core binding, there have been cases where CPU utilization maxed out, and the specific cause has not yet been identified.
NUMA binding in TiDB is done by setting the numa_node parameter for each component instance, which adds numactl --cpunodebind=X,X --membind=X,X to the component's run_xxx script so that the component is started under that binding.
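Roughly, for an instance with numa_node set, the generated run script ends up starting the process through numactl. A simplified sketch follows (the paths and the exact tidb-server flags are trimmed down for illustration, not a verbatim copy of what TiUP generates):

```bash
#!/bin/bash
# Simplified sketch of a run_tidb.sh for an instance deployed with numa_node: "0".
# Real generated scripts carry more flags; only the numactl wrapper matters here.
cd "/data/deploy/tidb-4000" || exit 1
exec numactl --cpunodebind=0 --membind=0 \
    bin/tidb-server \
    -P 4000 \
    --status=10080 \
    --config=conf/tidb.toml \
    --log-file=log/tidb.log
```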
When the memory of a single NUMA node is not enough, memory from multiple nodes has to be used; otherwise a large query in TiDB will run into OOM (Out of Memory). NUMA binding is therefore better suited to machines with ample resources. NUMA itself supports several allocation policies; the default policy is to allocate from the node the process is running on first and, if that node's memory is insufficient, to fall back to other nodes. Since the other nodes are farther away, accessing their memory has higher latency than accessing local memory.
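To see how many nodes a host has, how memory is split between them, and what policy processes run under by default, numactl itself can be used, for example:

```bash
# Show the NUMA layout of the host: node count, CPUs and memory per node, node distances.
numactl --hardware

# Show the NUMA policy the current shell (and any child process) would run under;
# the default policy allocates from the local node first and falls back to remote nodes.
numactl --show
```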
After NUMA binding, a single instance has less memory available to it but runs faster, so it is easier to run into OOM (Out of Memory) after binding. Oracle, for example, recommends not binding cores in order to avoid OOM. At the same time, core binding has a bigger impact on the ARM architecture, so it is a bit of a dilemma.
If the physical machine is configured well enough and memory is plentiful, is NUMA binding still necessary? By default, doesn't a process access memory within its own node first? Or can a request be switched between different CPUs?
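A quick way to check whether remote-node allocations are actually happening on a host is to look at the per-node counters, for example (the PID below is just a placeholder):

```bash
# numa_hit:  allocations satisfied from the intended (local) node
# numa_miss: allocations that had to fall back to another node
# A steadily growing numa_miss suggests cross-node memory allocation is occurring.
numastat

# Per-process view for a running tidb-server process (PID is illustrative).
numastat -p 12345
```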