The official TiDB installation and deployment documentation only mentions that the numactl package needs to be installed on each node; it does not explain how to perform NUMA binding to improve performance. Is NUMA binding configured automatically by the instances of each TiDB role based on the host hardware? If not, how should the configuration be optimized for the hardware setup?
NUMA binding is definitely something you need to configure manually, both the core binding and the NUMA binding strategy, in order to avoid cross-node memory access within the same CPU. A more fine-grained NUMA node binding strategy also gives better isolation of compute resources, reducing interference between multiple instances deployed on the same server.
In my opinion, whether something counts as resource waste depends on what fits within a reasonable budget. TiFlash is performance-hungry and needs a dedicated high-spec physical server, while some of the monitoring components can be co-deployed on existing physical servers.
Currently, the monitoring components sit on the PD nodes, while TiDB Server and TiKV each occupy a separate physical server per instance. If the TiDB Server host has 2 NUMA nodes and 64G of memory, wouldn't binding the TiDB instance to one NUMA node use only half of the memory, leaving 32G idle? Or do I have a misunderstanding of NUMA binding?
The implication is that when deploying TiDB Server (taking TiDB Server as an example) on a server with 2 NUMA nodes, the best practice for fully utilizing memory and CPU and getting the highest possible performance is to deploy 2 TiDB Server instances and bind each one to its own NUMA node. The server's memory capacity then needs to be twice what was originally planned for a single instance.
Is this understanding correct?
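For concreteness, here is a minimal sketch of what such a two-instances-per-host layout might look like in a TiUP topology file, assuming a host with 2 NUMA nodes; the IP address, ports, and directories are made up for illustration:

```yaml
# Sketch: two TiDB instances on one 2-NUMA-node host, each bound to its own node.
# Host IP, ports, and deploy directories are illustrative only.
tidb_servers:
  - host: 10.0.1.10
    port: 4000
    status_port: 10080
    deploy_dir: /data/deploy/tidb-4000
    numa_node: "0"   # bind this instance to NUMA node 0
  - host: 10.0.1.10
    port: 4001
    status_port: 10081
    deploy_dir: /data/deploy/tidb-4001
    numa_node: "1"   # bind this instance to NUMA node 1
```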
Currently, TiDB (64GB memory) / PD (64GB memory) / TiKV (128GB memory) each occupy a dedicated physical machine, 3 TiDB / 3 PD / 3 TiKV in total, making 9 servers. Monitoring and HA are deployed independently. The workload is mainly OLAP. In this scenario, can NUMA core binding improve performance? TiDB currently hits OOM 1-2 times per month, which has been traced to large transactions and problematic SQL queries.
In our production environment, after enabling NUMA core binding, there have been cases where CPU utilization maxed out, and the specific cause has not yet been identified.
NUMA binding in TiDB is done by setting the numa_node parameter for each component instance, which adds numactl --cpunodebind=X,X --membind=X,X to the component's run_xxx script so that the component is started under that binding.
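Roughly, for an instance with numa_node set, the generated run script ends up starting the process through numactl. A simplified sketch follows (the paths and the exact tidb-server flags are trimmed down for illustration, not a verbatim copy of what TiUP generates):

```bash
#!/bin/bash
# Simplified sketch of a run_tidb.sh for an instance deployed with numa_node: "0".
# Real generated scripts carry more flags; only the numactl wrapper matters here.
cd "/data/deploy/tidb-4000" || exit 1
exec numactl --cpunodebind=0 --membind=0 \
    bin/tidb-server \
    -P 4000 \
    --status=10080 \
    --config=conf/tidb.toml \
    --log-file=log/tidb.log
```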
When the memory of a single NUMA node is not enough, memory from multiple nodes has to be used; otherwise a large query in TiDB will run into OOM (Out of Memory). NUMA binding is therefore better suited to machines with ample resources. NUMA itself supports several allocation policies; the default policy is to allocate from the node the process is running on first and, if that node's memory is insufficient, to fall back to other nodes. Since the other nodes are farther away, accessing their memory has higher latency than accessing local memory.
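To see how many nodes a host has, how memory is split between them, and what policy processes run under by default, numactl itself can be used, for example:

```bash
# Show the NUMA layout of the host: node count, CPUs and memory per node, node distances.
numactl --hardware

# Show the NUMA policy the current shell (and any child process) would run under;
# the default policy allocates from the local node first and falls back to remote nodes.
numactl --show
```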
After NUMA binding, a single instance has less memory available to it but runs faster, so it is easier to run into OOM (Out of Memory) after binding. Oracle, for example, recommends not binding cores in order to avoid OOM. At the same time, core binding has a bigger impact on the ARM architecture, so it is a bit of a dilemma.
If the physical machine is configured well enough and memory is plentiful, is NUMA binding still necessary? By default, doesn't a process access memory within its own node first? Or can a request be switched between different CPUs?
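A quick way to check whether remote-node allocations are actually happening on a host is to look at the per-node counters, for example (the PID below is just a placeholder):

```bash
# numa_hit:  allocations satisfied from the intended (local) node
# numa_miss: allocations that had to fall back to another node
# A steadily growing numa_miss suggests cross-node memory allocation is occurring.
numastat

# Per-process view for a running tidb-server process (PID is illustrative).
numastat -p 12345
```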