Frequent OOM after deploying multiple TiKV instances on a single machine

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv单机多实例部署以后,发现经常会出现oom

| username: Vilinder

Deployment details are as follows:
6 servers in total:
3 servers host TiDB and PD (3 instances of each)
3 servers host TiKV (3 instances per server) and TiFlash (1 instance per server), for a total of 9 TiKV and 3 TiFlash instances

Server configuration: Hygon G5 7285 (2.0 GHz / 32 cores / 64 MB cache / 190 W) ×2, with 512 GB of memory

Running

dmesg -T | grep tikv-server or dmesg -T | grep oom

on the KV servers shows that the TiKV service is OOM-killed and restarts every 1 to 3 hours. The tikv_stderr.log is empty.
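
For reference, a minimal sketch of how to confirm which process the kernel's OOM killer actually targeted (generic Linux commands, nothing deployment-specific):

```shell
# OOM-killer events with human-readable timestamps
dmesg -T | grep -i "out of memory"

# Which process was killed, and how much memory it held at the time
dmesg -T | grep -i "killed process"
```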

The global storage.block-cache.capacity was sized as half of (512 GB of memory / 4 instances), i.e. roughly 64 GB, and was set to 62 GB; see the sizing note after the config below.

Global configuration is as follows:

  tikv:
    readpool.storage.use-unified-pool: false
    readpool.coprocessor.use-unified-pool: true
    storage.block-cache.capacity: 62GB
    raftstore.apply-pool-size: 3
    raftstore.store-pool-size: 3
    server.grpc-concurrency: 8
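
The block cache is only one component of TiKV's memory footprint (memtables, Raft, and coprocessor buffers come on top of it), so with three TiKV instances plus one TiFlash sharing a 512 GB host, total usage can plausibly exceed physical memory even though each cache stays under 62 GB. Below is a minimal sketch of a tighter global config, assuming the multi-instance sizing guidance from the TiDB hybrid-deployment docs (roughly MEM_TOTAL * 0.5 / TiKV instances per host) with extra headroom for TiFlash; the figures are illustrative, not verified recommendations:

```yaml
server_configs:
  tikv:
    # Illustrative: leave headroom for memtables, Raft, coprocessor buffers,
    # and the co-located TiFlash instance; the block cache alone does not
    # bound TiKV's total memory.
    storage.block-cache.capacity: 40GB
    # If the TiKV release supports it (check the tikv-configuration-file
    # docs for your version), a hard cap on per-instance memory:
    # memory-usage-limit: 90GB
```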

The TiKV topology for a single host (three instances) is as follows:

tikv_servers:
  - host: 192.168.1.64
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: "/data/1/tidb-deploy/tikv-20160"
    data_dir: "/data/1/tidb-data/tikv-20160"
    log_dir: "/data/1/tidb-deploy/tikv-20160/log"
    numa_node: "0"
    config:
      server.labels:
        host: tikv64
  - host: 192.168.1.64
    ssh_port: 22
    port: 20161
    status_port: 20181
    deploy_dir: "/data/2/tidb-deploy/tikv-20161"
    data_dir: "/data/2/tidb-data/tikv-20161"
    log_dir: "/data/2/tidb-deploy/tikv-20161/log"
    numa_node: "1"
    config:
      server.labels:
        host: tikv64
  - host: 192.168.1.64
    ssh_port: 22
    port: 20162
    status_port: 20182
    deploy_dir: "/data/3/tidb-deploy/tikv-20162"
    data_dir: "/data/3/tidb-data/tikv-20162"
    log_dir: "/data/3/tidb-deploy/tikv-20162/log"
    numa_node: "2"
    config:
      server.labels:
        host: tikv64

The Grafana -> TiKV-Detail -> RocksDB-KV -> Block Cache Size panel (screenshot omitted) shows that the maximum for each TiKV instance has not reached the configured 62 GB.
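
To cross-check the panel without Grafana, the underlying metric can be scraped from each instance's status port; the metric name below is the one the default dashboard is believed to query, so verify it against your dashboard definition:

```shell
# Block cache size as reported by one TiKV instance (status_port 20180)
curl -s http://192.168.1.64:20180/metrics | grep tikv_engine_block_cache_size_bytes
```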

| username: Billmay表妹 | Original post link

What version?

| username: Vilinder | Original post link

TiDB 7.0.0

| username: WalterWj | Original post link

Check the memory usage of resolved-ts (resolve lock)…
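
One hedged way to check this is to filter the resolved-ts related metrics from a TiKV status port (exact metric names vary by version, so the grep below just matches anything related):

```shell
curl -s http://192.168.1.64:20180/metrics | grep -i resolved_ts
```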

| username: 人如其名 | Original post link

In a mixed deployment, it is very likely that tidb-server is consuming a lot of memory, exhausting the operating system's overall memory and causing it to kill the tikv process that is using the most memory. Check the TiDB-related panels to confirm whether TiDB memory keeps growing; if it does, search the logs for expensive SQL statements to see whether they are consuming a lot of memory.
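
Per the TiDB docs on identifying expensive queries, such statements are logged with the expensive_query keyword, so a quick grep on the tidb-server hosts can confirm this (the log path below is illustrative; substitute your actual deploy_dir):

```shell
# Recent expensive-query log entries on a tidb-server host
grep expensive_query /data/1/tidb-deploy/tidb-4000/log/tidb.log | tail -n 20
```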

| username: tidb菜鸟一只 | Original post link

Log on to the TiKV and TiFlash machines directly and check which process is using the most memory.
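
For example, with standard Linux tooling:

```shell
# Top memory consumers by resident set size
ps aux --sort=-rss | head -n 10
```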

| username: 孤君888 | Original post link

Didn’t you use the NUMA binding tool for your mixed deployment?
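
The posted topology does set numa_node ("0", "1", "2"), so one thing worth verifying on a two-socket host is how many NUMA nodes the OS actually exposes (Hygon/EPYC-era parts can expose one or several nodes per socket depending on BIOS and memory-interleave settings); if node 2 does not exist, the binding for the third instance will not do what was intended:

```shell
# NUMA nodes the OS exposes, with per-node memory
numactl --hardware
lscpu | grep -i numa
```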

| username: xingzhenxiang | Original post link

I have four TiKV instances deployed on one machine and haven't encountered any OOM issues so far. Does each instance have its own disk?
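
A quick way to verify that the per-instance disk layout matches the data_dir settings in the topology:

```shell
# /data/1, /data/2, /data/3 should be separate mount points/devices
df -h /data/1 /data/2 /data/3
lsblk
```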

| username: 胡杨树旁 | Original post link

Check the memory usage of each server in the Overview panel → System Info. Once the affected server is located, check whether the problem lies with TiKV or TiFlash.