Consultation on tikv_server Memory Issues

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv_server 内存问题咨询

| username: 等一分钟

Does the tikv_server node ever run into memory OOM issues? In our production environment, I want to move some memory from the tikv_server node over to the tidb_server node.

| username: 等一分钟 | Original post link

Why doesn’t the tidb_server node use an LRU algorithm to evict old data instead of running into OOM?

| username: TiDBer_pkQ5q1l0 | Original post link

OOM is an operating system behavior.

| username: 等一分钟 | Original post link

The memory usage of the tidb_server node doesn’t seem to be very high, right?

| username: TiDBer_pkQ5q1l0 | Original post link

If you hit a bug, memory usage will keep climbing until the process eventually OOMs (runs out of memory).

| username: Jellybean | Original post link

It definitely can happen.
Below is my response to another TiKV OOM thread (tidb tikv 节点内存不停增长到oom 限制大小被kill 重启后继续增长 - #8, from TiDBer_ZsnVPQB4, on the TiDB Q&A forum).

For reference only:

TiKV node memory OOM generally occurs in two situations:

  • The TiKV block cache is set too large
  • The coprocessor reads data into TiKV memory faster than gRPC can send the results back to the TiDB server, so the read data accumulates in TiKV memory and eventually causes OOM

For these two situations, first check whether the TiKV block cache parameters are reasonable.
If those parameters are reasonable, then investigate the SQL that was running against the cluster at the time. You can check it through the dashboard, or log into the machine and look at the TiDB server's slow query logs, and grep tidb.log for expensive queries. In most cases you will find the SQL responsible; see the sketch below.
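
For reference, a minimal sketch of these checks in SQL. The time window below is made up for illustration, and CLUSTER_SLOW_QUERY assumes a reasonably recent TiDB version (older releases only expose the per-instance SLOW_QUERY table):

```sql
-- Check the TiKV block cache size (by default it is roughly 45% of the
-- machine's memory, which is usually too large when TiKV shares a host).
SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.block-cache.capacity';

-- Find the most memory-hungry statements around the time of the OOM.
SELECT time, instance, query_time, mem_max, query
FROM information_schema.cluster_slow_query
WHERE time BETWEEN '2023-01-01 10:00:00' AND '2023-01-01 11:00:00'
ORDER BY mem_max DESC
LIMIT 10;
```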

Then look at the execution plan of that SQL, figure out where the problem is, and choose the appropriate way to handle it; see the example below.
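
For example, assuming the culprit turned out to be a large aggregation (the table and columns here are hypothetical), you could confirm where the time and memory go like this:

```sql
-- EXPLAIN ANALYZE actually executes the statement and reports per-operator
-- execution time and memory usage, which shows whether a heavy coprocessor
-- scan or a large hash aggregation/join is what is eating the memory.
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY customer_id;
```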

| username: 等一分钟 | Original post link

Well, I rarely run into that issue; in my case it is always the tidb_server memory that gets exhausted.

| username: 人如其名 | Original post link

When you deploy TiKV on the same machine as the TiDB server, TiKV's own caches already consume a lot of memory. If statements on the TiDB server push memory usage up further, the operating system will by default kill the process using the most memory, which usually means TiKV gets killed. For critical systems, it is best to deploy the TiDB server and TiKV separately.

| username: tidb菜鸟一只 | Original post link

The TiKV server node also carries the risk of memory OOM. In a production environment, if you shift memory from the TiKV server node to the TiDB server node, an OOM in either process may end up getting TiKV killed. Keep in mind that killing the TiDB process is generally inconsequential, but killing the TiKV process has a significant impact on the business, so separating them is definitely the more reasonable choice. If you really want to put them on the same machine, the only option is NUMA binding to keep TiDB and TiKV from affecting each other; see the sketch below.
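
If you do co-locate them, a minimal sketch of NUMA binding in a tiup topology file might look like the following (the host address and NUMA node numbers are made-up examples, and numactl must be installed on the target machine):

```yaml
# Pin each component to its own NUMA node so TiDB and TiKV
# do not compete for the same memory and CPU.
tidb_servers:
  - host: 10.0.1.1
    numa_node: "0"
tikv_servers:
  - host: 10.0.1.1
    numa_node: "1"
```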