TiKV Node OOM Causes Automatic Restart

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 节点oom导致自动重启 (TiKV node OOM causes automatic restart)

| username: 胡杨树旁

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots / Logs / Monitoring]
Around 19:00 on November 12th, a TiKV node restarted. Checking the TiKV logs turned up no anomalies, but the system logs showed that the process had been killed by the kernel OOM killer (the screenshot from the original post is not reproduced here).
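For reference, a minimal sketch of how such kernel OOM records can be found, assuming a Linux host with util-linux `dmesg` (the search phrases are the kernel's usual OOM-killer wording, not lines taken from this incident):

```python
#!/usr/bin/env python3
"""Scan the kernel ring buffer for OOM-killer events (sketch)."""
import re
import subprocess

# Typical phrases the kernel logs when the OOM killer fires.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)

def find_oom_events() -> list[str]:
    # `dmesg -T` prints human-readable timestamps (util-linux).
    output = subprocess.run(
        ["dmesg", "-T"], capture_output=True, text=True, check=True
    ).stdout
    return [line for line in output.splitlines() if OOM_PATTERN.search(line)]

if __name__ == "__main__":
    for event in find_oom_events():
        print(event)
```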


I would like to ask, won’t this memory be automatically released?

| username: Billmay表妹 | Original post link

[TiDB Version]
[Reproduction Steps] What operations were performed when the issue occurred

Please provide these details.

| username: 胡杨树旁 | Original post link

Version information: v6.1.1
Operations performed: basically no business operations, just looking at the dashboard. The times when the problem appeared all coincided with internal SQL and monitoring SQL.

| username: weixiaobing | Original post link

How is the TiKV memory parameter storage.block-cache.capacity configured? How much memory does the machine have? Is it deployed with multiple instances?

| username: h5n1 | Original post link

First, follow this to troubleshoot:

| username: 胡杨树旁 | Original post link

There are 4 clusters deployed on these 3 machines.

| username: 胡杨树旁 | Original post link

Machine configuration

| username: weixiaobing | Original post link

For a multi-instance deployment, set each TiKV instance's memory with storage.block-cache.capacity = (MEM_TOTAL * 0.5 / number of TiKV instances).
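A minimal sketch of that calculation (the 512 GiB machine and 4 instances below are hypothetical example numbers, not this thread's configuration):

```python
def block_cache_capacity_mib(mem_total_mib: int, tikv_instances: int) -> int:
    """Per-instance storage.block-cache.capacity, per the rule above:
    MEM_TOTAL * 0.5 / number of TiKV instances."""
    return int(mem_total_mib * 0.5 / tikv_instances)

# Hypothetical example: a 512 GiB machine running 4 TiKV instances.
cap = block_cache_capacity_mib(512 * 1024, 4)
print(f'storage.block-cache.capacity = "{cap}MiB"')  # -> "65536MiB"
```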

| username: weixiaobing | Original post link

Setting the memory too high can easily lead to OOM (Out of Memory).

| username: 胡杨树旁 | Original post link

The machine’s memory is 527517072, and the cluster is set to 214078 MiB. This cluster is composed of 3 servers, with each machine deploying one TiKV node; in that case the memory setting would be reasonable. However, since these 3 servers host 4 clusters, does that mean it should be 527517072 * 0.5 / 4?
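Plugging the thread's numbers into the formula above (a sketch; it assumes the 527517072 figure is MemTotal in KiB, as `free` and /proc/meminfo report it):

```python
# Assumption: 527517072 is MemTotal in KiB (~503 GiB).
mem_total_mib = 527517072 / 1024       # ~515,153 MiB
instances_per_machine = 4              # one TiKV per cluster, 4 clusters

recommended_mib = mem_total_mib * 0.5 / instances_per_machine
print(f"recommended block cache per instance ~ {recommended_mib:,.0f} MiB")
# ~64,394 MiB, versus the 214078 MiB each cluster is actually configured with.
```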

| username: 胡杨树旁 | Original post link

Is the memory setting for the TiKV node too large? However, the TiKV logs do not show an OOM; only the server’s operating system logs indicate that an OOM occurred.

| username: weixiaobing | Original post link

It’s equivalent to deploying 4 TiKV instances on one server. If that’s the case, the memory allocated to TiKV is indeed too large. The operating system logs showing OOM can confirm this point.

| username: weixiaobing | Original post link

Parameter settings for hybrid deployment: Hybrid Deployment Topology | PingCAP Docs

| username: 胡杨树旁 | Original post link

Won’t this memory be automatically released?

| username: Kongdom | Original post link

:thinking: If you have 3 machines and 4 clusters, with balanced distribution, that means each server will have 4 TiKV nodes. Isn’t that too many?

| username: Kongdom | Original post link

Memory is released slowly, not all at once.

| username: 胡杨树旁 | Original post link

In that case, they also share memory and storage, right?

| username: Kongdom | Original post link

However, since the 4 TiKV nodes on each server are not part of the same cluster, there will definitely be resource contention issues. When resources are insufficient, the Linux mechanism is likely to randomly kill processes :sweat_smile:
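For what it's worth, the kernel's choice is not entirely random: it ranks candidates by oom_score and kills the highest. A small sketch to see which processes the OOM killer would favor (assuming a Linux /proc filesystem):

```python
#!/usr/bin/env python3
"""Rank processes by the kernel's oom_score (Linux /proc only, sketch)."""
import os

def oom_scores():
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/oom_score") as f:
                score = int(f.read())
            with open(f"/proc/{pid}/comm") as f:
                name = f.read().strip()
        except OSError:
            continue  # process exited while we were reading it
        yield score, int(pid), name

# The ten most likely OOM-killer targets; on a host running four TiKV
# instances, the biggest cache holder usually tops this list.
for score, pid, name in sorted(oom_scores(), reverse=True)[:10]:
    print(f"{score:6d}  {pid:>7}  {name}")
```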

| username: 胡杨树旁 | Original post link

Looking at the monitoring, this is a big problem: the memory keeps growing and is never released…

| username: Kongdom | Original post link

This monitoring is probably showing server memory, not TiDB’s own memory usage. Check the memory usage of each node to see what is driving it. It might also be a GC issue.
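One way to check per-process usage on a host (a sketch, assuming Linux /proc and that the TiKV binary is named tikv-server):

```python
#!/usr/bin/env python3
"""Report resident memory (RSS) of every tikv-server process (sketch)."""
import os

PAGE_KIB = os.sysconf("SC_PAGE_SIZE") // 1024  # usually 4 KiB pages

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            if f.read().strip() != "tikv-server":
                continue
        with open(f"/proc/{pid}/statm") as f:
            rss_pages = int(f.read().split()[1])  # 2nd field: resident pages
    except OSError:
        continue
    print(f"pid {pid}: RSS ~ {rss_pages * PAGE_KIB / 1024 / 1024:.1f} GiB")
```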