Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv 节点oom导致自动重启 (TiKV node OOM causing an automatic restart)
[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots / Logs / Monitoring]
Around 19:00 on November 12th, the TiKV node restarted. Checking the TiKV-related logs showed no anomalies, but the system logs showed the following:
I would like to ask, won’t this memory be automatically released?
[TiDB Version]
[Reproduction Steps] What operations were performed when the issue occurred
Please provide these details.
Version information: v6.1.1
Operations performed: Basically no business operations, just checked the dashboard. The time points when issues appeared were all related to some internal SQL and monitoring SQL.
How is the TiKV memory parameter storage.block-cache.capacity configured? How much memory does the machine have? Is it deployed with multiple instances?
First, follow this to troubleshoot:
There are 4 clusters deployed on these 3 machines.
For a multi-instance deployment, set the memory for each TiKV with storage.block-cache.capacity = (MEM_TOTAL * 0.5 / number of TiKV instances).
Setting the memory too high can easily lead to OOM (Out of Memory).
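A minimal sketch of that sizing rule (the function name and the example numbers are hypothetical, not from this thread):

```python
def block_cache_capacity_gib(mem_total_gib: float, tikv_instances_on_host: int) -> float:
    """Per-instance block cache: MEM_TOTAL * 0.5 / number of TiKV instances on the host."""
    return mem_total_gib * 0.5 / tikv_instances_on_host

# e.g. a 512 GiB host shared by 4 TiKV instances (one per cluster):
print(block_cache_capacity_gib(512, 4))  # 64.0 -> set storage.block-cache.capacity to roughly "64GB" each
```

The result is what goes into each instance's storage.block-cache.capacity; by default a standalone TiKV takes roughly 45% of system memory for the block cache, which is why multi-instance hosts need it lowered explicitly.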
The machine’s memory is 527517072, and the cluster’s setting is 214078 MiB. This cluster is composed of 3 servers, with one TiKV node deployed on each machine, so on its own the memory setting is reasonable. However, since these 3 servers host 4 clusters, does that mean it should be 527517072 * 0.5 / 4?
Is the memory setting for the TiKV node too large? The TiKV logs do not show an OOM; only the server’s operating system logs indicate one occurred.
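For reference, a rough back-of-the-envelope check with the numbers above (assuming 527517072 is in KB, as free / /proc/meminfo would report; that unit is my assumption):

```python
mem_total_gib = 527517072 / 1024 / 1024   # ~503 GiB of RAM on the host
recommended   = mem_total_gib * 0.5 / 4   # 4 TiKV instances from 4 clusters -> ~62.9 GiB each
current       = 214078 / 1024             # the configured 214078 MiB -> ~209 GiB
print(f"recommended ~{recommended:.0f} GiB, currently ~{current:.0f} GiB per instance")
```

If the other clusters are sized similarly, the combined block caches alone would exceed the machine's physical memory.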
It’s equivalent to deploying 4 TiKV instances on one server. If that’s the case, the memory allocated to TiKV is indeed too large. The operating system logs showing OOM can confirm this point.
Parameter settings for hybrid deployment: Hybrid Deployment Topology | PingCAP Documentation Center
Won’t this memory be automatically released?
If you have 3 machines and 4 clusters, with balanced distribution, that means each server will have 4 TiKV nodes. Isn’t that too many?
Memory is released slowly, not all at once.
In that case, it also involves sharing memory and storage, right?
However, since the 4 TiKV nodes on each server are not part of the same cluster, there will definitely be resource contention. When resources run out, the Linux OOM killer is likely to kill processes seemingly at random.
Looking at the monitoring, this is a big problem, as the memory has never been released…
This monitoring should be for server memory, not TiDB memory usage. You can check the memory usage of each node to see what is causing it. It might also be an issue with GC.