TiKV Cluster Fails to Start After Reboot Due to OOM

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv集群重启后-oom起不来

| username: Ann_ann

[TiDB Usage Environment] Production Environment
[TiDB Version] TiDB-v3.0.16
[Reproduction Path] After the cluster restarts, the TiKV node fails to start and is continuously killed by OOM.
[Encountered Problem: Symptoms and Impact]
System error log:

TiKV log:

| username: Fly-bird

Take a look at the resource utilization of TiKV in the cluster.

| username: 芮芮是产品

We have to wait for the official response. TiKV OOM is always a bug.

| username: Ann_ann

Resource utilization is low, and adding memory didn’t help either.

| username: xingzhenxiang

Your version is even older than my 3.1.0. If only this node fails to start, could you handle it by scaling out a new node and then scaling in this one?

| username: 像风一样的男子

It’s too old, consider upgrading.

| username: xingzhenxiang

That upgrade advice sounds very official, haha.

| username: 像风一样的男子

There’s nothing I can do, I can’t find the documentation for version 3.0. Sorry, I can’t help.

| username: xingzhenxiang

Here’s the documentation link, no need to thank me :joy_cat:

| username: Ann_ann

That is exactly what I am planning to do.

| username: xingzhenxiang

Our goal is to solve the problem, not to dwell on it, haha.

| username: Ann_ann

Is there a way to solve this?

| username: xingzhenxiang

I installed v3.1.0 and set the following in the installation file. I’m not sure whether your version accepts the same setting:
storage.block-cache.capacity: 15G
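
For context on the setting above: as far as I know, when storage.block-cache.capacity is not set, TiKV sizes its shared block cache at roughly 45% of system memory, which can be too much when several TiKV instances or other services share one host. The Python sketch below only illustrates that arithmetic. The function names are hypothetical, the 0.45 fraction is an assumption that should be checked against the documentation for your version, and the `G` suffix simply mirrors the value quoted above.

```python
import math


def suggest_block_cache_gib(total_mem_gib: float,
                            tikv_instances: int = 1,
                            fraction: float = 0.45) -> int:
    """Split `fraction` of host memory evenly across the TiKV instances
    on one machine and round down to whole GiB.

    `fraction=0.45` is an assumed default, not a value taken from this thread.
    """
    if total_mem_gib <= 0:
        raise ValueError("total_mem_gib must be positive")
    if tikv_instances < 1:
        raise ValueError("tikv_instances must be at least 1")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return math.floor(total_mem_gib * fraction / tikv_instances)


def render_setting(capacity_gib: int) -> str:
    """Format the value in the same style as the line quoted in the post."""
    return f"storage.block-cache.capacity: {capacity_gib}G"


if __name__ == "__main__":
    # Example: a 64 GiB host running two TiKV instances.
    print(render_setting(suggest_block_cache_gib(64, tikv_instances=2)))
```

The block cache is only part of what TiKV uses, so the process as a whole will need more memory than the configured capacity.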

| username: 芮芮是产品

First, try adding more memory to see if it will still OOM.

| username: Ming

It depends on the environment. Is it a mixed deployment? What is the block cache size set to? How much memory does the server itself have?
With default settings, TiKV rarely runs into OOM (Out of Memory). It is possible that other programs on your server are occupying memory, so the server hits its memory limit and TiKV gets killed.
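
One way to check whether other programs are holding memory on the same server is to rank processes by resident memory. The sketch below assumes a Linux host and reads the `Name` and `VmRSS` fields from `/proc/<pid>/status`; the helper names `parse_status` and `top_rss` are hypothetical and only meant as an illustration.

```python
import os
import re
from typing import List, Optional, Tuple

_NAME_RE = re.compile(r"^Name:\s+(.+)$", re.MULTILINE)
_RSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_status(text: str) -> Optional[Tuple[str, int]]:
    """Return (process name, resident memory in kB) from the contents of a
    /proc/<pid>/status file, or None if either field is missing
    (kernel threads, for example, have no VmRSS line)."""
    name = _NAME_RE.search(text)
    rss = _RSS_RE.search(text)
    if not name or not rss:
        return None
    return name.group(1).strip(), int(rss.group(1))


def top_rss(n: int = 5, proc_root: str = "/proc") -> List[Tuple[str, int]]:
    """List the n processes with the largest resident memory."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                parsed = parse_status(f.read())
        except OSError:
            continue  # the process exited or is not readable
        if parsed:
            rows.append(parsed)
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


if __name__ == "__main__":
    for name, rss_kb in top_rss():
        print(f"{name:<20} {rss_kb / 1024 ** 2:.2f} GiB")
```

If tikv-server is the only large entry, memory pressure from other programs is unlikely to be the cause.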

| username: Ann_ann

Adding memory didn’t help. Memory usage was around 16 GB before, but after the cluster failed and was restarted, TiKV kept using up all available memory during startup and was killed by OOM every time.
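
To confirm which process the kernel killed and how much memory it held at that moment, the kernel log can be scanned for OOM kill records. The sketch below assumes lines shaped like the usual Linux `Killed process <pid> (<name>) ... anon-rss:<n>kB` message; the exact wording varies between kernel versions, and the names `OomKill` and `find_oom_kills` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed message shape; adjust the pattern if your kernel prints it differently.
_KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r"(?:.*?anon-rss:(?P<rss>\d+)kB)?"
)


class OomKill(NamedTuple):
    pid: int
    name: str
    anon_rss_gib: Optional[float]  # None when the line carries no anon-rss field


def find_oom_kills(lines: Iterable[str]) -> List[OomKill]:
    """Extract OOM kill records from kernel log lines (for example the
    output of `dmesg`)."""
    kills = []
    for line in lines:
        match = _KILL_RE.search(line)
        if not match:
            continue
        rss_kb = match.group("rss")
        rss_gib = round(int(rss_kb) / 1024 ** 2, 2) if rss_kb else None
        kills.append(OomKill(int(match.group("pid")), match.group("name"), rss_gib))
    return kills
```

Comparing the reported resident size with the server's total memory shows whether TiKV alone exhausted it or was killed while other processes held most of it.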

| username: 芮芮是产品

The question is whether you deployed TiKV and TiDB on the same machine.

| username: ShawnYan

How about limiting it with cgroups? Have you tried that before?

| username: yulei7633

If there is enough memory but TiKV is still killed by OOM, check the permissions in the second image. If it still doesn’t work, try scaling the node in and then scaling it out again.

| username: 路在何chu

The version is even older than ours.