TIKV Out of Memory (OOM)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV OOM

| username: Atlan

[TiDB Usage Environment] Production Environment / Testing / PoC
System Version: Ubuntu 20.04
TiDB Version: v6.5.1 / v6.5.0
[Reproduction Path]
During e2e testing, TiKV memory usage keeps growing even after the load stabilizes; a long-running e2e test eventually triggers an OOM.
[Encountered Problem: Problem Phenomenon and Impact]


TiKV restarts after being OOM-killed.
[Resource Configuration]
Host Resource Configuration:
CPU: 16 cores
Memory: 128 GB
Mechanical disk: 20 TB

Monitoring Screenshot


Heap pprof
000001.heap (55.9 KB)

| username: Atlan | Original post link

metricstools tidb-test-TiKV-Details_2023-03-28T08_59_55.204Z.json (14.0 MB)

| username: Atlan | Original post link

@h5n1

| username: Atlan | Original post link

Situation Description:
Three hosts: 16C 128G 20T*3
Three TiDB, three PD, six TiKV (with multiple disks), and sufficient host resources.
Note: TiKV is co-located with business workloads, but we have confirmed that the OOM was not caused by the business or other base components consuming resources and getting TiKV killed by mistake.

| username: magic | Original post link

Did you post this again? :thinking:

| username: xfworld | Original post link

With a mixed deployment, pay attention to TiKV's memory configuration. Otherwise, when system memory runs short, the Linux OOM killer will kill TiKV.

| username: Atlan | Original post link

Yes, the previous thread sank. Now the business side can't tolerate the scheduled restarts anymore.

| username: Atlan | Original post link

Is there a recommended value for this memory configuration? My host has 128G of memory, running two TiKV instances should be sufficient, right? Other programs don’t really use much memory.

| username: xfworld | Original post link

Hybrid deployment should be stress-tested in the testing environment.

Refer to these articles:

| username: tidb狂热爱好者 | Original post link

You should study the basic theory before making another post.

| username: tidb狂热爱好者 | Original post link

When an OOM occurs, the kernel kills the process with the highest memory usage. So which process is using the most memory? TiKV limits its own usage to about 80%. How much memory do your other services use?

| username: tidb菜鸟一只 | Original post link

SHOW CONFIG WHERE TYPE='tikv' AND NAME LIKE '%storage.block-cache.capacity%';
Check this parameter. If a single TiKV is deployed on one server, set it to 45% of the server’s memory. If two TiKVs are deployed on one server, set it to 22.5% of the server’s memory. Additionally, if two TiKVs are deployed on one server, you can bind each TiKV to a single NUMA node by specifying numa_node to prevent mutual interference.
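As a rough sketch of that sizing rule (45% of host memory split evenly across instances; the cluster name and the `tiup` steps below are illustrative, not from the thread):

```shell
# Hedged sketch: compute the suggested block-cache size when several
# TiKV instances share one host (45% of host memory, split evenly).
HOST_MEM_GB=128   # this thread's hosts
INSTANCES=2       # two TiKV instances per host
CACHE_GB=$(( HOST_MEM_GB * 45 / 100 / INSTANCES ))
echo "${CACHE_GB}GiB"   # prints 28GiB

# Apply it in the cluster topology and rolling-reload TiKV (cluster
# name is hypothetical; numa_node can be set per instance in the
# same file to bind each TiKV to one NUMA node):
# tiup cluster edit-config my-cluster   # storage.block-cache.capacity: "28GiB"
# tiup cluster reload my-cluster -R tikv
```

The integer division rounds down slightly, which is the safe direction for a memory cap.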

| username: Atlan | Original post link

I’ll take a look, thank you.

| username: Atlan | Original post link

Okay, thank you. I’ll take a look.

| username: Atlan | Original post link

I don’t quite understand what you mean. The total memory of my business is less than 20G.

| username: xfworld | Original post link

Memory usage will accumulate, and if it’s a mixed deployment, it’s best to use cgroup for isolation.
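For the cgroup isolation suggested above, one low-effort option is to let systemd set the limit on the business side; a minimal sketch (the service and binary names are hypothetical):

```shell
# Cap a co-located business service so it cannot squeeze TiKV;
# systemd translates MemoryMax into a cgroup memory limit.
sudo systemctl set-property biz-app.service MemoryMax=20G

# Or start a one-off process inside a limited scope:
sudo systemd-run --scope -p MemoryMax=20G /opt/biz/app
```

This avoids managing raw cgroup hierarchies by hand, at the cost of depending on systemd.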

| username: Atlan | Original post link

Using cgroup isolation would make operations a bit more complex for us.

| username: Atlan | Original post link

We originally had slow disk writes but couldn't add more machines, so we added disks and co-located two TiKV instances per host. Write latency on the SAS disks is very high.

| username: Atlan | Original post link

We're using the default cache size.

| username: h5n1 | Original post link

TiKV's default block cache size is about 45% of system memory. Set the size manually; try running it at 32 GB first.
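A hedged sketch of applying that online (TiDB's `SET CONFIG` statement can adjust `storage.block-cache.capacity` without a restart in v6.x; the host, port, and credentials below are placeholders):

```shell
# Shrink the block cache on all TiKV instances to 32 GiB, online:
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SET CONFIG tikv \`storage.block-cache.capacity\`='32GiB';"

# Verify the change took effect on every instance:
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SHOW CONFIG WHERE TYPE='tikv' AND NAME='storage.block-cache.capacity';"
```

Note that `SET CONFIG` changes are not persisted across restarts; also write the value into the cluster topology so it survives a reload.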