TiKV Service Unexpectedly Restarts

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV服务莫名重启

| username: TiDBer_ssvwtrcq

[TiDB Usage Environment] Testing
[TiDB Version] v6.5.2
[Reproduction Path] The cluster had been deployed for less than 24 hours. Only the TPC-H test and the data generation part of the TPC-C test had been run on it. After the cluster sat idle overnight, 2 KV nodes restarted on their own at 8:49 AM.
Investigation turned up the following entries in the operating system logs:

Apr 25 08:49:04 tikv119 kernel: Out of memory: Kill process 4583 (tikv-server) score 917 or sacrifice child 
Apr 25 08:49:04 tikv119 kernel: Killed process 4583 (tikv-server), UID 0, total-vm:44109576kB, anon-rss:30524348kB, file-rss:520kB, shmem-rss:0kB 
Apr 25 08:49:08 tikv119 systemd: tikv-20160.service: main process exited, code=killed, status=9/KILL 
Apr 25 08:49:08 tikv119 systemd: Unit tikv-20160.service entered failed state. 
Apr 25 08:49:08 tikv119 systemd: tikv-20160.service failed. 
Apr 25 08:49:23 tikv119 systemd: tikv-20160.service holdoff time over, scheduling restart. 
Apr 25 08:49:23 tikv119 systemd: Stopped tikv service. 
Apr 25 08:49:23 tikv119 systemd: Started tikv service. 
Apr 25 08:49:23 tikv119 bash: sync ... 
Apr 25 08:49:23 tikv119 bash: real#0110m0.003s 
Apr 25 08:49:23 tikv119 bash: user#0110m0.000s 
Apr 25 08:49:23 tikv119 bash: sys#0110m0.001s 
Apr 25 08:49:23 tikv119 bash: ok
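
To make the kernel's figures easier to read, here is a minimal Python sketch that pulls the memory numbers out of the "Killed process" line above and converts them to GiB. The log line is copied from this post; the regular expression and the helper names (`parse_oom_kill`, `kb_to_gib`) are illustrative assumptions, not part of any TiDB tooling, and the pattern is only matched against this one line format.

```python
import re

# Kernel OOM-killer line copied verbatim from the log above.
LINE = (
    "Apr 25 08:49:04 tikv119 kernel: Killed process 4583 (tikv-server), UID 0, "
    "total-vm:44109576kB, anon-rss:30524348kB, file-rss:520kB, shmem-rss:0kB"
)

# Assumed pattern for this specific kernel message format.
PATTERN = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\).*?"
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, process name, and memory figures in kB, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "pid": int(fields["pid"]),
        "name": fields["name"],
        "total_vm_kb": int(fields["total_vm"]),
        "anon_rss_kb": int(fields["anon_rss"]),
        "file_rss_kb": int(fields["file_rss"]),
    }


def kb_to_gib(kb):
    """Convert kilobytes (as reported by the kernel) to GiB."""
    return kb / (1024 * 1024)


info = parse_oom_kill(LINE)
print(f"{info['name']} (pid {info['pid']}) "
      f"anon-rss: {kb_to_gib(info['anon_rss_kb']):.2f} GiB")
```

For the line above this reports an anonymous RSS of about 29.11 GiB for `tikv-server` at the moment it was killed.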

[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration]
1 Server, 1 PD node, 3 KV nodes.
[Attachments: Screenshots/Logs/Monitoring]
PD node logs
pd.log (3.2 KB)

Server node logs
tidb-server.log (13.6 KB)

Logs of one of the KV nodes
tikv119.log (3.3 MB)

| username: xfworld | Original post link

Out of memory…

| username: TiDBer_ssvwtrcq | Original post link

I know it’s OOM, but the reason is unclear. There were no query operations all night.

| username: TiDBer_ssvwtrcq | Original post link

I suspect the out-of-memory condition is caused by a system bug.

| username: tidb菜鸟一只 | Original post link

SHOW CONFIG WHERE type='tikv' AND name LIKE '%storage.block-cache.capacity%'; -- Check what this parameter is set to

| username: TiDBer_ssvwtrcq | Original post link

The physical memory configuration of the KV machine is 32GB.
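
For a rough sense of scale, the sketch below compares the memory figure from the OOM log with what the block cache would be if `storage.block-cache.capacity` were left at its documented default of 45% of total system memory. Whether this cluster actually used the default is not stated in the thread, and treating "32GB" as 32 GiB is an assumption, so read this as a back-of-the-envelope check rather than a diagnosis.

```python
def default_block_cache_gib(total_memory_gib, ratio=0.45):
    """Estimate the block cache size, assuming the default 45% ratio applies."""
    return total_memory_gib * ratio


TOTAL_MEMORY_GIB = 32      # from the thread; "32GB" treated as 32 GiB (assumption)
ANON_RSS_KB = 30524348     # anon-rss of tikv-server from the OOM log in the first post

cache_gib = default_block_cache_gib(TOTAL_MEMORY_GIB)
rss_gib = ANON_RSS_KB / (1024 * 1024)

print(f"estimated default block cache: {cache_gib:.1f} GiB")
print(f"tikv-server RSS when killed:   {rss_gib:.1f} GiB "
      f"({rss_gib / TOTAL_MEMORY_GIB:.0%} of physical memory)")
```

Under these assumptions the process held roughly twice the estimated default block cache when it was killed, so the value returned by the `SHOW CONFIG` query above is worth comparing against this estimate.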

| username: tidb菜鸟一只 | Original post link

Are there other processes running on TiKV? It’s not a mixed deployment, right?

| username: TiDBer_ssvwtrcq | Original post link

The logs I provided may not line up in time. I just discovered a 30-minute time difference between the KV node and the server node. It is not a mixed deployment.

| username: TiDBer_ssvwtrcq | Original post link

The KV node logs are accurate. The KV node's clock is 30 minutes behind the PD and server nodes.
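
A skew like this can be spotted by comparing clock readings taken from each node at the same moment. The Python sketch below is only an illustration: the node names other than `tikv119`, the timestamps (including the year), and the one-second tolerance are all made-up values, and how the readings are collected is left out.

```python
from datetime import datetime, timedelta


def clock_offsets(reference, readings):
    """Offset of each node's clock relative to the reference (node minus reference)."""
    return {node: ts - reference for node, ts in readings.items()}


def skewed_nodes(offsets, tolerance):
    """Names of nodes whose absolute offset exceeds the tolerance, sorted."""
    return sorted(node for node, off in offsets.items() if abs(off) > tolerance)


# Hypothetical readings, all supposedly taken at the same instant.
reference = datetime(2023, 4, 25, 9, 19, 4)              # e.g. the PD node's clock
readings = {
    "pd": datetime(2023, 4, 25, 9, 19, 4),
    "tidb-server": datetime(2023, 4, 25, 9, 19, 4),
    "tikv119": datetime(2023, 4, 25, 8, 49, 4),          # 30 minutes behind
}

offsets = clock_offsets(reference, readings)
for node in skewed_nodes(offsets, tolerance=timedelta(seconds=1)):
    print(f"{node}: offset {offsets[node].total_seconds() / 60:+.0f} min")
```

When correlating logs across nodes, the same offsets can be added back to the KV node's timestamps so that events line up with the PD and server logs.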

| username: ffeenn | Original post link

What is the actual memory usage on the server right now? Check the resource usage at that point in time on the monitoring system.

| username: xfworld | Original post link

How did you install your cluster? TiUP performs an environment check, and if the environment does not pass, the cluster cannot be installed…

Such a big time difference…

| username: TiDBer_ssvwtrcq | Original post link

The server motherboard’s clock had an issue. We have rebuilt the cluster. Thank you, everyone.