TiKV Randomly Restarts

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv不定时重启

| username: liujq132

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.1.1
[Reproduction Path] Operations performed that led to the issue
Two servers, each with 3 TiKV nodes. Intermittent database write failures with the error message (8027, ‘Information schema is out of date: schema failed to update in 1 lease, please make sure TiDB can connect to TiKV’). Later, it was discovered that a TiKV node had restarted.
[Encountered Issue: Symptoms and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
Logs of the two restarted TiKV nodes.zip (4.9 MB)
TiDB Dashboard Diagnosis Report.html (883.0 KB)

| username: zhanggame1 | Original post link

Do the TiKV nodes have any logs?

| username: liujq132 | Original post link

Hello, I have updated the post with the logs. I'm not even at a beginner level in this field. The company is moving away from IOE and setting up a TiDB cluster. The cluster originally consisted of 4 servers (one virtual machine and three Inspur Haiguang physical machines). Previously, the servers crashed frequently: when large amounts of data were written, Grafana showed I/O constantly at 100%, and after a while the system would crash. It then seemed that the 5400 RPM mechanical disks could not keep up, so we added two Huawei servers and moved TiKV onto them. The two machines run 6 TiKV instances between them, and although they still use mechanical disks, these spin at 10,000 RPM, which is somewhat better than before. However, recently we have also started seeing TiKV restarts.

| username: zhanggame1 | Original post link

Mechanical drives are not very suitable; distributed databases have much higher IO requirements than regular databases. You need to use SSDs, preferably NVMe ones.

| username: liujq132 | Original post link

Are there any parameters or settings that can be adjusted to allow it to be slow but stable without restarting?

| username: 胡杨树旁 | Original post link

Check the system logs. Was the restart caused by an OOM (Out of Memory) kill? Also check the Grafana monitoring to see the memory usage of the restarted node.
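
If you want to confirm an OOM kill, a minimal sketch of checking the kernel logs on the TiKV host is below; the paths and service setup are the usual Linux defaults and may differ on your systems:

```bash
# Look for OOM-killer events in the kernel ring buffer (human-readable timestamps)
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"

# On systemd hosts, the journal also keeps kernel messages
journalctl -k --since "2 days ago" | grep -i oom

# Older syslog-based systems usually record the same events here
grep -i "out of memory" /var/log/messages
```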

| username: tidb菜鸟一只 | Original post link

Check the value of the storage.block-cache.capacity parameter for each TiKV node. Additionally, check the total memory of the host shared by the three TiKV nodes and whether NUMA resource isolation has been implemented.
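
For reference, a rough sketch of how one could check these (the connection parameters are placeholders, not cluster-specific values):

```bash
# Current block-cache capacity of every TiKV instance, queried through any TiDB server
mysql -h 127.0.0.1 -P 4000 -u root -e \
  "SHOW CONFIG WHERE type='tikv' AND name='storage.block-cache.capacity';"

# Total and used memory on the host shared by the TiKV instances
free -g

# CPU affinity of the running tikv-server processes,
# a quick hint of whether NUMA binding is in effect
pidof tikv-server | xargs -r -n1 taskset -cp
```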

| username: liujq132 | Original post link

The memory of both servers is almost fully used.

| username: liujq132 | Original post link

The default value of tidb_dml_batch_size is 2000. You can adjust this parameter to control the number of rows in a single batch of DML operations.
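
A small sketch of checking and adjusting it (the value 1000 is only an illustration, not a recommendation; connection details are placeholders):

```bash
# Inspect the current value through any TiDB server
mysql -h 127.0.0.1 -P 4000 -u root -e "SHOW VARIABLES LIKE 'tidb_dml_batch_size';"

# Inside your own client session you could then lower it, e.g.:
#   SET SESSION tidb_dml_batch_size = 1000;
```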

| username: tidb菜鸟一只 | Original post link

There is definitely a problem with this. Your machine only has 256 GB of memory and runs three TiKV nodes, so setting storage.block-cache.capacity on each TiKV node to roughly 256 GB / 3 × 0.45 ≈ 38.4 GB is more reasonable. Additionally, you can bind each TiKV instance to a NUMA node through tiup cluster edit-config cluster_name, which prevents resource contention among the three nodes; see the sketch below.
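
A rough sketch of applying the capacity change, assuming a cluster named tidb-cluster and the 38 GB figure above (verify against your own topology before using):

```bash
# Option 1: adjust the block cache online via SQL; this applies to all TiKV instances
# but is not persisted across restarts
mysql -h 127.0.0.1 -P 4000 -u root -e \
  'SET CONFIG tikv `storage.block-cache.capacity` = "38GB";'

# Option 2: persist the setting in the cluster topology and roll it out
tiup cluster edit-config tidb-cluster     # set storage.block-cache.capacity under the TiKV config
tiup cluster reload tidb-cluster -R tikv
```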

| username: liujq132 | Original post link

now.config (7.8 KB) — the cluster configuration has been uploaded.

```
[root@tidb-118 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:              4
CPU MHz:               2499.975
CPU max MHz:           3000.0000
CPU min MHz:           800.0000
BogoMIPS:              4400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
```

Both physical machines only have 2 NUMA nodes, so is it not feasible to bind three TiKV instances? If we add another server and run two TiKV instances per machine, would that be okay?

| username: tidb菜鸟一只 | Original post link

Yes, it is generally not recommended to deploy multiple TiKV instances on a single machine. If resources are tight, you can deploy two instances, binding one to NUMA node 0 and the other to NUMA node 1 to isolate memory and CPU. Additionally, it is best to mount the two TiKV instances on different disks to isolate I/O. This way, you can minimize the impact between the two TiKV instances.
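
A rough sketch of what the relevant part of the topology might look like after tiup cluster edit-config (the hosts, ports, and paths below are made up for illustration; numactl needs to be installed on the hosts for numa_node binding to take effect):

```bash
tiup cluster edit-config tidb-cluster
# The tikv_servers section could then look roughly like this
# (two instances on one host, each bound to its own NUMA node and its own disk):
#
#   tikv_servers:
#     - host: 192.168.0.11
#       port: 20160
#       status_port: 20180
#       data_dir: /data1/tikv-20160
#       numa_node: "0"
#     - host: 192.168.0.11
#       port: 20161
#       status_port: 20181
#       data_dir: /data2/tikv-20161
#       numa_node: "1"

# Roll the change out to the TiKV instances afterwards
tiup cluster reload tidb-cluster -R tikv
```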

| username: liujq132 | Original post link

For now it seems to be resolved. The cause was most likely that storage.block-cache.capacity had been set to 100 GB, which led to the OOM. After reducing it, there have been no restarts so far.

| username: Anna | Original post link

storage.block-cache.capacity

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.