Why does tidb-server keep restarting?

translator_bot · June 21, 2024, 9:55pm

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb-server不停重启，为什么

| username: 逍遥_猫

[Test Environment] TiDB
[TiDB Version] v6.1.1
[Reproduction Path] After shutting down the virtual machine and restarting, the tidb-server remains down
[Encountered Problem: Phenomenon and Impact]
After shutting down and restarting the virtual machine, the tidb-server remains down. Only one node is deployed for tidb-server, while pd and tikv each have three nodes.
The tidb-server is deployed on node B. On inspection, the tidb-server process is always present, but it restarts automatically about every 2 minutes. Even if the process is killed, it comes back.
This is a newly deployed cluster with no data. Before restarting, the command tiup cluster clean tidb1 --all was executed.
Memory and CPU are sufficient.

Seeking advice from experts, what could possibly cause the tidb-server to keep restarting?
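One way to confirm the restart cadence described above is to list the startup timestamps recorded in the tidb-server log. The sketch below assumes that tidb-server writes a "Welcome to TiDB." line on each start and that every log line begins with a bracketed timestamp; the function name `startup_times` is illustrative. The process reappearing after a kill is also consistent with a service manager restart policy (TiUP-deployed components normally run as systemd services), which is an assumption worth verifying on the node.

```shell
#!/bin/sh
# startup_times FILE: print the timestamp of every tidb-server start
# recorded in FILE, one per line.
# Assumes each start logs a line containing "Welcome to TiDB." and that
# the line begins with a timestamp in square brackets.
startup_times() {
    sed -n 's/^\[\([^]]*\)].*"Welcome to TiDB\.".*/\1/p' "$1"
}
```

Usage, with a hypothetical log path (the real one depends on the deploy directory): `startup_times /path/to/deploy/tidb-4000/log/tidb.log | tail -n 5`. Timestamps spaced roughly two minutes apart would match the behavior reported here.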

translator_bot · June 21, 2024, 9:55pm

| username: zhanggame1 | Original post link

Check the TiDB logs, and also check if the memory is insufficient.
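A minimal sketch of this check: pull the most recent error-level lines from the tidb-server log, then look at free memory. It assumes the log level appears in square brackets (for example `[ERROR]` or `[FATAL]`); the function name `recent_errors` is illustrative.

```shell
#!/bin/sh
# recent_errors FILE [N]: print the last N (default 20) lines of FILE
# whose log level is ERROR or FATAL.
# Assumes the level is written in square brackets, e.g. [ERROR].
recent_errors() {
    file=$1
    count=${2:-20}
    grep -E '\[(ERROR|FATAL)\]' "$file" | tail -n "$count"
}
```

Usage, with a hypothetical log path (the real one depends on the deploy directory): `recent_errors /path/to/deploy/tidb-4000/log/tidb.log 50`, followed by `free -h` to see available memory on the node.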

translator_bot · June 21, 2024, 9:55pm

| username: 逍遥_猫 | Original post link

Memory and CPU are sufficient, but this TiDB error is quite strange.

In fact, disk usage on all three nodes is below 50%.

translator_bot · June 21, 2024, 9:55pm

| username: 我是咖啡哥 | Original post link

df -ih
Check if the inodes are full.
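To make the inode check repeatable across nodes, the output can be filtered for filesystems above a threshold. This is a sketch that assumes GNU `df` on Linux, where `df -iP` prints IUse% in the fifth column and the mount point in the sixth; the function name `inodes_over` and the 90 percent threshold are illustrative, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# inodes_over THRESHOLD: read `df -iP` output on stdin and print the mount
# point and inode usage of every filesystem at or above THRESHOLD percent.
inodes_over() {
    awk -v limit="$1" '
        NR == 1 { next }                 # skip the header line
        {
            use = $5
            sub(/%$/, "", use)           # strip the trailing percent sign
            # filesystems that report "-" instead of a number are skipped
            if (use ~ /^[0-9]+$/ && use + 0 >= limit + 0)
                print $6, $5
        }'
}
```

Usage: `df -iP | inodes_over 90`. Empty output means no filesystem has reached the threshold.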

translator_bot · June 21, 2024, 9:55pm

| username: 逍遥_猫 | Original post link

translator_bot · June 21, 2024, 9:55pm

| username: zhanggame1 | Original post link

If the TiKV disk is reported as full, rather than TiKV merely being slow, there may be a problem with TiKV itself. Check the TiKV logs.

translator_bot · June 21, 2024, 9:55pm

| username: 逍遥_猫 | Original post link

TiKV log error

Why does it report that the region has no leader after running clean --all or clean --data?
I checked with pd-ctl, and the state is indeed what the error message describes.

translator_bot · June 21, 2024, 9:55pm

| username: ShawnYan | Original post link

Virtual machine? Any changes in the network? Is communication with PD normal?

translator_bot · June 21, 2024, 9:55pm

| username: Kongdom | Original post link

Is the space size in K?

translator_bot · June 21, 2024, 9:55pm

| username: Kongdom | Original post link

Oh, I remember a case where a cluster had frequent additions and deletions, and monitoring eventually reported insufficient space even though the physical disk had plenty. In the end, I think either restarting the cluster or scaling it in or out brought the monitoring back to normal.

It was like the monitoring showed unreleased space, but the physical space was actually released. It felt like the monitoring statistics were not updated.

translator_bot · June 21, 2024, 9:55pm

| username: Fly-bird | Original post link

Try restarting each node one by one.

translator_bot · June 21, 2024, 9:55pm

| username: Kongdom | Original post link

I looked into it, and that was the case. In the end, we had to rebuild the cluster~

translator_bot · June 21, 2024, 9:55pm

| username: tidb菜鸟一只 | Original post link

You can check TiKV disk usage on the Grafana monitoring page to see whether the cluster was cleaned up properly.