TiKV node fails to start, forced offline fails and status shows "offline"

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV节点无法启动,强制下线失败显示offline

| username: Ann_ann

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] TiKV node fails to start after a machine reboot with an error.

The system log shows an OOM (out of memory) message.

I forced this TiKV node offline; its current status is Offline.

The business log shows a java.sql.SQLException: TiKV server is busy error.

Checking the PD log:
[WARN] [cluster.go:607] ["store may not turn into Tombstone, there are no extra up node has enough space to accommodate the extra replica"] [store="id:5 address:"xxx:20160" state:Offline version:"3.0.16" "]

How can I force this TiKV node offline, or how else can I repair the cluster (besides upgrading)?

| username: tidb菜鸟一只 | Original post link

How many TiKV nodes are running now?

| username: 像风一样的男子 | Original post link

Check the cluster status. Is PD still alive?

| username: TiDBer_小阿飞 | Original post link

PD has already told you the cause: none of the remaining up nodes has enough space to take the offline node's replicas, and on top of that there is your OOM issue.

| username: dba远航 | Original post link

It feels like the failure is caused by insufficient space, which ends up taking the service down.

| username: 小龙虾爱大龙虾 | Original post link

Be careful when operating in the production environment.

| username: xingzhenxiang | Original post link

--force can forcibly remove a node like this that can no longer be contacted, but in a production environment you need to weigh the safety implications before doing it.
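As a rough sketch (cluster name, node address, and PD address are placeholders): on a TiUP-managed cluster the forced removal would be scale-in with --force, while on a v3.0 Ansible deployment there is no TiUP, so the closest equivalent is asking PD to delete the store via pd-ctl.

tiup cluster scale-in <cluster-name> --node <tikv-ip>:20160 --force   # TiUP only: skips data migration, so only if the data is already safe elsewhere
pd-ctl -u "http://<pd-host>:2379" -d store delete 5                   # v3.0: store id 5 comes from the PD log above; this marks the store Offline and migrates its regions off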

| username: 随缘天空 | Original post link

The error message indicates that it may be caused by insufficient disk space or improper node configuration. Open the dashboard monitoring panel and check the remaining disk space of each node.

| username: zhanggame1 | Original post link

Run display to check the status.
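Presumably that means something like the following (cluster name and PD address are placeholders; since this cluster is v3.0 on Ansible, querying PD directly is the closest equivalent):

tiup cluster display <cluster-name>            # TiUP-managed clusters
pd-ctl -u "http://<pd-host>:2379" -d store     # v3.0: lists every store and its state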

| username: Ann_ann | Original post link

There were 4 in total; one died, and three are still running.

| username: Ann_ann | Original post link

Yes, it’s still alive.

| username: Ann_ann | Original post link

Then just expand the disk?

| username: Ann_ann | Original post link

Disk space is sufficient, and memory has also been expanded.

| username: Ann_ann | Original post link

This is version 3, so there is no Dashboard. I checked the disk and it is sufficient.

| username: TiDBer_小阿飞 | Original post link

Your TiKV node's total capacity is 983 GB and 422 GB is available, which means 983 - 422 = 561 GB is in use.

So your other three nodes need enough free space to absorb this 561 GB of data. If they have enough, then you need to look elsewhere.
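For reference, the per-store capacity and available figures can be read straight from PD; a rough sketch, assuming pd-ctl's JSON field names and a placeholder PD address:

pd-ctl -u "http://<pd-host>:2379" -d store \
  | jq '.stores[] | {addr: .store.address, state: .store.state_name, capacity: .status.capacity, available: .status.available}'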

| username: 像风一样的男子 | Original post link

It looks like the disk read/write permissions are causing the error.

| username: TiDBer_小阿飞 | Original post link

Let’s check the TiKV dashboard in Grafana. Are there any anomalies in the Cluster and Errors panels?

| username: tidb菜鸟一只 | Original post link

Then either expand the disk space of the three surviving TiKV nodes or add another TiKV node. Otherwise the regions on the failed TiKV node have nowhere to migrate, and it can never finish going offline.
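For the v3.0 Ansible route, adding a TiKV node is roughly as follows (the IP is a placeholder; the scale-tidb-using-ansible doc is the authoritative reference): add the new host under [tikv_servers] in inventory.ini, then run

ansible-playbook bootstrap.yml -l <new-tikv-ip>
ansible-playbook deploy.yml -l <new-tikv-ip>
ansible-playbook start.yml -l <new-tikv-ip>
ansible-playbook rolling_update_monitor.yml --tags=prometheus   # refresh monitoring targets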

| username: 随缘天空 | Original post link

By the way, your version is quite old; the latest is already version 7.

| username: Kongdom | Original post link

If the cluster is otherwise healthy, can this offline TiKV node be forcibly scaled in? There is no forced scale-in procedure in the 3.0 documentation, though, so I'm not sure whether jumping straight to step 3 would work.
https://docs-archive.pingcap.com/zh/tidb/v3.0/scale-tidb-using-ansible#缩容-tikv-节点

That said, judging from the error message, PD is still trying to migrate data, so it's best not to force the scale-in. Expand the disk space first and see whether the node can then be scaled in and taken offline normally; a sketch of the remaining pd-ctl steps follows the quoted warning below.

store may not turn into Tombstone, there are no extra up node has enough space to accommodate the extra replica
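Roughly, the rest of that procedure, given the store (id 5 in the PD log) is already Offline, is to wait for PD to finish migrating its regions and then clean up the node; a sketch with placeholder addresses:

pd-ctl -u "http://<pd-host>:2379" -d store 5                    # poll until state_name turns Tombstone (needs enough free space on the other stores)
ansible-playbook stop.yml -l <tikv-node-ip>                     # then stop the node and remove it from inventory.ini
pd-ctl -u "http://<pd-host>:2379" -d store remove-tombstone     # optional, if this pd-ctl version supports it: clear the Tombstone record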