[TiDB Usage Environment] Production environment
[TiDB Version] TiDB v3.0.3
[Encountered Issue] Two KV nodes (each 1T) were taken offline without adding new disks beforehand, causing other nodes to become full and the cluster to experience access anomalies.
[Reproduction Path]
Added a 1T SSD KV node, a 128G KV node, and a 1T HDD KV node;
Deleted the log files of the overloaded KV node and the LOG.old files in data/db/, then restarted the node, but disk usage remained high;
The newly added 1T SSD KV node and the 128G KV node are now also full, causing the cluster to disconnect. After deleting their log files and the LOG.old files in data/db/, usage is still at 100%; these two newly added KV nodes cannot start, and the database is inaccessible;
Adjusted replica-schedule-limit to 64 to speed up the offline process;
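For reference, a change like the one in the last step would normally be made through pd-ctl; a minimal sketch, assuming the single-command (-d) form and a placeholder PD address:
# raise the replica scheduling concurrency (64 is the value used above)
pd-ctl -u http://<pd_ip>:2379 -d config set replica-schedule-limit 64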
[Issue Phenomenon and Impact]
The two offline KV nodes have been offline for 6 days, with 8G of data remaining. The migration speed is particularly slow: offloading less than 1G of data takes about 10 hours;
Check the output of pd-ctl config show. Scaling in and out involves migrating regions, which generates a lot of IO, so raising the concurrency limits will put even more pressure on the disks.
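For reference, a minimal way to dump that configuration (PD address is a placeholder); the fields most relevant to decommission speed are the *-schedule-limit values plus max-snapshot-count and max-pending-peer-count:
# print the current PD scheduling configuration
pd-ctl -u http://<pd_ip>:2379 -d config show
# look at: leader-schedule-limit, region-schedule-limit, replica-schedule-limit,
#          max-snapshot-count, max-pending-peer-count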
We are now planning to use
operator add transfer-region
operator add transfer-leader
to transfer the regions and leaders off the KV nodes being decommissioned and speed up the decommission. Is this operation feasible?
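For reference, these pd-ctl operators take an explicit region ID and target store ID(s); a hedged sketch with placeholder IDs:
# move the leader of region 123 to store 4
operator add transfer-leader 123 4
# relocate region 123 so that its replicas end up on stores 4, 5 and 6
operator add transfer-region 123 4 5 6
# (run inside pd-ctl, or prefix with: pd-ctl -u http://<pd_ip>:2379 -d)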
Once this issue is resolved, we will upgrade the version. Please take a look at the current issue first; it’s a production environment and quite urgent. Thank you.
I'm not sure whether this version has the placeholder file (place_holder_file). If it does, adjust it.
[storage]
Change reserve-space = "5GB" to 0 to free up some space.
Then, the 2TB of space that was offline will be available.
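A minimal sketch of that check, assuming the file name and config key used by later TiKV versions (space_placeholder_file under the data directory, storage.reserve-space in tikv.toml); verify both against v3.0.3 before changing anything:
# check whether a placeholder file is reserving space in the TiKV data directory
ls -lh <tikv-data-dir>/space_placeholder_file
# if it exists, removing it frees the reserved space immediately
rm <tikv-data-dir>/space_placeholder_file
# and in tikv.toml, stop reserving space on the next restart:
# [storage]
# reserve-space = "0MB"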
Next, check the replicas to ensure there are three replicas.
If you confirm that there are three normal replicas, you can force delete the full node. Alternatively, ensure that the Regions on the full node have two normal replicas on other nodes, then you can delete it. Be very careful with this step.
curl -X POST "http://{pdip}:2379/pd/api/v1/store/${store_id}/state?state=Tombstone"
Regions that are left with only two replicas will then have the missing replica replenished on the newly started TiKV nodes.
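Before the Tombstone call above, one way to sanity-check replica health from pd-ctl (addresses and store ID are placeholders):
# regions missing a replica or carrying a down replica
pd-ctl -u http://<pd_ip>:2379 -d region check miss-peer
pd-ctl -u http://<pd_ip>:2379 -d region check down-peer
# regions that still have a peer on the full store; inspect their peer lists
pd-ctl -u http://<pd_ip>:2379 -d region store <store_id>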
The offline TiKV nodes have gone from 2TB of data down to 8GB.
Currently, if the nodes with full disks can come back online and the database can resume serving, I'm fine with just waiting for the decommissioning KV nodes to finish going offline, even if it takes longer.
If the node with a full disk cannot come back online, how can I speed up the offline process to proceed with subsequent operations?
I have set scheduler add evict-leader-scheduler, but I observed that the number of leaders on the offline KV node has not decreased. Is it okay to manually move regions off it using operator add transfer-region?
For the third step, I saw online that manual removal is not recommended and that we should wait for automatic offline. However, the current offline speed is extremely slow. How can I troubleshoot this or adjust parameters to speed it up?
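For context, the evict-leader scheduler takes a store ID, and the offline speed is mostly bounded by a few PD scheduling limits; a hedged sketch with placeholder values (keeping in mind the earlier warning that more concurrency means more disk IO):
# evict leaders from the decommissioning store
scheduler add evict-leader-scheduler <store_id>
# knobs that bound how fast replicas are moved off the store
config set replica-schedule-limit 64
config set region-schedule-limit 64
config set leader-schedule-limit 32
config set max-snapshot-count 16
config set max-pending-peer-count 64
# (run inside pd-ctl, or prefix each line with: pd-ctl -u http://<pd_ip>:2379 -d)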
The nodes are completely full, not even a byte left, right? Then there's no way to move anything. Even if you migrate regions out, the deletes go through RocksDB, and RocksDB implements deletion by appending tombstone records (the space is only reclaimed later by compaction), so even deleting needs free disk space.
Raft peers also need to append Raft log entries when they communicate, which likewise requires disk space.
The node with the full disk is most likely beyond saving. Check whether the regions on it have the majority of their replicas on other nodes. If they do, you can physically delete it.
If the data is very important, first shut down the machine and copy the 128GB of data to another machine with a larger hard drive, then start TiKV. At this point, there might be errors due to IP addresses or other issues. If you plan to do this, check in advance how to modify the store metadata in PD.
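A rough sketch of the copy step only, assuming rsync between the two machines (paths and host are placeholders); the PD store-metadata adjustment mentioned above is version specific and is not covered here:
# stop TiKV on the full node first so the data files are consistent,
# then copy the whole data directory to the machine with the larger disk
rsync -avH --progress <tikv-deploy-dir>/data/ <new_host>:<tikv-deploy-dir>/data/
# start TiKV on the new machine against the copied directory only after
# confirming how PD should be informed of the address change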
Is physical deletion safe? I see that the cluster is currently performing the remove-pending-down-replica operation on this downed KV node. What is the purpose of this?
The prerequisite for physical deletion is that every region still has 2 normal replicas on other nodes; only then is it considered safe.
I couldn’t find relevant information about remove-pending-down-replica, so I’m not clear about the specific logic.
Manually removing the peer from the offline node has been completed, but now a new issue has arisen. The database can occasionally read but cannot write.
Currently, there are still two stores in the cluster that are down due to full disks. In the region-health panel, down-peer-region-count and pending-peer-region-count have dropped from 67.5K to 4.59K but are no longer decreasing, while miss-peer-region-count is currently dropping sharply. Should we just wait for the cluster to balance out before it returns to normal?
You haven't force-deleted anything, right? It's just that two stores are down, which is causing the cluster to fail to start, right?
If that’s the case, then just be patient and wait.
You can use tidb-ctl to check which regions each table has, and then verify whether those regions are healthy.
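As an alternative to tidb-ctl, the TiDB status port exposes the same information over HTTP; a minimal sketch (host, port, database and table names are placeholders):
# list the regions that back a given table, with their leaders and peers
curl http://<tidb_ip>:10080/tables/<db_name>/<table_name>/regions
# then look up any suspicious region ID in pd-ctl
pd-ctl -u http://<pd_ip>:2379 -d region <region_id>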
Also, do you really have no other options for the full disks? Can't you copy the data to another machine and start TiKV there? Only the IP changes, so there should be a way to recover; look up the exact procedure before you attempt it.
There are also ways to expand the disk. If you are using LVM, you can simply add more space. If that isn't possible, you can mount a remote directory over NFS and copy the 128G data directory onto the remote disk. There are many options.
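A hedged sketch of both expansion options (device names, sizes, servers and mount points are placeholders; the filesystem-resize command depends on whether the volume is ext4 or XFS):
# LVM: grow the logical volume, then grow the filesystem on it
lvextend -L +200G /dev/<vg_name>/<lv_name>
resize2fs /dev/<vg_name>/<lv_name>        # ext4; for XFS use: xfs_growfs <mount_point>
# NFS: mount a remote directory and copy the 128G data directory onto it
mount -t nfs <nfs_server>:/<export_path> /mnt/tikv_data
cp -a <tikv-deploy-dir>/data /mnt/tikv_data/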