[TiDB Usage Environment] Production environment
[TiDB Version] TiDB v3.0.3
[Encountered Issue] Two KV nodes (each 1T) were taken offline without adding new disks beforehand, causing other nodes to become full and the cluster to experience access anomalies.
[Reproduction Path]
Added a 1T SSD KV node, a 128G KV node, and a 1T HDD KV node;
Deleted the log files of the overloaded KV node and the LOG.old files in data/db/, then restarted the node, but disk usage remained high;
The newly added 1T SSD KV node and the 128G KV node are now also full, causing the cluster to disconnect. After deleting their log files and the LOG.old files in data/db/, usage is still at 100%; these two newly added KV nodes cannot start, and the database is inaccessible;
Adjusted replica-schedule-limit to 64 to speed up the offline process;
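For reference, a change like the one in the last step would normally be made through pd-ctl; a minimal sketch, assuming the single-command (-d) form and a placeholder PD address:
# raise the replica scheduling concurrency (64 is the value used above)
pd-ctl -u http://<pd_ip>:2379 -d config set replica-schedule-limit 64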
[Issue Phenomenon and Impact]
The two offline KV nodes have been offline for 6 days, with 8G of data remaining. The migration speed is particularly slow: offloading less than 1G of data takes about 10 hours;
Check the output of pd-ctl config show. Scaling in and out involves migrating regions, which generates a lot of IO, so raising the concurrency limits will put even more pressure on the disks.
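For reference, a minimal way to dump that configuration (PD address is a placeholder); the fields most relevant to decommission speed are the *-schedule-limit values plus max-snapshot-count and max-pending-peer-count:
# print the current PD scheduling configuration
pd-ctl -u http://<pd_ip>:2379 -d config show
# look at: leader-schedule-limit, region-schedule-limit, replica-schedule-limit,
#          max-snapshot-count, max-pending-peer-count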
We are now planning to use
operator add transfer-region
operator add transfer-leader
to transfer the regions and leaders off the KV nodes being decommissioned and speed up the decommission. Is this operation feasible?
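For reference, these pd-ctl operators take an explicit region ID and target store ID(s); a hedged sketch with placeholder IDs:
# move the leader of region 123 to store 4
operator add transfer-leader 123 4
# relocate region 123 so that its replicas end up on stores 4, 5 and 6
operator add transfer-region 123 4 5 6
# (run inside pd-ctl, or prefix with: pd-ctl -u http://<pd_ip>:2379 -d)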
Once this issue is resolved, we will upgrade the version. Please take a look at the current issue first; it’s a production environment and quite urgent. Thank you.
I'm not sure whether this version has the placeholder file (place_holder_file). If it does, adjust it.
[storage]
Change reserve-space = "5GB" to 0 to free up some space.
Then, the 2TB of space that was offline will be available.
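A minimal sketch of that check, assuming the file name and config key used by later TiKV versions (space_placeholder_file under the data directory, storage.reserve-space in tikv.toml); verify both against v3.0.3 before changing anything:
# check whether a placeholder file is reserving space in the TiKV data directory
ls -lh <tikv-data-dir>/space_placeholder_file
# if it exists, removing it frees the reserved space immediately
rm <tikv-data-dir>/space_placeholder_file
# and in tikv.toml, stop reserving space on the next restart:
# [storage]
# reserve-space = "0MB"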
Next, check the replicas to ensure there are three replicas.
If you confirm that there are three normal replicas, you can force delete the full node. Alternatively, ensure that the Regions on the full node have two normal replicas on other nodes, then you can delete it. Be very careful with this step.
curl -X POST "http://{pdip}:2379/pd/api/v1/store/${store_id}/state?state=Tombstone"
Regions that are left with only two replicas will then have the missing replica replenished on the newly started TiKV nodes.
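Before the Tombstone call above, one way to sanity-check replica health from pd-ctl (addresses and store ID are placeholders):
# regions missing a replica or carrying a down replica
pd-ctl -u http://<pd_ip>:2379 -d region check miss-peer
pd-ctl -u http://<pd_ip>:2379 -d region check down-peer
# regions that still have a peer on the full store; inspect their peer lists
pd-ctl -u http://<pd_ip>:2379 -d region store <store_id>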
The offline TiKV nodes have gone from 2TB of data down to 8GB.
Currently, if the nodes with full disks can come back online and the database can resume serving, I'm fine with just waiting for the decommissioning KV nodes to finish going offline, even if it takes longer.
If the node with a full disk cannot come back online, how can I speed up the offline process to proceed with subsequent operations?
I have set scheduler add evict-leader-scheduler, but I observed that the number of leaders on the offline KV node has not decreased. Is it okay to manually move regions off it using operator add transfer-region?
For the third step, I saw online that manual removal is not recommended and that we should wait for automatic offline. However, the current offline speed is extremely slow. How can I troubleshoot this or adjust parameters to speed it up?
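For context, the evict-leader scheduler takes a store ID, and the offline speed is mostly bounded by a few PD scheduling limits; a hedged sketch with placeholder values (keeping in mind the earlier warning that more concurrency means more disk IO):
# evict leaders from the decommissioning store
scheduler add evict-leader-scheduler <store_id>
# knobs that bound how fast replicas are moved off the store
config set replica-schedule-limit 64
config set region-schedule-limit 64
config set leader-schedule-limit 32
config set max-snapshot-count 16
config set max-pending-peer-count 64
# (run inside pd-ctl, or prefix each line with: pd-ctl -u http://<pd_ip>:2379 -d)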
The nodes are completely full, not even a byte left, right? Then there's no way to move anything. Even if you migrate regions out, the deletes go through RocksDB, and RocksDB implements deletion by appending tombstone records (the space is only reclaimed later by compaction), so even deleting needs free disk space.
Raft peers also need to append Raft log entries when they communicate, which likewise requires disk space.
The node with the full disk is most likely beyond saving. Check whether the regions on it have the majority of their replicas on other nodes. If they do, you can physically delete it.
If the data is very important, first shut down the machine and copy the 128GB of data to another machine with a larger hard drive, then start TiKV. At this point, there might be errors due to IP addresses or other issues. If you plan to do this, check in advance how to modify the store metadata in PD.
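A rough sketch of the copy step only, assuming rsync between the two machines (paths and host are placeholders); the PD store-metadata adjustment mentioned above is version specific and is not covered here:
# stop TiKV on the full node first so the data files are consistent,
# then copy the whole data directory to the machine with the larger disk
rsync -avH --progress <tikv-deploy-dir>/data/ <new_host>:<tikv-deploy-dir>/data/
# start TiKV on the new machine against the copied directory only after
# confirming how PD should be informed of the address change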
Is physical deletion safe? I see that the cluster is currently performing the remove-pending-down-replica operation on this downed KV node. What is the purpose of this?
The prerequisite for physical deletion is that every region still has 2 normal replicas on other nodes; only then is it considered safe.
I couldn’t find relevant information about remove-pending-down-replica, so I’m not clear about the specific logic.
Manually removing the peer from the offline node has been completed, but now a new issue has arisen. The database can occasionally read but cannot write.
Currently, there are still two stores in the cluster that are down due to full disks. In the region-health panel, down-peer-region-count and pending-peer-region-count have dropped from 67.5K to 4.59K but are no longer decreasing, while miss-peer-region-count is currently dropping sharply. Should we just wait for the cluster to balance out before it returns to normal?
You haven't force-deleted anything, right? It's just that two stores are down, which is causing the cluster to fail to start, right?
If that’s the case, then just be patient and wait.
You can use tidb-ctl to check which regions each table has, and then verify whether those regions are healthy.
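As an alternative to tidb-ctl, the TiDB status port exposes the same information over HTTP; a minimal sketch (host, port, database and table names are placeholders):
# list the regions that back a given table, with their leaders and peers
curl http://<tidb_ip>:10080/tables/<db_name>/<table_name>/regions
# then look up any suspicious region ID in pd-ctl
pd-ctl -u http://<pd_ip>:2379 -d region <region_id>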
Also, do you really have no other options for the full disks? Can't you copy the data to another machine and start TiKV there? Only the IP changes, so there should be a way to recover; look up the exact procedure before you attempt it.
There are also ways to expand the disk. If you are using LVM, you can simply add more space. If that isn't possible, you can mount a remote directory over NFS and copy the 128G data directory onto the remote disk. There are many options.
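A hedged sketch of both expansion options (device names, sizes, servers and mount points are placeholders; the filesystem-resize command depends on whether the volume is ext4 or XFS):
# LVM: grow the logical volume, then grow the filesystem on it
lvextend -L +200G /dev/<vg_name>/<lv_name>
resize2fs /dev/<vg_name>/<lv_name>        # ext4; for XFS use: xfs_growfs <mount_point>
# NFS: mount a remote directory and copy the 128G data directory onto it
mount -t nfs <nfs_server>:/<export_path> /mnt/tikv_data
cp -a <tikv-deploy-dir>/data /mnt/tikv_data/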