Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: How to troubleshoot high storage async snapshot duration (storage async snapshot duration 过高如何排查)
[TiDB Usage Environment]
Production Environment
[TiDB Version]
v5.1.2
[Reproduction Path]
The cluster has 4 TiKV machines holding about 20 TB of data in total. After a 3 TB table was dropped, one of the TiKV nodes has been showing abnormal metrics for half a month, characterized by:
- Normal CPU, no thread is fully utilized
- High apply log duration (reaching seconds)
- High storage async snapshot duration
- Normal IO
- Normal physical machine disk write latency
- Rate snapshot message is 0, while other machines are in the tens
- 99.99% snapshot KV count is 0, while other machines are at 1.5 million
- Approximate Region size is 2GB, while other machines are at 200MB
Currently, whenever the load increases, slow queries appear. They take about 1.3 seconds, with essentially all of that time spent in the prewrite phase (about 1.3 seconds).
[Resource Configuration]
500MB/s SSD
32C256G
It seems that there is an issue with region splitting. On the abnormal machine, the region size keeps growing after dropping the table, while on other machines, the region size remains relatively stable.
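A minimal pd-ctl checklist to confirm this, assuming the PD address below is a placeholder and the abnormal store id is 4 (the node identified later in this thread):

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» store 4                        # region_count, leader_count and used space on the abnormal store
» region topsize 10              # the 10 largest Regions in the cluster
» region check oversized-region  # Regions that grew past the split threshold without splitting
» region check empty-region      # empty Regions left behind by the dropped table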
You can use tiup install diag to install Clinic, then collect data over a period of time and upload it.
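A rough sketch of that flow, assuming a recent diag component (the exact flags and the upload step may differ between diag versions, and uploading requires a Clinic account/token):

# install the PingCAP Clinic collector
tiup install diag
# collect monitoring data and configs for a time window (cluster name and times are placeholders)
tiup diag collect <cluster-name> -f "<start-time>" -t "<end-time>"
# upload the data set produced by the previous step
tiup diag upload <collected-data-dir>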
Refer to this: for large tables, it is generally not recommended to drop them directly; you can truncate them first to release the space.
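As a sketch of that suggestion (table and connection details are hypothetical; the space is only reclaimed gradually by GC afterwards):

# TRUNCATE first so the old data is handed to GC, then DROP the now-empty table
mysql -h <tidb-host> -P 4000 -u root -p -e "TRUNCATE TABLE mydb.big_table;"
mysql -h <tidb-host> -P 4000 -u root -p -e "DROP TABLE mydb.big_table;"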
The data collection and desensitization process is quite complex, involving financial regulations from other countries.
I found that after the large table started to be dropped, tens of thousands of empty Regions appeared and could not be removed. The Region size on this machine keeps growing, and there are no errors in the TiKV log. I suspect the Regions have become too large, which causes problems when snapshots are generated.
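If the leftover empty Regions are not being merged away, one thing worth checking (a sketch, with example values only) is whether cross-table Region merge is enabled and whether the merge limits give PD enough room:

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» region check empty-region                  # how many empty Regions are left
» config show                                # check max-merge-region-size / max-merge-region-keys / merge-schedule-limit
» config set enable-cross-table-merge true   # allow empty Regions of the dropped table to merge across table boundaries
» config set merge-schedule-limit 8          # example value: give merge operators more quota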
The problematic machine has been manually restarted several times; it now does no work at all and only acts as a learner. There are also hundreds of Regions on it in the DOWN state.
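To list exactly which Regions have peers stuck in the DOWN or pending state on that store (a sketch, same placeholders as above):

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» region check down-peer      # Regions reporting a down peer
» region check pending-peer   # Regions with a peer lagging behind on Raft logs
» store 4                     # leader_count should be 0 if the node really only holds learners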
PD keeps trying to transfer leaders to this machine. After receiving the transfer request it performs the transfer normally, but then reports that its term is lower than the other peer's, so it keeps reverting to follower and nothing more happens.
The good news is that the overall performance of the cluster has returned to normal. I will try to destroy and rebuild the problematic node.
It seems like directly dropping a large table encountered some strange issues.
Latest progress: the Regions in the DOWN state keep trying to add the abnormal machine as a learner, timing out after 10 minutes, and retrying a while later.
In the PD logs there has never been an attempt to make the abnormal machine a leader. I'll continue to investigate why.
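One way to watch what PD is actually scheduling for those Regions (whether the add-learner operator is being retried, and whether any leader transfers target the abnormal store) is, for example:

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» operator show region    # in-flight Region operators (add learner, remove peer, merge, ...)
» operator show leader    # in-flight leader-transfer operators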
I found out that for some reason, this scheduler evicted all the leaders from node 4. I will continue to investigate.
» scheduler show
[
  "balance-hot-region-scheduler",
  "balance-leader-scheduler",
  "balance-region-scheduler",
  "label-scheduler",
  "evict-leader-scheduler"
]
» scheduler config evict-leader-scheduler
{
  "store-id-ranges": {
    "4": [
      {
        "start-key": "",
        "end-key": ""
      }
    ]
  }
}
I saw a post suggesting that when a node is restarted, its leaders are evicted first; if the restart is too slow and times out, this evict-leader scheduler is never removed.
For now, since there is a batch of Regions in the DOWN state on the abnormal node, we have decided not to let leaders be scheduled onto it until that is resolved.
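For later reference, once the DOWN Regions are sorted out, the leftover eviction on store 4 can be removed with one of the following (which form works depends on the PD version):

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» scheduler config evict-leader-scheduler delete-store 4   # drop only store 4 from the eviction list
» scheduler remove evict-leader-scheduler                  # or remove the whole scheduler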
I suspect the timeout is related to the issue in this post. We have a column that contains large JSON values, which keeps causing the operator to time out. Is there a way to adjust the timeout duration?
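I am not aware of a direct setting for the 10-minute operator timeout itself in this version. A workaround sketch (example values, and whether it helps depends on where the time actually goes) is to give snapshot-based learner catch-up more scheduling quota and bandwidth so it finishes inside the timeout:

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» store limit 4 15 add-peer      # more add-peer operator quota for store 4 (does not change the timeout itself)

# TiKV side (check the config reference for your version): raise snapshot bandwidth, e.g.
#   server.snap-max-write-bytes-per-sec: "200MB"
tiup cluster edit-config <cluster-name>
tiup cluster reload <cluster-name> -R tikv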