Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: How to troubleshoot high storage async snapshot duration (storage async snapshot duration 过高如何排查)
[TiDB Usage Environment]
Production Environment
[TiDB Version]
v5.1.2
[Reproduction Path]
The cluster has 4 TiKV machines holding about 20 TB of data in total. After a 3 TB table was dropped, one of the TiKV nodes has been showing abnormal metrics for half a month, characterized by:
- Normal CPU, no thread is fully utilized
- High apply log duration (reaching seconds)
- High storage async snapshot duration
- Normal IO
- Normal physical machine disk write latency
- Rate snapshot message is 0, while other machines are in the tens
- 99.99% snapshot KV count is 0, while other machines are at 1.5 million
- Approximate Region size is 2GB, while other machines are at 200MB
Currently, whenever the load increases, slow queries appear. They take about 1.3 seconds, with essentially all of that time spent in the prewrite phase (about 1.3 seconds).
[Resource Configuration]
500MB/s SSD
32C256G
It seems that there is an issue with region splitting. On the abnormal machine, the region size keeps growing after dropping the table, while on other machines, the region size remains relatively stable.
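A minimal pd-ctl checklist to confirm this, assuming the PD address below is a placeholder and the abnormal store id is 4 (the node identified later in this thread):

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» store 4                        # region_count, leader_count and used space on the abnormal store
» region topsize 10              # the 10 largest Regions in the cluster
» region check oversized-region  # Regions that grew past the split threshold without splitting
» region check empty-region      # empty Regions left behind by the dropped table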
You can use tiup install diag to install Clinic, then collect data over a period of time and upload it.
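A rough sketch of that flow, assuming a recent diag component (the exact flags and the upload step may differ between diag versions, and uploading requires a Clinic account/token):

# install the PingCAP Clinic collector
tiup install diag
# collect monitoring data and configs for a time window (cluster name and times are placeholders)
tiup diag collect <cluster-name> -f "<start-time>" -t "<end-time>"
# upload the data set produced by the previous step
tiup diag upload <collected-data-dir>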
Refer to this: for large tables, it is generally not recommended to drop them directly; you can truncate them first to release the space.
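As a sketch of that suggestion (table and connection details are hypothetical; the space is only reclaimed gradually by GC afterwards):

# TRUNCATE first so the old data is handed to GC, then DROP the now-empty table
mysql -h <tidb-host> -P 4000 -u root -p -e "TRUNCATE TABLE mydb.big_table;"
mysql -h <tidb-host> -P 4000 -u root -p -e "DROP TABLE mydb.big_table;"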
The data collection and desensitization process is quite complex, involving financial regulations from other countries.
I found that after the large table started to be dropped, tens of thousands of empty Regions appeared and could not be removed. The Region size on this machine keeps growing, and there are no errors in the TiKV log. I suspect the Regions have become too large, which causes problems when snapshots are generated.
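If the leftover empty Regions are not being merged away, one thing worth checking (a sketch, with example values only) is whether cross-table Region merge is enabled and whether the merge limits give PD enough room:

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» region check empty-region                  # how many empty Regions are left
» config show                                # check max-merge-region-size / max-merge-region-keys / merge-schedule-limit
» config set enable-cross-table-merge true   # allow empty Regions of the dropped table to merge across table boundaries
» config set merge-schedule-limit 8          # example value: give merge operators more quota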
The problematic machine has been manually restarted several times; it now does no work at all and only acts as a learner. There are also hundreds of Regions on it in the DOWN state.
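To list exactly which Regions have peers stuck in the DOWN or pending state on that store (a sketch, same placeholders as above):

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» region check down-peer      # Regions reporting a down peer
» region check pending-peer   # Regions with a peer lagging behind on Raft logs
» store 4                     # leader_count should be 0 if the node really only holds learners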
PD keeps trying to transfer leaders to this machine. After receiving the transfer request it performs the transfer normally, but then reports that its term is lower than the other peer's, so it keeps reverting to follower and nothing more happens.
The good news is that the overall performance of the cluster has returned to normal. I will try to destroy and rebuild the problematic node.
It seems like directly dropping a large table encountered some strange issues.
Latest progress: the Regions in the DOWN state keep trying to add the abnormal machine as a learner, timing out after 10 minutes, and retrying a while later.
In the PD logs there has never been an attempt to make the abnormal machine a leader. I'll continue to investigate why.
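One way to watch what PD is actually scheduling for those Regions (whether the add-learner operator is being retried, and whether any leader transfers target the abnormal store) is, for example:

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» operator show region    # in-flight Region operators (add learner, remove peer, merge, ...)
» operator show leader    # in-flight leader-transfer operators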
I found out that for some reason, this scheduler evicted all the leaders from node 4. I will continue to investigate.
» scheduler show
[
  "balance-hot-region-scheduler",
  "balance-leader-scheduler",
  "balance-region-scheduler",
  "label-scheduler",
  "evict-leader-scheduler"
]
» scheduler config evict-leader-scheduler
{
  "store-id-ranges": {
    "4": [
      {
        "start-key": "",
        "end-key": ""
      }
    ]
  }
}
I saw a post suggesting that when a node is restarted, its leaders are evicted first; if the restart is too slow and times out, this evict-leader scheduler is never removed.
For now, since there is a batch of Regions in the DOWN state on the abnormal node, we have decided not to let leaders be scheduled onto it until that is resolved.
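For later reference, once the DOWN Regions are sorted out, the leftover eviction on store 4 can be removed with one of the following (which form works depends on the PD version):

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» scheduler config evict-leader-scheduler delete-store 4   # drop only store 4 from the eviction list
» scheduler remove evict-leader-scheduler                  # or remove the whole scheduler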
I suspect the timeout is related to the issue in this post. We have a column that contains large JSON values, which keeps causing the operator to time out. Is there a way to adjust the timeout duration?
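I am not aware of a direct setting for the 10-minute operator timeout itself in this version. A workaround sketch (example values, and whether it helps depends on where the time actually goes) is to give snapshot-based learner catch-up more scheduling quota and bandwidth so it finishes inside the timeout:

# inside pd-ctl: tiup ctl:v5.1.2 pd -u http://<pd-ip>:2379 -i
» store limit 4 15 add-peer      # more add-peer operator quota for store 4 (does not change the timeout itself)

# TiKV side (check the config reference for your version): raise snapshot bandwidth, e.g.
#   server.snap-max-write-bytes-per-sec: "200MB"
tiup cluster edit-config <cluster-name>
tiup cluster reload <cluster-name> -R tikv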