Is there room for optimization for slow truncation on mechanical hard drives?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 机械硬盘下truncat慢是否有优化的空间

| username: 郑旭东石家庄

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path]
The server uses a 4TB mechanical hard drive. During the data cleaning process, truncating each intermediate table (a few MB to just over ten MB each) takes more than 100 seconds. All database parameters are at their defaults. I am wondering whether there is room for optimization to speed up table clearing.
[Encountered Problem: Problem Phenomenon and Impact]
During the data cleaning process, performing a truncate operation on an intermediate table of a few MB takes about 100 seconds.
[Resource Configuration]

[Attachment: Screenshot/Log/Monitoring]

The issue occurred between 2:00 AM and 2:30 AM.

| username: MrSylar | Original post link

Is truncate really that IO-intensive? Logically, this DDL only modifies metadata and shouldn't generate much IO, even on mechanical disks.

| username: TiDBer_小阿飞 | Original post link

First, analyze where the time is being spent:

mysql> SHOW PROFILES;

Find the corresponding Query_ID, then:

mysql> SHOW PROFILE FOR QUERY 1;

Check which status accounts for the largest duration.

| username: 郑旭东石家庄 | Original post link

The slow SQL analysis on the management panel can also work.

| username: lemonade010 | Original post link

Check if there is a metadata lock. I encountered this once before.
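A sketch of how to check this, assuming TiDB v6.3 or later (where metadata-lock diagnostics were introduced); the thread does not state the cluster version:

```sql
-- Lists DDL jobs currently blocked by metadata locks, together with the
-- sessions and queries holding those locks.
SELECT * FROM mysql.tidb_mdl_view;

-- If a long-running query is blocking the truncate, it can be ended with:
-- KILL <session_id>;
```

If this view returns rows during the 2:00-2:30 window, the truncates are waiting on locks rather than on disk IO.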

| username: lemonade010 | Original post link

This is the troubleshooting link from back then.

| username: 郑旭东石家庄 | Original post link

At the time, only data cleaning was running; there were no other DDL operations such as altering tables or creating indexes, just truncating batches of intermediate tables. Meanwhile, there were a large number of data queries and statistics jobs running.

The DDL job queue basically consisted of create table and truncate table tasks.
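The queue itself can be inspected directly; `ADMIN SHOW DDL JOBS` is standard TiDB syntax:

```sql
-- Shows recent DDL jobs with their state and timing.
ADMIN SHOW DDL JOBS;
```

Watch the state and timestamp columns: many truncate jobs stuck queueing, or long gaps between consecutive jobs, suggests a backlog at the single DDL owner rather than slow IO on any one truncate.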

| username: WinterLiu | Original post link

I suggest using SSDs. I once encountered a situation where installing MySQL on a mechanical disk in a certain cloud environment took ten minutes just to start up. After switching to an SSD, it started up in seconds.

| username: Kongdom | Original post link

Could it be that a large number of truncates accumulated, causing concurrency issues during GC?
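The GC settings and safepoint progress can be checked from the bookkeeping rows in `mysql.tidb` (long-standing TiDB internals, though exact variable names can vary by version):

```sql
-- GC interval, life time, last run time, and safepoint.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';

-- On v5.0+ many GC knobs are also exposed as system variables:
-- SHOW VARIABLES LIKE 'tidb_gc%';
```

A safepoint that stops advancing during the incident window would support the theory that accumulated truncates are contending with GC.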

| username: 小龙虾爱大龙虾 | Original post link

A normal mechanical disk shouldn't be this slow. Check the TiDB logs on every node for the period when the issue occurred.
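A sketch of scanning the logs for the incident window. The log path and message text below are illustrative assumptions, not taken from this thread; on a real deployment the log is typically `<deploy-dir>/log/tidb.log` on each tidb-server node:

```shell
# Create a tiny sample log so the filter below is runnable as-is;
# replace /tmp/tidb_sample.log with the real tidb.log path.
LOG=/tmp/tidb_sample.log
cat > "$LOG" <<'EOF'
[2024/05/28 02:05:01.000 +08:00] [INFO] [ddl] ["run DDL job" job="ID:101, Type:truncate table"]
[2024/05/28 02:06:40.000 +08:00] [INFO] [ddl] ["run DDL job" job="ID:102, Type:truncate table"]
[2024/05/28 02:07:30.000 +08:00] [WARN] [gc_worker] ["gc is running slowly"]
EOF

# Count truncate-table DDL jobs logged between 02:00 and 02:29.
grep ' 02:[0-2][0-9]:' "$LOG" | grep -c 'truncate table'
```

Repeating this per node shows whether the slowdown correlates with one TiDB instance (e.g. the DDL owner) or appears cluster-wide.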

| username: 郑旭东石家庄 | Original post link

Initially, clearing a table took a bit more than 300 milliseconds. Later the time fluctuated, and eventually it kept increasing and stayed high.

| username: TiDBer_刚 | Original post link

A hot table can take a long time to truncate; I've run into this in MySQL.

| username: 郑旭东石家庄 | Original post link

This is the GC monitoring of TiKV.

It started truncating from 00:10 and slowed down by 00:15. The truncation time continued to increase afterward.

| username: zhaokede | Original post link

Does the DDL job queue take a long time?

| username: 郑旭东石家庄 | Original post link

This is currently unclear; it is a task that runs automatically in the early morning.

| username: Kongdom | Original post link

:thinking: Does it run automatically in synchronous or asynchronous mode? Will asynchronous execution cause parallel execution or backlog?

| username: 郑旭东石家庄 | Original post link

Synchronous, multi-threaded.

| username: FutureDB | Original post link

It is possible that cleanup tasks have backed up. Previously, when we dropped expired partitions from partitioned tables in parallel, we also saw a backlog of tasks.

| username: Kongdom | Original post link

Is it like this every day? This can't really execute fully synchronously, because GC is an asynchronous operation. For the end-of-day job, could it be changed to single-threaded?

| username: 鱼跃龙门 | Original post link

It could be the wait caused by a metadata lock.