Dashboard Export of Slow Queries Causes Memory Surge and Triggers OOM

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Dashboard导出慢查引发内存暴涨触发OOM

| username: TiDB_C罗

[TiDB Usage Environment] Production Environment / Test / PoC
[TiDB Version]
[Reproduction Path] Exporting the last 30 minutes of slow queries on the Dashboard triggers OOM
[Encountered Problem: Phenomenon and Impact]
Since the slow query view in the Dashboard can only open entries one at a time, I wanted to export them locally to optimize them one by one.
I selected the last 30 minutes as the time range, but there was no response when exporting.

Checking the monitoring:

Two of the machines experienced OOM; fortunately, one was still running.

Checking the instance that experienced OOM:

The top 10 SQL statements by time cost from the OOM analysis:
SQL 0:
cost_time: 1.376497381s
conn_id: 206183
user: dashboard_user
table_ids: [4611686018427387951]
txn_start_ts: 443004339273596931
mem_max: 25276 Bytes (24.7 KB)
sql: SELECT Digest, Query, INSTANCE, DB, Conn_ID, Succ, (UNIX_TIMESTAMP(Time) + 0E0) AS timestamp, Query_time, Parse_time, Compile_time, Rewrite_time, Preproc_subqueries_time, Optimize_time, Wait_TS, Cop_time, LockKeys_time, Write_sql_response_total, Exec_retry_time, Mem_max, Disk_max, Txn_start_ts, Prev_stmt, Plan, Is_internal, Index_names, Stats, Backoff_types, User, Host, Process_time, Wait_time, Backoff_time, Get_commit_ts_time, Local_latch_wait_time, Resolve_lock_time, Prewrite_time, Wait_prewrite_binlog_time, Commit_time, Commit_backoff_time, Cop_proc_avg, Cop_proc_p90, Cop_proc_max, Cop_wait_avg, Cop_wait_p90, Cop_wait_max, Write_keys, Write_size, Prewrite_region, Txn_retry, Request_count, Process_keys, Total_keys, Cop_proc_addr, Cop_wait_addr, Rocksdb_delete_skipped_count, Rocksdb_key_skipped_count, Rocksdb_block_cache_hit_count, Rocksdb_block_read_count, Rocksdb_block_read_byte FROM `INFORMATION_SCHEMA`.`CLUSTER_SLOW_QUERY` WHERE Time BETWEEN FROM_UNIXTIME(?) AND FROM_UNIXTIME(?) ORDER BY Query_time DESC LIMIT 100


Why is this query so harmful?

| username: tidb狂热爱好者 | Original post link

It's just a large amount of normal data.

| username: tidb狂热爱好者 | Original post link

I export everything using `select limit 1`.

| username: MrSylar | Original post link

This is currently not a recommended usage pattern.

| username: 像风一样的男子 | Original post link

You can query the INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY table in the database and export from it directly; slow queries are stored there.
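For example, a much narrower query that pulls only a few essential columns over a short window (column names taken from the Dashboard query above; the time range is a placeholder to adjust for your case):

```sql
-- Export only the essentials over a short window instead of the
-- Dashboard's 57-column, 30-minute query. Timestamps are examples.
SELECT Time, INSTANCE, DB, Query_time, Mem_max, Digest, Query
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Time BETWEEN '2023-08-01 10:00:00' AND '2023-08-01 10:10:00'
ORDER BY Query_time DESC
LIMIT 100;
```

Selecting fewer columns and a shorter window reduces how much slow-log data each TiDB node has to parse and ship for the final sort.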

| username: h5n1 | Original post link

This bug has already been fixed in your version. Try reducing the time range and see if the issue persists; if it still occurs even within a very short time range, it is likely a bug.

| username: 裤衩儿飞上天 | Original post link

You can remove some old, unused slow logs to speed things up, though you would need to remove them on each TiDB node.

| username: TiDB_C罗 | Original post link

I am indeed using version 5.4.0; the version selected above was incorrect.

| username: h5n1 | Original post link

Then upgrade it.

| username: TiDB_C罗 | Original post link

What is the relationship between the INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY table and TiDB's local slow query file, and which one is generated first?

| username: tidb狂热爱好者 | Original post link

Check out this article I wrote: [link]. It specifically covers exporting slow SQL queries. The export intervals need to be short because, during a failure, it's common for billions of slow SQL entries to pile up.
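One way to keep each interval safe is to probe the volume of a candidate window before running the full export (timestamps below are hypothetical):

```sql
-- Count entries in the candidate window first; if the count is huge,
-- shrink the window before exporting the full rows.
SELECT COUNT(*)
FROM INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY
WHERE Time BETWEEN '2023-08-01 10:00:00' AND '2023-08-01 10:05:00';
```

If the count is large, split the range into smaller windows and export them one at a time.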

| username: 裤衩儿飞上天 | Original post link

The default value of `tidb_distsql_scan_concurrency` is 15.
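If the export query still consumes too much memory, lowering this variable at the session level before running it may help; this is a sketch to verify on your own version, and the value 4 is just an illustrative choice:

```sql
-- Reduce distsql scan concurrency for the current session only;
-- the cluster-wide default remains 15.
SET SESSION tidb_distsql_scan_concurrency = 4;

-- Confirm the session-level value before running the export query.
SELECT @@tidb_distsql_scan_concurrency;
```

A lower scan concurrency trades export speed for a smaller memory footprint on the TiDB node running the query.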

| username: 像风一样的男子 | Original post link

It is recommended to regularly archive the slow.log files under TiDB.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.