Slow Queries After Power Outage and Restart in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB断电重启后慢查询

| username: Dais

[TiDB Usage Environment] Test Environment
[TiDB Version] V6.5.2
[Reproduction Path]
Earlier, the server room had an issue that caused a power outage and an unexpected restart.
[Encountered Problem]
The TS_DOCKPF_RECORD table has 8,000 rows.
The TS_DOCKPF table has 22 rows.
Executing the following SQL query takes more than 55 seconds:
SELECT
    T.RECORD_REQ AS recordReq
FROM
    TS_DOCKPF_RECORD T
LEFT JOIN TS_DOCKPF T1 ON
    T1.RID = T.DOCKPF_ID
WHERE T1.UID = '78AA82AA0C01474C88238161234567890'
LIMIT 10
If the WHERE condition on T1.UID is removed, the query takes only tens of milliseconds.
There is an index on T1.UID. The special thing here is that the selected column T.RECORD_REQ is of type longtext and holds large text content.
Spot checks show that other business SQL statements run normally.

[Resource Configuration]


[Execution Plan of the Above SQL]

My preliminary guess is that the sudden power outage and restart in the server room might have caused region loss on one of the TiKV nodes, but I am not sure. I don’t know how to pinpoint exactly where the problem is or which TiKV node has the issue. Please help me take a look.
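(For reference, a first check along these lines would be whether any TiKV store reports an abnormal state; a sketch using the standard INFORMATION_SCHEMA.TIKV_STORE_STATUS table, column list shortened:)

-- Sketch: list TiKV stores and their states (Up / Offline / Down / Tombstone).
SELECT STORE_ID, ADDRESS, STORE_STATE_NAME, LEADER_COUNT, REGION_COUNT
FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS;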

| username: tidb菜鸟一只 | Original post link

This should have nothing to do with region corruption; it’s simply an issue with the SQL execution plan. You mentioned that removing the WHERE condition on T1.UID makes the query take only tens of milliseconds. Could you post the execution plan for that scenario?

| username: Dais | Original post link

Execution plan without the condition on T1.UID

If the condition on T1.UID is placed in the ON clause instead, the query takes only a few milliseconds:
EXPLAIN ANALYZE
SELECT
    T.RECORD_REQ AS recordReq
FROM
    TS_DOCKPF_RECORD T
LEFT JOIN TS_DOCKPF T1 ON
    T1.RID = T.DOCKPF_ID AND T1.UID = '78AA82AA0C01474C88238161234567890'
LIMIT 10

Also, if the query result does not include the large field and only selects common columns such as rid, name, etc., the query is fast.
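For example, the fast variant looks roughly like this (a sketch; RID and DOCKPF_ID come from the join above, any other small column behaves the same way):

-- Sketch: select small columns instead of the longtext RECORD_REQ field.
SELECT
    T.DOCKPF_ID,
    T1.RID
FROM
    TS_DOCKPF_RECORD T
LEFT JOIN TS_DOCKPF T1 ON
    T1.RID = T.DOCKPF_ID
WHERE T1.UID = '78AA82AA0C01474C88238161234567890'
LIMIT 10;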
I have communicated with people from TiDB, and they suspect that it might be due to region loss on the TiKV node.

| username: tidb菜鸟一只 | Original post link

Your execution plan doesn’t look right. If you remove the WHERE condition on T1.UID, does the join disappear entirely? The execution plan you posted looks like it is for a single table.

| username: 像风一样的男子 | Original post link

In the dashboard, there is a log search feature. Check if there are any errors reported by each node.

| username: Dais | Original post link

I removed the WHERE condition.


To tell the two plans apart, I added the RID field from table T1 to the query’s select list.

| username: Dais | Original post link

This is the error log from yesterday afternoon.


At that time, the business scenario was: as soon as our business system started connecting to the database, the database became sluggish and latency went up. Later we found a timer in the business system that was periodically running the query above, and after analysis we determined that this SQL was both too slow and executed too frequently.
After turning off the corresponding timer, the database returned to normal. Other query SQL ran normally; only the SQL above still took more than 50 seconds to execute.

| username: 像风一样的男子 | Original post link

The slow query caused TiDB to OOM.
After optimizing the slow SQL, you can also limit the memory usage of the TiDB instance as a whole and of individual statements/sessions.
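For reference, a sketch of the relevant limits (assuming the v6.5 system variables; the values are examples, not recommendations):

-- Sketch: cap memory at the instance level and per statement (example values).
SET GLOBAL tidb_server_memory_limit = '8GB';   -- limit for the whole tidb-server process
SET GLOBAL tidb_mem_quota_query = 1073741824;  -- per-statement quota, in bytes (1 GiB)
SET GLOBAL tidb_mem_oom_action = 'CANCEL';     -- cancel the statement when the quota is exceeded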

| username: zhanggame1 | Original post link

How fast is the query on T1.UID = '78AA82AA0C01474C88238161234567890' against that table by itself?
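For reference, the standalone lookup can be checked like this (a sketch, reusing the UID value from the original query):

-- Sketch: confirm the lookup on TS_DOCKPF alone uses the UID index and is fast.
EXPLAIN ANALYZE
SELECT RID
FROM TS_DOCKPF T1
WHERE T1.UID = '78AA82AA0C01474C88238161234567890';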

| username: Kongdom | Original post link

Check the monitoring in Grafana to see if there are any abnormal resource conditions and confirm whether it is an OOM (Out of Memory) issue.

| username: 小龙虾爱大龙虾 | Original post link

The join disappears after you remove the WHERE condition on T1.UID because it was eliminated during logical optimization: the optimizer found that not reading the T1 table does not affect the query result.
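In other words, with the WHERE condition removed the statement is roughly equivalent to a single-table query (a sketch, assuming T1.RID is unique so the LEFT JOIN can be dropped without changing the row count):

-- Sketch: what the statement effectively becomes after outer join elimination.
SELECT
    T.RECORD_REQ AS recordReq
FROM
    TS_DOCKPF_RECORD T
LIMIT 10;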

| username: Soysauce520 | Original post link

You can check region health in the Grafana monitoring Overview: under the PD section there is a graph of the region status.

| username: 小龙虾爱大龙虾 | Original post link

From the execution plan, it appears that some table lookups have slowed down. You can check the slow logs to see which specific TiKV is slow and whether there are any anomalies with that TiKV.

| username: Dais | Original post link

Could you please specify which logs and what content need to be reviewed? Could you provide specific guidance? Thank you.

| username: Dais | Original post link

The image is not visible.

| username: 有猫万事足 | Original post link

The main issue is the slow step here. Please post the complete execution info for that line so we can check whether the number of scanned keys is abnormal.

It seems like it could be a GC issue or too many historical versions.
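For reference, the GC status and the retained history window can be checked like this (a sketch; these values are stored in the mysql.tidb table):

-- Sketch: show GC safe point, GC life time, and last GC run time.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';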

| username: Dais | Original post link

time: 1m55.7s, loops: 11, cop_task: {num: 16, max: 49.6s, min: 1.75ms, avg: 14.3s, p95: 49.6s, max_proc_keys: 637, p95_proc_keys: 637, tot_proc: 4.73s, tot_wait: 8ms, rpc_num: 16, rpc_time: 3m48.7s, copr_cache_hit_ratio: 0.06, distsql_concurrency: 15}, tikv_task: {proc max: 196ms, min: 9ms, avg: 109.1ms, p80: 144ms, p95: 196ms, iters: 71, tasks: 16}, scan_detail: {total_process_keys: 7019, total_process_keys_size: 1281178524, total_keys: 7039, get_snapshot_time: 12.2ms, rocksdb: {key_skipped_count: 14018, block: {cache_hit_count: 6702, read_count: 510, read_byte: 7.20 MB, read_time: 4.86ms}}}

| username: tidb菜鸟一只 | Original post link

It seems to be an issue with the dockpf_id index on table T. This table only has 8000 rows, right? Try rebuilding this index and see if it helps.
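A sketch of the check and rebuild (the index name idx_dockpf_id is an assumption; use SHOW INDEX FROM TS_DOCKPF_RECORD to find the real name):

-- Sketch: verify data/index consistency first, then drop and re-create the index.
ADMIN CHECK TABLE TS_DOCKPF_RECORD;
ALTER TABLE TS_DOCKPF_RECORD DROP INDEX idx_dockpf_id;             -- assumed index name
ALTER TABLE TS_DOCKPF_RECORD ADD INDEX idx_dockpf_id (DOCKPF_ID);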

| username: 有猫万事足 | Original post link

It should be unrelated to GC.

I can’t see any anomalies, except that total_process_keys_size: 1281178524 (about 1.2 GB) is a bit large. But the data read from just over 7,000 rows isn’t that large either.

| username: h5n1 | Original post link

The slow part is the table lookup (the back-to-table read) on table T. Use TRACE SELECT to see which region is slow, then look up the region leader’s TiKV through TIKV_REGION_STATUS and check that node for anomalies, network delays, or thread CPU issues.
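For example (a sketch; the INFORMATION_SCHEMA tables are standard in TiDB, and 'your_db' is a placeholder for the actual schema name):

-- Sketch: trace the statement to see where the time goes.
TRACE FORMAT = 'row'
SELECT T.RECORD_REQ AS recordReq
FROM TS_DOCKPF_RECORD T
LEFT JOIN TS_DOCKPF T1 ON T1.RID = T.DOCKPF_ID
WHERE T1.UID = '78AA82AA0C01474C88238161234567890'
LIMIT 10;

-- Sketch: list the regions of TS_DOCKPF_RECORD with their leader stores.
SELECT r.REGION_ID, r.IS_INDEX, p.STORE_ID, s.ADDRESS, p.STATUS
FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS r
JOIN INFORMATION_SCHEMA.TIKV_REGION_PEERS p
  ON p.REGION_ID = r.REGION_ID AND p.IS_LEADER = 1
JOIN INFORMATION_SCHEMA.TIKV_STORE_STATUS s
  ON s.STORE_ID = p.STORE_ID
WHERE r.DB_NAME = 'your_db' AND r.TABLE_NAME = 'TS_DOCKPF_RECORD';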