TiKV IO Surge

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv io剧增

| username: simple-hx

[TiDB Usage Environment] Production Environment
[TiDB Version] 4.0.12
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Symptoms and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

Could someone help with this? Our production TiDB 4.0.12 cluster periodically sees TiKV CPU saturation. While the CPU is saturated, every SQL statement runs slowly and the system becomes unusable. Once we take the affected machine offline, the database returns to normal. This happens roughly every two months. Could someone look at the monitoring data and help troubleshoot? During each incident, all SQL statements show up as slow queries, so we cannot single out specific problem statements from the slow query log.

tidb-dwgl-TiKV-Details-1682480684745.json (1.7 MB)

| username: 裤衩儿飞上天 | Original post link

  1. Please share the cluster topology and the related logs.
  2. Are there any scheduled tasks?
  3. Is there a significant increase in I/O or is the CPU fully utilized? Is the CPU fully utilized on all TiKV nodes or just one?
  4. You can analyze slow SQL by running pt-query-digest on the slow logs from the problematic time period.

| username: tidb菜鸟一只 | Original post link

How many TiKV nodes are there? And if only one of them has the issue, does taking that node offline resolve the problem?

| username: simple-hx | Original post link

We previously had 4 TiKV nodes, each with 16 cores and 32 GB of memory. When this issue first appeared last year, we suspected a disk problem and took the affected node offline, which left 3 nodes. The issue reappeared today, so we took the problematic machine offline again, and after a while the service recovered.

| username: simple-hx | Original post link

We are investigating the slow SQL but cannot pinpoint the cause, because rerunning the slow statements shows no delay. There are no scheduled tasks. The cluster has 3 PD, 3 TiDB, and 3 TiKV nodes (previously 4 TiKV nodes, but one was removed).

| username: simple-hx | Original post link

TiDB and PD are deployed on the same machines.

| username: 裤衩儿飞上天 | Original post link

  1. For slow SQL, use pt-query-digest to analyze the slow logs from the problematic time period, and look at the top few entries first.
  2. You were already down to three nodes, and you still took one offline…
    The surprising part is that business access didn’t get slower, but actually recovered…