How to Troubleshoot When the Database Suddenly Stalls and Affects All Business Operations

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 数据库突然卡顿一下,业务全部收到影响,如何排查原因。

| username: TiDBer_Y2d2kiJh

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v5.4.0 (2 TiDB, 3 PD, 3 TiKV, 2 HA)
[Reproduction Path] Around 14:57 the business was interrupted and query latency was very high. I/O utilization on one of the TiKV nodes reached 100%, but there were no particularly severe slow queries. How should I troubleshoot this? The monitoring screenshots were attached to the original post.
[Encountered Problem: Symptoms and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]





| username: h5n1 | Original post link

In Grafana, check TiKV-Details → Errors and TiDB → KV Errors to see whether anything shows up around that time.

| username: tidb菜鸟一只 | Original post link

Check the logs around that point in time, focusing mainly on the errors.
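
As a rough illustration of narrowing the logs down to the incident window, here is a minimal Python sketch. It assumes the logs use the TiDB/TiKV unified log format (`[timestamp] [LEVEL] [source] ["message"]`). The function name `filter_log_lines`, the default level list, and the sample lines are made up for the example and are not taken from this cluster. The timezone offset in each line is ignored, so the window must be given in the log's own local time.

```python
import re
from datetime import datetime

# Start of a unified-format line: [2023/05/10 14:57:03.250 +08:00] [WARN] ...
LOG_LINE = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) [+-]\d{2}:\d{2}\] \[(\w+)\]"
)


def filter_log_lines(lines, start, end, levels=("WARN", "ERROR", "FATAL")):
    """Return lines whose timestamp is within [start, end] and whose level is in levels."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip continuation lines and lines in other formats
        ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f")
        if start <= ts <= end and m.group(2) in levels:
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, iterate over an opened tikv.log or tidb.log.
    sample = [
        '[2023/05/10 14:56:59.100 +08:00] [INFO] [a.rs:1] ["heartbeat"]',
        '[2023/05/10 14:57:03.250 +08:00] [WARN] [b.rs:2] ["slow write"]',
    ]
    window = (datetime(2023, 5, 10, 14, 55), datetime(2023, 5, 10, 15, 0))
    for hit in filter_log_lines(sample, *window):
        print(hit)
```

Running the same filter over the logs of every TiKV and TiDB node and comparing the results makes it easier to see whether only one node was complaining during the stall.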

| username: zhanggame1 | Original post link

Judging from the screenshots, there does seem to have been a stall for a period of time.
First, check the database logs for any errors.
Then, check the operating system logs for any anomalies.
Also, investigate the network situation to see if there is any network lag.
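
For the operating system log step, a small scan for storage-related and link-related kernel messages can speed things up. The sketch below is only an illustration: the pattern list is an assumption covering a few commonly seen Linux kernel messages, not an exhaustive catalogue, and the name `scan_os_log` is made up for the example.

```python
import re

# Illustrative patterns only (an assumption, not a complete list).
PATTERNS = {
    "io_error": re.compile(r"I/O error", re.IGNORECASE),
    "hung_task": re.compile(r"blocked for more than \d+ seconds"),
    "link_down": re.compile(r"link is down", re.IGNORECASE),
}


def scan_os_log(lines):
    """Group suspicious log lines by the name of the pattern they match."""
    found = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                found[name].append(line.rstrip("\n"))
    return found


if __name__ == "__main__":
    # The log path differs between distributions; /var/log/messages is an assumption.
    with open("/var/log/messages", errors="replace") as f:
        for name, matches in scan_os_log(f).items():
            print(name, len(matches))
```

Matches clustered around the time of the stall on a single host would point toward that host's disk or its path to the storage rather than toward the database itself.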

| username: TiDBer_oHSwKxOH | Original post link

Some general suggestions:

  1. Regularly clean up large tables
  2. Establish an audit system to eliminate bad habits like “select *”
  3. Prioritize solving the issues that consume the most resources during this period

| username: TiDBer_Y2d2kiJh | Original post link

The operating system logs point to a problem with the storage, and the symptoms are consistent with that.

| username: linnana | Original post link

Is there a storage failure on the TiKV machine that showed the high I/O?
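
To confirm on the host itself whether the disk is saturated, utilization can be estimated from two snapshots of `/proc/diskstats`. On Linux, the 13th field of each line is the number of milliseconds the device has spent doing I/O, and tools such as `iostat` derive `%util` from the same counter. The following is a minimal sketch. The helper names and the device name `sdb` are assumptions for illustration only.

```python
import time


def io_ticks(diskstats_text, device):
    """Return the milliseconds spent doing I/O for `device` from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields[2] is the device name, fields[12] is the time spent doing I/O (ms)
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def utilization_pct(before, after, device, interval_ms):
    """Percentage of the interval during which the device was busy, capped at 100."""
    busy_ms = io_ticks(after, device) - io_ticks(before, device)
    return min(100.0, 100.0 * busy_ms / interval_ms)


if __name__ == "__main__":
    device = "sdb"  # hypothetical device name; use the disk backing the TiKV data dir
    with open("/proc/diskstats") as f:
        before = f.read()
    time.sleep(1)
    with open("/proc/diskstats") as f:
        after = f.read()
    print(f"{device}: {utilization_pct(before, after, device, 1000):.1f}% busy")
```

A disk that stays near 100% busy while throughput is low usually suggests that requests are waiting on the device or on the path to it, which would fit a problem on the storage side.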

| username: TiDBer_Y2d2kiJh | Original post link

There might be an issue with the network cable connecting the switch to a port on the storage server. The storage for this TiKV is not on the same LUN as the storage for the other two TiKVs.

| username: zhanggame1 | Original post link

When we used Oracle before, we ran into similar stalls. We later discovered that one of the several fiber-optic links to the storage was dropping packets.

| username: cy6301567 | Original post link

Check the TiDB and TiKV monitoring for that period.
