Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: The database suddenly stalled for a moment and all business was affected. How do I troubleshoot the cause?
[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] v5.4.0 (2 TiDB, 3 PD, 3 TiKV, 2 HA)
[Reproduction Path] Around 14:57 the business was interrupted, and query latency was very high at that time. I/O utilization on one of the TiKV nodes reached 100%, but there were no particularly severe slow queries. How should I troubleshoot this issue? Screenshots are as follows:
[Encountered Problem: Symptoms and Impact]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
Check the TiKV-Details → Errors and TiDB → KV Errors panels to see whether anything shows up there.
Check the logs around that point in time, focusing mainly on the errors.
Looking at the picture, there seems to be a time lag.
First, check the database logs for any errors.
Then, check the operating system logs for any anomalies.
Also, investigate the network situation to see if there is any network lag.
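For the log checks suggested above, a small script can narrow things down to the window around the stall. This is only a minimal sketch: it assumes the logs use the bracketed header that TiDB and TiKV write (`[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ] [LEVEL] ...`), it compares timestamps as written without converting time zones, and the function name `filter_log_lines` and the sample date are made up for illustration.

```python
import re
from datetime import datetime

# Header of a TiDB/TiKV log line, e.g.
# [2022/05/10 14:57:03.250 +08:00] [ERROR] [file.rs:42] ["message"]
LOG_HEADER = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\.\d+ [+-]\d{2}:\d{2}\] \[(\w+)\]"
)


def filter_log_lines(lines, start, end, levels=("WARN", "ERROR", "FATAL")):
    """Return log lines inside [start, end] whose level is in `levels`.

    Lines without a recognizable header (stack traces, blank lines)
    are skipped. Timestamps are compared in the log's own local time.
    """
    selected = []
    for line in lines:
        match = LOG_HEADER.match(line)
        if match is None:
            continue
        timestamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        if start <= timestamp <= end and match.group(2) in levels:
            selected.append(line.rstrip("\n"))
    return selected


if __name__ == "__main__":
    # Hypothetical path and date; point these at the real tikv.log / tidb.log
    # and the day of the incident.
    with open("tikv.log", encoding="utf-8", errors="replace") as log_file:
        hits = filter_log_lines(
            log_file,
            start=datetime(2022, 5, 10, 14, 50),
            end=datetime(2022, 5, 10, 15, 5),
        )
    for hit in hits:
        print(hit)
```

Running it against each TiKV and TiDB log in turn makes it easier to see whether only the node with the saturated disk was logging errors at 14:57.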
According to the operating system logs, the problem is with the storage. The symptoms are consistent with this:
Is there a storage failure on the TiKV machine that showed the high I/O?
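To confirm which disk was saturated on that machine, the busy percentage can be sampled directly on the host. The sketch below is one way to do it on Linux, assuming `/proc/diskstats` is available. It reads the "milliseconds spent doing I/O" counter (the 13th field of each line) twice and reports the share of the interval the device was busy, which is roughly what `iostat` shows as %util. The function names are hypothetical.

```python
import time


def read_io_ticks(diskstats_text):
    """Map device name -> milliseconds spent doing I/O.

    Expects the text of /proc/diskstats; the counter is the 13th
    whitespace-separated field on each line (index 12).
    """
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 14:
            ticks[fields[2]] = int(fields[12])
    return ticks


def utilization_percent(before, after, interval_ms):
    """Percent of the interval each device spent busy, capped at 100."""
    return {
        device: min(100.0, 100.0 * (after[device] - before[device]) / interval_ms)
        for device in after
        if device in before
    }


def sample_utilization(interval_s=1.0, path="/proc/diskstats"):
    """Take two snapshots `interval_s` seconds apart and compare them."""
    with open(path) as stats:
        before = read_io_ticks(stats.read())
    time.sleep(interval_s)
    with open(path) as stats:
        after = read_io_ticks(stats.read())
    return utilization_percent(before, after, interval_s * 1000.0)


if __name__ == "__main__":
    for device, busy in sorted(sample_utilization().items()):
        print(f"{device}: {busy:.1f}% busy")
```

A device that stays near 100% busy while its throughput is low usually points at the storage path (LUN, cabling, controller) rather than at the database workload.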
There might be an issue with the network cable connecting the switch to a port on the storage server. The storage for this TiKV is not on the same LUN as the storage for the other two TiKVs.
When we used Oracle before, we also encountered similar lags. Later, we discovered that one of the multiple fiber optic cables to the storage was dropping packets.
Check the TiDB and TiKV monitoring for that period.