Query Becomes Slow, Recovers After Shutting Down One TiKV Node

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 查询变卡,关掉其中一个tikv节点后恢复 (Query becomes laggy, recovers after shutting down one of the TiKV nodes)

| username: zengchaojie

[TiDB Version] 5.4.1
[Encountered Issue] Around 3 PM, TiDB queries suddenly became slow and DM also stopped syncing. We later found that CPU usage on one of the TiKV nodes (7 TiKV nodes in total) was very high. After stopping that TiKV node, the cluster returned to normal.

| username: anafin | Original post link

To add: The logs show a lot of “check leader rpc costs too long”.
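
For reference, a quick way to gauge how frequent this message is in a TiKV log (the log path below is a placeholder; point it at the node's actual log file):

```shell
# Count occurrences of the slow check-leader RPC warning and show the latest few,
# to see when the problem started. The path is a placeholder for your deploy dir.
grep -c "check leader rpc costs too long" /path/to/tikv.log
grep "check leader rpc costs too long" /path/to/tikv.log | tail -n 5
```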

| username: songxuecheng | Original post link

  1. Complete TiKV monitoring
  2. The heatmap at that time
  3. Logs of the problematic TiKV node

| username: anafin | Original post link

Attached are the TiKV monitoring data and the TiKV log files of the faulty node (192.168.10.70); the logs were too large, so INFO-level entries have been filtered out.
Fault window: 15:00 to 17:27 on the 25th. During this period we did a round of restarts of the TiKV, TiFlash, and TiDB nodes, with no effect. SQL was very slow for both queries and updates (SQL QPS was lower than during normal periods because some traffic had already been diverted). After the faulty node was stopped at 17:27, SQL processing returned to normal.
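
For reference, stopping (and later restarting) just one TiKV instance with TiUP looks roughly like this; the cluster name is a placeholder and 20160 is the default TiKV port, so adjust both to your deployment:

```shell
# Stop only the faulty TiKV instance; the rest of the cluster keeps serving.
tiup cluster stop <cluster-name> -N 192.168.10.70:20160

# Bring it back later once the cause has been found.
tiup cluster start <cluster-name> -N 192.168.10.70:20160
```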

| username: anafin | Original post link

Where can I see the heatmap?

| username: TiDBer_jYQINSnf | Original post link

Looking at the logs, there are many “key is locked” messages. Refer to the following post:

| username: zengchaojie | Original post link

This is the TiKV monitoring for the 3 hours of unavailability.

| username: anafin | Original post link

This analysis doesn’t explain why the cluster returned to normal after stopping this KV node; it doesn’t seem to be the root cause.
Additional information:

Compared with the other KV nodes, this node has a better CPU configuration. During the failure its memory was normal, disk usage was below 60%, and IO utilization, IO volume, and network traffic all decreased (which I take as expected, since overall traffic had dropped). The anomaly: its CPU usage was much higher than that of the other six healthy KV nodes.

| username: TiDBer_jYQINSnf | Original post link

Let me take a look at the detailed logs that were exported.

| username: TiDBer_jYQINSnf | Original post link

Looking at 15:00, Raft write requests were very high and write latency also increased. Was there any change in your business traffic at that time?
Also, did the machine's IO behavior change at that point?
Read requests on the .70 machine are also very high and unbalanced. Is there a hotspot?
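
One way to check for a hotspot is pd-ctl's hot statistics, for example through TiUP; the PD address below is a placeholder:

```shell
# Current read and write hotspots as seen by PD.
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 hot read
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 hot write

# Per-store hot statistics, useful for spotting an unbalanced store.
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 hot store
```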

| username: zengchaojie | Original post link

There hasn’t been much change in the business. This is the machine’s monitoring: Prometheus monitoring_2022-07-26T07_31_22.362Z.json (1.1 MB)

| username: TiDBer_jYQINSnf | Original post link

I see a lot of “transport full” errors.
Which store ID is this? TiKV keeps failing to send messages to it.

| username: TiDBer_jYQINSnf | Original post link

From the above, it looks like .53 can’t be reached, right? Check what’s going on with .53.

| username: songxuecheng | Original post link

Check store 1422582; it seems to have a problem.

“Avoid sending requests to unhealthy TiKV nodes to improve availability” went into version 5.4.2, so consider upgrading.
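
If you do upgrade, the usual TiUP flow is roughly the following; the cluster name is a placeholder, and it's worth reviewing the 5.4.2 release notes first:

```shell
# Make sure the tiup cluster component itself is up to date, then upgrade in place.
tiup update cluster
tiup cluster upgrade <cluster-name> v5.4.2
```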

| username: zengchaojie | Original post link

It should be a manual restart.

| username: zengchaojie | Original post link

How do you check the store_id mentioned in the reply above?

| username: TiDBer_jYQINSnf | Original post link

The pd-ctl store command shows which TiKV corresponds to the store ID.
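
For example, to look up store 1422582 mentioned above (the PD address is a placeholder):

```shell
# Show the address, state, and labels of a single store by ID.
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 store 1422582

# Or list all stores and pull out the fields needed to map IDs to machines.
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 store | grep -E '"id"|"address"|"state_name"'
```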

| username: zengchaojie | Original post link

Found this table. According to the above reply, both 10.52 and 10.53 have issues?

| username: TiDBer_jYQINSnf | Original post link

Are .52 and .53 TiFlash? It looks like the problem is that they can’t be reached. As for whether that is the cause of the sudden CPU increase, it’s uncertain.

| username: zengchaojie | Original post link

52 is TiKV, 53 is TiFlash.