Query Becomes Slow, Recovers After Shutting Down One TiKV Node

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 查询变卡,关掉其中一个tikv节点后恢复 (Query becomes laggy, recovers after shutting down one of the TiKV nodes)

| username: zengchaojie

[TiDB Version] 5.4.1
[Encountered Issue] Around 3 PM, TiDB queries suddenly became slow and DM also stopped syncing. We later found that CPU usage on one of the TiKV nodes (7 TiKV nodes in total) was very high. After stopping that TiKV node, the cluster returned to normal.

| username: anafin | Original post link

To add: The logs show a lot of “check leader rpc costs too long”.
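
For reference, a quick way to gauge how frequent this message is in a TiKV log (the log path below is a placeholder; point it at the node's actual log file):

```shell
# Count occurrences of the slow check-leader RPC warning and show the latest few,
# to see when the problem started. The path is a placeholder for your deploy dir.
grep -c "check leader rpc costs too long" /path/to/tikv.log
grep "check leader rpc costs too long" /path/to/tikv.log | tail -n 5
```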

| username: songxuecheng | Original post link

  1. Complete TiKV monitoring
  2. The heatmap at that time
  3. Logs of the problematic TiKV node

| username: anafin | Original post link

Attached are the TiKV monitoring data and the TiKV log files of the faulty node (192.168.10.70); the logs were too large, so INFO-level entries have been filtered out.
Fault window: 15:00 to 17:27 on the 25th. During this period we did a round of restarts of the TiKV, TiFlash, and TiDB nodes, with no effect. SQL was very slow for both queries and updates (SQL QPS was lower than during normal periods because some traffic had already been diverted). After the faulty node was stopped at 17:27, SQL processing returned to normal.
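
For reference, stopping (and later restarting) just one TiKV instance with TiUP looks roughly like this; the cluster name is a placeholder and 20160 is the default TiKV port, so adjust both to your deployment:

```shell
# Stop only the faulty TiKV instance; the rest of the cluster keeps serving.
tiup cluster stop <cluster-name> -N 192.168.10.70:20160

# Bring it back later once the cause has been found.
tiup cluster start <cluster-name> -N 192.168.10.70:20160
```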

| username: anafin | Original post link

Where can I see the heatmap?

| username: TiDBer_jYQINSnf | Original post link

Looking at the logs, there are many “key is locked” messages. Refer to the following post:

| username: zengchaojie | Original post link

This is the TiKV monitoring for the 3 hours of unavailability.

| username: anafin | Original post link

This analysis doesn’t explain why the cluster returned to normal after stopping this KV node; it doesn’t seem to be the root cause.
Additional information:

Compared with the other KV nodes, this node has a better CPU configuration. During the failure its memory was normal, disk usage was below 60%, and IO utilization, IO volume, and network traffic all decreased (which I take as expected, since overall traffic had dropped). The anomaly: its CPU usage was much higher than that of the other six healthy KV nodes.

| username: TiDBer_jYQINSnf | Original post link

Let me take a look at the detailed logs that were exported.

| username: TiDBer_jYQINSnf | Original post link

Looking at 15:00, Raft write requests were very high and write latency also increased. Was there any change in your business traffic at that time?
Also, did the machine's IO behavior change at that point?
Read requests on the .70 machine are also very high and unbalanced. Is there a hotspot?
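
One way to check for a hotspot is pd-ctl's hot statistics, for example through TiUP; the PD address below is a placeholder:

```shell
# Current read and write hotspots as seen by PD.
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 hot read
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 hot write

# Per-store hot statistics, useful for spotting an unbalanced store.
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 hot store
```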

| username: zengchaojie | Original post link

There hasn’t been much change in the business. This is the machine’s monitoring: Prometheus monitoring_2022-07-26T07_31_22.362Z.json (1.1 MB)

| username: TiDBer_jYQINSnf | Original post link

I see a lot of “transport full” errors.
Which store ID is this? TiKV keeps failing to send messages to it.

| username: TiDBer_jYQINSnf | Original post link

From the above, it looks like .53 can’t be reached, right? Check what’s going on with .53.

| username: songxuecheng | Original post link

Check store 1422582; it seems to have a problem.

“Avoid sending requests to unhealthy TiKV nodes to improve availability” went into version 5.4.2, so consider upgrading.
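
If you do upgrade, the usual TiUP flow is roughly the following; the cluster name is a placeholder, and it's worth reviewing the 5.4.2 release notes first:

```shell
# Make sure the tiup cluster component itself is up to date, then upgrade in place.
tiup update cluster
tiup cluster upgrade <cluster-name> v5.4.2
```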

| username: zengchaojie | Original post link

It should be a manual restart.

| username: zengchaojie | Original post link

How do you check the store_id mentioned in the reply above?

| username: TiDBer_jYQINSnf | Original post link

The pd-ctl store command shows which TiKV corresponds to the store ID.
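
For example, to look up store 1422582 mentioned above (the PD address is a placeholder):

```shell
# Show the address, state, and labels of a single store by ID.
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 store 1422582

# Or list all stores and pull out the fields needed to map IDs to machines.
tiup ctl:v5.4.1 pd -u http://<pd-address>:2379 store | grep -E '"id"|"address"|"state_name"'
```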

| username: zengchaojie | Original post link

Found this table. According to the above reply, both 10.52 and 10.53 have issues?

| username: TiDBer_jYQINSnf | Original post link

Are .52 and .53 TiFlash? It looks like the problem is that they can’t be reached. As for whether that is the cause of the sudden CPU increase, it’s uncertain.

| username: zengchaojie | Original post link

52 is TiKV, 53 is TiFlash.