A TiKV Node is Slacking Off (Hardly Processing Requests)

translator_bot · June 24, 2024, 5:29am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 某个tikv节点全程划水(几乎不处理请求)

| username: panqiao

【TiDB Usage Environment】Production Environment / Testing / PoC
【TiDB Version】6.5.5
【Reproduction Path】What operations were performed to cause the issue
【Encountered Issue: Issue Phenomenon and Impact】
【Resource Configuration】Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
【Attachments: Screenshots/Logs/Monitoring】
The system is very slow, and there are many slow queries in the database. The CPU usage of the TiKV nodes exceeds 80%, but there is one TiKV node whose CPU is almost unused throughout, indicating a serious load imbalance. My TiKV nodes have 4 cores and are deployed on k8s.

Looking at tikv-2 alone, it is almost idle throughout

translator_bot · June 24, 2024, 5:30am

| username: panqiao | Original post link

I checked the corresponding TiKV logs, and there are no error logs, only info logs.

translator_bot · June 24, 2024, 5:30am

| username: zhaokede | Original post link

The load is severely unbalanced. Is there not a single leader above?

translator_bot · June 24, 2024, 5:30am

| username: TiDBer_jYQINSnf | Original post link

Check the number of regions and leaders.

translator_bot · June 24, 2024, 5:30am

| username: TiDBer_QKDdYGfz | Original post link

The distribution of leaders is uneven, they are all followers.

translator_bot · June 24, 2024, 5:30am

| username: zhaokede | Original post link

Use TiDB’s monitoring tools (such as Grafana) and diagnostic logs to check the load status, data distribution, and GC status of TiKV nodes.

translator_bot · June 24, 2024, 5:30am

| username: 舞动梦灵 | Original post link

Open the TiDB Dashboard cluster information and check the host. Look at this graph and observe the average CPU usage and memory usage. If one is particularly high and the other is particularly low, it is basically the leader and follower. If the 999 latency is very low, you don’t need to handle it.