TiKV Access Abnormal, Various Monitoring Metrics Abnormal

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiKV访问异常,监控各项指标异常

| username: TiDBer_wX9akOFm

[TiDB Usage Environment] Production Environment
[TiDB Version] v7.5.0
[Reproduction Path] None
[Encountered Problem: Problem Phenomenon and Impact] TiKV access was abnormal, and monitoring shows a sudden drop in various TiKV metrics such as CPU and IO. We currently suspect region merge is the cause, but are not sure. 1. Where on the monitoring dashboards can we find evidence that the impact comes from merge? 2. If merge is the cause, is there any way to optimize it and reduce the impact? The current merge-related PD settings are:
"max-merge-region-keys": 200000,
"max-merge-region-size": 20,
"enable-cross-table-merge": "true",
"merge-schedule-limit": 8,
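For reference, a minimal sketch of how these merge settings and current merge activity could be inspected with pd-ctl; the PD address and the `tiup ctl` invocation below are assumptions, so adjust them to your deployment:

```shell
# Assumed PD endpoint; replace with the address of one of your PD nodes.
PD="http://127.0.0.1:2379"

# Show the scheduling configuration, including the merge-related settings above.
tiup ctl:v7.5.0 pd -u "$PD" config show

# List operators PD is currently running; merge-region operators appear here
# while merges are being scheduled.
tiup ctl:v7.5.0 pd -u "$PD" operator show

# List regions that contain no data; these are the main merge candidates.
tiup ctl:v7.5.0 pd -u "$PD" region check empty-region
```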
[Resource Configuration] Three physical machines each running 1 PD node and 2 TiKV nodes
[Attachments: Screenshots/Logs/Monitoring]
TiKV-Summary: Cluster



| username: yytest | Original post link

It is recommended to provide the underlying logs.

| username: 随缘天空 | Original post link

Isn’t reducing resource usage a good thing, as long as the cluster is running normally without any issues?

| username: tidb菜鸟一只 | Original post link

If you don't want region merging, you can turn off enable-cross-table-merge and adjust "max-merge-region-keys" and "max-merge-region-size" (setting both of them to 0 disables region merge entirely). The merge-schedule-limit parameter should already be quite small, but if you feel it still has an impact, you can reduce it further.
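A minimal sketch of how these adjustments could be applied with pd-ctl (the PD address is an assumption; note that setting merge-schedule-limit to 0 stops merge scheduling altogether):

```shell
PD="http://127.0.0.1:2379"   # assumed PD endpoint

# Stop regions of different tables from being merged with each other.
tiup ctl:v7.5.0 pd -u "$PD" config set enable-cross-table-merge false

# Lower the number of concurrent merge tasks to reduce the scheduling load;
# setting this to 0 disables merge scheduling altogether.
tiup ctl:v7.5.0 pd -u "$PD" config set merge-schedule-limit 4
```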

| username: Ming | Original post link

Does the application side report any errors when access is abnormal?

| username: 小龙虾爱大龙虾 | Original post link

How did you determine that the impact came from merge?

| username: zhaokede | Original post link

Are data queries working normally?

| username: TiDBer_QYr0vohO | Original post link

Take a look at the TiKV logs.
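For example, a rough sketch of how to scan the logs for merge activity around the time of the drop; the log paths below assume a default tiup deployment, so adjust them to yours:

```shell
# Assumed default log locations for a tiup deployment.
TIKV_LOG=/tidb-deploy/tikv-20160/log/tikv.log
PD_LOG=/tidb-deploy/pd-2379/log/pd.log

# Merge activity usually leaves INFO lines mentioning "merge" in both logs.
grep -i "merge" "$TIKV_LOG" | tail -n 50
grep -i "merge" "$PD_LOG"   | tail -n 50

# Also worth scanning for slow or stalled writes in the same time window.
grep -iE "slow|stall" "$TIKV_LOG" | tail -n 50
```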

| username: TiDBer_wX9akOFm | Original post link

The TiKV logs are all INFO messages with no obvious errors; some examples are shown below, and the rest are similar. I am not sure where on the monitoring panels to look for clues.

| username: TiDBer_wX9akOFm | Original post link

When the metrics suddenly drop to the bottom and then recover like this, it looks just like the cluster being stuck and unresponsive. However, no application error logs were retained, so I want to find some evidence on the monitoring panels.

| username: TiDBer_wX9akOFm | Original post link

OK, then there's no need to disable it. If this configuration is unlikely to affect performance, we'll keep it as is and see whether the issue can be reproduced. Thank you for the suggestion.

| username: TiDBer_wX9akOFm | Original post link

It's only a suspicion, because I happened to notice a decrease in empty regions. The logs don't show anything useful, and I'm not sure where else to look.
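If it helps, a minimal sketch for cross-checking that suspicion from the PD side (the PD address is an assumption); the PD Grafana dashboard should also have a Region health panel with an empty-region-count series, and merge-region operators show up under the Operator panels:

```shell
PD="http://127.0.0.1:2379"   # assumed PD endpoint

# Regions that contain no data (the merge candidates); the first lines of the
# output include the total count.
tiup ctl:v7.5.0 pd -u "$PD" region check empty-region | head -n 5

# Operators PD is scheduling right now; merge-region entries mean merges are
# in progress at this moment.
tiup ctl:v7.5.0 pd -u "$PD" operator show
```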

| username: tidb菜鸟一只 | Original post link

Please share the topology again; this isn't very clear. The main thing to check is whether TiDB and TiKV are deployed on the same machines. If they are, check the memory usage.
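A quick way to check this, assuming the cluster is managed by tiup (the cluster name below is hypothetical):

```shell
# Replace "mycluster" with your actual cluster name (see `tiup cluster list`).
tiup cluster display mycluster

# If TiDB and TiKV instances share the same host IPs in the output, they are
# co-deployed; in that case check memory usage on those hosts, for example:
free -h
```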

| username: 友利奈绪 | Original post link

Can't tell from this.

| username: Hacker_6ASfgBFe | Original post link

Looking at the monitoring, your region count hasn't actually dropped much. I'm on version 5.4 and hadn't enabled region merging before; I only enabled it the day before yesterday, and the region count has since dropped by 600,000. Resource usage hasn't dropped noticeably either, so the effect is barely visible. My cluster has more than 4 million regions in total.