Miss-peer-region-count does not decrease

translator_bot · June 22, 2024, 11:59am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: miss-peer-region-count不下降 (miss-peer-region-count does not decrease)

| username: 张小凉1

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.5
[Encountered Issue: Phenomenon and Impact]
The cluster suddenly shows a large miss-peer-region-count.
The number of regions on each node is balanced.
[Resource Configuration]
3 TiKV nodes, 3 PD nodes, 3 TiDB server nodes
[Attachments: Screenshots/Logs/Monitoring]

(Enterprise WeChat screenshot not preserved.)

Please advise, experts.

| username: 裤衩儿飞上天

Check the status of each TiKV node.

| username: 张小凉1

All TiKV nodes are in a normal state.

| username: 裤衩儿飞上天

What does the cluster topology look like?
Are there any disk errors?
Is any scheduling stuck?
Are there any errors in the TiKV logs?

| username: 张小凉1

Cluster topology: (screenshot not preserved)

| username: 张小凉1

There should be no disk errors.
How can I check the scheduling information?

| username: 裤衩儿飞上天

Check the following:
Grafana monitoring: pd-scheduler
OS logs on each TiKV node
PD logs
Overall network conditions
Grafana monitoring: pd-heartbeat
Output of tiup cluster display XXX

| username: 张小凉1

PD log:
[WARN] [util.go:163] ["apply request took too long"] [took=157.598105ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/topology/tidb/\" range_end:\"/topology/tidb0\" "] [response="range_response_count:6 size:1059"]

Screenshots of pd-heartbeat, pd-scheduler, and the tiup cluster display output (not preserved).
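The PD log line above is a warning from the etcd server embedded in PD: a read of the /topology/tidb/ keys took about 157 ms against an expected 100 ms. A single occurrence may not explain the miss-peer backlog on its own, so one way to judge it is to count how often it appears over the incident window. This is a minimal sketch, assuming the bracketed [took=...] layout shown in the post; slow_apply_stats is a hypothetical helper, and the sample line is trimmed from the post.

```python
import re

# Go-style duration units converted to milliseconds.
_UNIT_TO_MS = {"µs": 0.001, "us": 0.001, "ms": 1.0, "s": 1000.0}

# Assumes the field layout from the post:
# ["apply request took too long"] [took=157.598105ms]
SLOW_APPLY = re.compile(
    r"apply request took too long.*?\[took=([\d.]+)(µs|us|ms|s)\]"
)


def slow_apply_stats(lines):
    """Return (count, worst_ms) for log lines carrying the slow-apply warning."""
    took_ms = []
    for line in lines:
        match = SLOW_APPLY.search(line)
        if match:
            value, unit = match.groups()
            took_ms.append(float(value) * _UNIT_TO_MS[unit])
    return len(took_ms), max(took_ms, default=0.0)


# Sample trimmed from the PD log line quoted in the thread.
sample = ('[WARN] [util.go:163] ["apply request took too long"] '
          '[took=157.598105ms] [expected-duration=100ms] '
          '[prefix="read-only range "]')
count, worst = slow_apply_stats([sample, "[INFO] an unrelated line"])
print(count, round(worst, 3))  # 1 157.598
```

Against a real file, pass an open handle, e.g. `with open("pd.log", encoding="utf-8") as f: print(slow_apply_stats(f))`, where pd.log stands for your own PD log path.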

| username: h5n1

Use pd-ctl region check miss-peer to see which regions are affected, then use pd-ctl region xx (where xx is a region ID) to check the status of each one.
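To get an exact count and a per-region view, one option is to save the JSON that pd-ctl prints and summarize it. This is a minimal sketch, assuming the output has the shape {"count": N, "regions": [...]} with "id", "peers", and optional "pending_peers" fields on each region; verify this against your PD version. The file could be produced with something like tiup ctl:v<version> pd -u http://<pd-address>:2379 region check miss-peer > miss_peer.json, where the version, address, and file name are placeholders. The names summarize_region_check and RegionSummary, and the default of 3 expected replicas, are hypothetical choices for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class RegionSummary:
    region_id: int
    peer_count: int
    pending_peer_ids: list[int]


def summarize_region_check(raw_json: str, expected_replicas: int = 3):
    """Summarize the JSON printed by `pd-ctl region check <state>`.

    Assumed shape: {"count": N, "regions": [{"id": ..., "peers": [...],
    "pending_peers": [...]}, ...]}. Returns (total, summaries, short), where
    `short` holds regions with fewer peers than `expected_replicas`.
    """
    data = json.loads(raw_json)
    regions = data.get("regions") or []
    summaries = [
        RegionSummary(
            region_id=r["id"],
            peer_count=len(r.get("peers") or []),
            pending_peer_ids=[p["id"] for p in r.get("pending_peers") or []],
        )
        for r in regions
    ]
    # Prefer PD's own "count"; fall back to the number of listed regions.
    total = data.get("count", len(regions))
    short = [s for s in summaries if s.peer_count < expected_replicas]
    return total, summaries, short


# Hypothetical sample: region 10 has only 2 of 3 peers, and peer 12 is pending.
sample = """{"count": 2, "regions": [
  {"id": 10, "peers": [{"id": 11, "store_id": 1}, {"id": 12, "store_id": 2}],
   "pending_peers": [{"id": 12, "store_id": 2}]},
  {"id": 20, "peers": [{"id": 21, "store_id": 1}, {"id": 22, "store_id": 2},
                       {"id": 23, "store_id": 3}]}
]}"""
total, summaries, short = summarize_region_check(sample)
print(total)                          # 2
print([s.region_id for s in short])   # [10]
print(summaries[0].pending_peer_ids)  # [12]
```

Running this on outputs saved at different times and comparing the totals shows whether the backlog is shrinking at all.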

| username: 张小凉1

Viewing the same region in two different ways shows different “pending_peers”. Is this the problem?

The miss-peer count in config show is very large; I’m not sure how to count the exact number.

| username: 张小凉1

Besides miss-peer, there are also many regions in the empty-region and undersized-region states.
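When several states are flagged at once, it can help to see whether the miss-peer regions are the same ones reported as empty or undersized. This is a minimal sketch, assuming each state's output was saved from pd-ctl region check <state> in the same {"count": N, "regions": [...]} JSON shape; region_ids and overlap_report are hypothetical helper names, and the sample data is made up.

```python
import json


def region_ids(raw_json: str) -> set[int]:
    """Region IDs listed in one saved `region check <state>` output (assumed shape)."""
    data = json.loads(raw_json)
    return {r["id"] for r in data.get("regions") or []}


def overlap_report(outputs: dict[str, str]) -> dict[str, int]:
    """For every state other than miss-peer, count regions it shares with miss-peer."""
    ids = {state: region_ids(raw) for state, raw in outputs.items()}
    miss = ids.get("miss-peer", set())
    return {
        state: len(miss & other)
        for state, other in ids.items()
        if state != "miss-peer"
    }


# Hypothetical saved outputs keyed by check state.
outputs = {
    "miss-peer": '{"count": 3, "regions": [{"id": 1}, {"id": 2}, {"id": 3}]}',
    "empty-region": '{"count": 2, "regions": [{"id": 2}, {"id": 9}]}',
    "undersized-region": '{"count": 1, "regions": [{"id": 3}]}',
}
report = overlap_report(outputs)
print(report)  # {'empty-region': 1, 'undersized-region': 1}
```

A large overlap would suggest the missing replicas are concentrated in small or empty regions rather than spread across the whole dataset.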

| username: h5n1

Check whether the disks and CPUs on the TiKV nodes are busy.

| username: 张小凉1

Utilization is low.

| username: h5n1

Check the apply latency in the Grafana TiKV-Details dashboard, under Raft propose.

| username: 张小凉1

(This reply was a screenshot, which was not preserved.)

| username: TiDBer_pkQ5q1l0

Does PD report any timeout errors for scheduling?

| username: 裤衩儿飞上天

Please upload the PD and TiKV logs so we can troubleshoot.
When providing monitoring data, it’s best to cover the same time period as the issue.

| username: h5n1

Can you post a screenshot of the miss-peer region monitoring, covering the period from before the issue started until now?