Error in Cluster Status Check During Cluster Upgrade

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 升级集群时集群状态检测报错

| username: SoHuDrgon

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.0

[Encountered Problem: Problem Phenomenon and Impact]
I want to upgrade TiDB to the latest version, but the cluster status check reports an error:
Checking region status of the cluster loantidb…
Regions are not fully healthy: 14 pending-peer
Please fix unhealthy regions before other operations.
I have attached the output of pd-ctl region check pending-peer along with monitoring screenshots.
How can I clear the pending-peer status? Thank you, everyone.

Restarting TiKV did not help; the issue remains.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]


pending_peer14.txt (17.8 KB)
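
For reference, a minimal way to list the unhealthy peers from pd-ctl (the same check used later in this thread), assuming PD listens on 127.0.0.1:2379 and the cluster runs v6.1.0:

tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 region check pending-peer
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 region check down-peer

Both commands print the affected regions and their peers, which makes it easier to see whether the same stores are always involved.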

| username: WalterWj | Original post link

A pending peer generally means there is too much pressure on TiKV, causing queuing and blockage.

This situation is uncommon. Could it be that compaction was abnormally terminated?

| username: SoHuDrgon | Original post link

What does it mean? I don’t quite understand.

| username: WalterWj | Original post link

You can check a post from 2020, which is similar to your situation: tikv下线产生learner-peer 和 pending-peer - #74,来自 kimi - TiDB 的问答社区

It means that your situation is unexpected. Most likely, there is something abnormal with the cluster.

| username: WalterWj | Original post link

Has the cluster ever used Lightning, or have any TiKV compaction-related configurations been modified? Also, is the cluster under heavy load? If everything is generally normal, pending peers should be rare; if a peer stays pending for a long time, it eventually becomes a down peer.

| username: WalterWj | Original post link

I took a look at the monitoring, and there are indeed down peers in your cluster.

| username: SoHuDrgon | Original post link

Does tikv-ctl need to be installed on every TiKV, or can it be executed on just one of them? Thanks.

| username: SoHuDrgon | Original post link

We haven't used Lightning or modified anything; the development side just runs frequent batch deletions with the batch delete command. That is probably the cause, but it's unclear how long these pending peers will take to disappear.

| username: SoHuDrgon | Original post link

Also, the set of regions returned by region check pending-peer keeps changing over time, but the count is always 14, and I don't know what is happening internally. The cluster only imports data at scheduled times each day and is idle most of the time.

| username: WalterWj | Original post link

You can check the monitoring, for example, look at the past 3 days, and see if there are any time periods where down and pending peers disappear. If there are, choose that time period to upgrade.

Is there any special configuration for the current TiKV? You can also send it over for us to take a look.

Or manually compact the TiKV cluster: TiKV Control 使用说明 | PingCAP 文档中心
See whether that resolves the issue, but do it during non-business hours.
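
For reference, a rough sketch of what a manual compaction could look like, assuming tikv-ctl is invoked through tiup and the addresses below are placeholders; check the TiKV Control documentation linked above for the exact options in your version:

tiup ctl:v6.1.0 tikv --host 127.0.0.1:20160 compact -d kv -c default
tiup ctl:v6.1.0 tikv --host 127.0.0.1:20160 compact -d kv -c write
tiup ctl:v6.1.0 tikv --pd 127.0.0.1:2379 compact-cluster -d kv -c default

The first two commands compact a single TiKV store online (default and write column families); the last asks every store in the cluster to compact, which is heavier and best kept to off-peak hours.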

| username: WalterWj | Original post link

What do you use to import the data? TiDB Lightning 故障处理 | PingCAP 文档中心

If you run into the situation described there, it can also lead to pending or even down peers.

| username: SoHuDrgon | Original post link

Monitoring for the past 7 days, as shown in the screenshot.
The configuration of TiDB is as follows:
tidb.yaml (6.7 KB)
Please help check if there is anything configured incorrectly.

| username: SoHuDrgon | Original post link

The data import is handled by a Java project we developed: it fetches data files from other sources, cleans the data, and generates a CSV file of roughly 300 MB once a day. The project then reads the CSV file, generates SQL statements, and inserts the data into TiDB.

| username: WalterWj | Original post link

I took a look and saw that many TiKV configurations have been adjusted. Generally, it is not recommended to adjust raft- and rocksdb-related configurations yourself. Sigh.

| username: WalterWj | Original post link

split-region-check-tick-interval – there is no need to set this to 300.

  • The interval for checking whether a region needs to be split, 0 means disabled.
  • Default value: 10s
  • Minimum value: 0

Also, you have disabled auto compaction. Change that back to re-enable compaction, and then you can use tikv-ctl to manually compact the cluster.

It is not recommended to change these settings.

| username: WalterWj | Original post link

Compaction cannot be turned off. Compaction is the reorganization and compression of the underlying data files; if you turn it off, even garbage collection (GC) cannot proceed.

| username: SoHuDrgon | Original post link

Okay, thank you. I’ll make the change and restart TiDB right away, then observe the results.
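
For reference, a minimal sketch of applying such a change with tiup, assuming the cluster is named loantidb (as in the check output above) and that only the TiKV role needs to be reloaded for these parameters:

tiup cluster edit-config loantidb
tiup cluster reload loantidb -R tikv

edit-config opens the cluster topology so the two tikv parameters can be corrected, and reload then performs a rolling restart of the TiKV nodes to apply them.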

| username: SoHuDrgon | Original post link

I changed the two parameters you mentioned:
raftstore.split-region-check-tick-interval: 30s
rocksdb.defaultcf.disable-auto-compactions: false
After restarting TiDB, the monitoring interface still shows everything else, but the pending-peer and related metrics are not showing:


:crazy_face:

| username: WalterWj | Original post link

It seems like there's an issue with your monitoring; the compaction shouldn't finish that quickly.

| username: SoHuDrgon | Original post link

It looks like it’s working now. I restarted Prometheus.
Screenshot 2023-01-10 at 1.25.22 PM
Then I checked the pending-peer:

tiup ctl:v6.1.0 pd -i -u http://127.0.0.1:2379

Starting component ctl: /root/.tiup/components/ctl/v6.1.0/ctl pd -i -u http://127.0.0.1:2379
» region check pending-peer
{
"count": 0,
"regions":
}
Finally, I checked the cluster:
Checking region status of the cluster loantidb…
All regions are healthy.
Thank you very much.

The upgrade was also completed quickly:
Screenshot 2023-01-10 at 1.46.54 PM
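
For completeness, a rough sketch of how the upgrade itself is typically driven with tiup, assuming the same cluster name; the target version is left as a placeholder since the thread does not name it:

tiup cluster upgrade loantidb <target-version>
tiup cluster display loantidb

The region status check quoted at the top of the thread is the kind of pre-check tiup performs before such an operation, and display can be used afterwards to confirm the new version on every node.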