Upgrading TiDB Cluster from v5.4.0 to v6.5.0: Primary Cluster Check Passed, Secondary Cluster Check Error: Regions are not fully healthy: 17061 miss-peer

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb集群从v5.4.0升级到v6.5.0,主库集群check检查正常,从库集群检查报错:Regions are not fully healthy: 17061 miss-peer

| username: vcdog

[TiDB Usage Environment] Pre-production environment
[TiDB Version] Upgraded from v5.4.0 to v6.5.0
[Reproduction Path] Currently, both the master and slave clusters have been upgraded to version v6.5.0. The check command executed on the master cluster returns normal results. However, an error occurs when checking the slave cluster: Regions are not fully healthy: 17061 miss-peer
[Encountered Problem: Phenomenon and Impact]

On the master cluster, execute the following command:
# tiup cluster check pre-tidb-cluster-01 --cluster
Checking region status of the cluster pre-tidb-cluster-01...
All regions are healthy.

On the slave cluster, execute the following command:
# tiup cluster check test-tidb-cluster-01 --cluster
Checking region status of the cluster test-tidb-cluster-01...
Regions are not fully healthy: 17061 miss-peer
Please fix unhealthy regions before other operations.

Grafana monitoring of PD Region health in the master cluster:

Grafana monitoring of PD Region health in the slave cluster:

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

How to solve this problem? Do we need to fix those unhealthy Regions?

| username: TI表弟 | Original post link

Two steps: find the region and remove the region.
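
A minimal sketch of the "find" step with pd-ctl (the version tag and PD address below are placeholders for this environment). Note that for a miss-peer Region the usual fix is to add the missing replica back rather than to remove the Region:

List all Regions that are currently missing a replica:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 region check miss-peer

If PD does not replenish a Region on its own, you can ask it to add the missing replica on a chosen store:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 operator add add-peer <region-id> <store-id>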

| username: vcdog | Original post link

There is another operational scenario that has not been mentioned:

  1. The primary cluster was upgraded last Friday.
  2. The secondary cluster was upgraded this Monday.
  3. So the two upgrades were about 3 days apart.

I don't know whether this gap is related to the missing Region peers.

| username: vcdog | Original post link

Wouldn’t removing the Region lose data?

| username: TI表弟 | Original post link

Here is the manual for finding and removing problematic regions.

| username: TI表弟 | Original post link

What does master-slave mean? Did you implement master-slave in TiDB?

| username: Kongdom | Original post link

Solution from @xfworld

Then don’t upgrade for now.

  1. Check if there are any regions with missing replicas in the existing cluster.
  2. If these regions with missing replicas are not important, consider manually removing them.
  3. Once the cluster status is normal, upgrade to 6.5.0.

Operation reference:
https://docs.pingcap.com/zh/tidb/stable/pd-control#region-check-miss-peer–extra-peer–down-peer–pending-peer–offline-peer–empty-region–hist-size–hist-keys

Removal method reference:

https://docs.pingcap.com/zh/tidb/stable/pd-control#恢复数据时寻找相关-region
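
As a hedged illustration of the "find the related Region" step in the reference above (addresses and the Region ID below are placeholders), you can first inspect a Region and then map it back to the table it belongs to before deciding how to handle it:

Show the peers and state of a single Region:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 region <region-id>

Ask the TiDB status port which table or index the Region belongs to:
# curl http://<tidb-ip>:10080/regions/<region-id>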

| username: TI表弟 | Original post link

A small amount of data may be lost.

| username: 裤衩儿飞上天 | Original post link

It is recommended to observe, as the missing peer will be automatically replenished. Multiple replica losses will need to be handled manually.
PS: Was the missing peer caused by the upgrade? Or was there a missing peer before the upgrade?
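
If you want to check whether PD is actually replenishing the missing peers, a rough sketch with pd-ctl (version tag and PD address are placeholders):

Confirm that no TiKV store is Down, Disconnected, or running out of space:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 store

Check the scheduling limits that control how fast missing replicas are added:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 config show | grep -E "replica-schedule-limit|region-schedule-limit"

See whether replica-adding operators are currently being scheduled:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 operator show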

| username: TI表弟 | Original post link

How did you upgrade?

| username: 考试没答案 | Original post link

Does the secondary (replica) cluster handle any business traffic? How large is the data volume?

| username: 考试没答案 | Original post link

Master-slave architecture: If the master is fine but the slave database loses data, you can consider rebuilding the slave database. This is feasible if the slave database does not handle any business operations.
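
If rebuilding the secondary cluster is acceptable, one possible outline (a sketch only, assuming BR is used for the copy; all addresses, the storage path, the changefeed ID, and the backup TS are placeholders):

Take a full backup of the primary cluster and note the BackupTS printed by BR:
# tiup br backup full --pd "<primary-pd-ip>:2379" --storage "s3://<bucket>/<path>"

Restore the backup into the emptied secondary cluster:
# tiup br restore full --pd "<secondary-pd-ip>:2379" --storage "s3://<bucket>/<path>"

Re-create the TiCDC changefeed from the primary, starting at the BackupTS so no changes are lost:
# cdc cli changefeed create --server=http://<cdc-ip>:8300 --sink-uri="mysql://<user>:<password>@<secondary-tidb-ip>:4000/" --changefeed-id="<changefeed-id>" --start-ts=<backup-ts>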

| username: 考试没答案 | Original post link

Please provide the details of your upgrade process. Let’s see if the issue lies in the process.

| username: TI表弟 | Original post link

I think the key is to find out why the replica lost data and what the differences are between the primary and replica databases.
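
To see whether the difference comes from the replication link itself, a quick check of the TiCDC changefeed state is a reasonable starting point (a sketch, assuming the cdc binary is available; the server address and changefeed ID are placeholders):

List the changefeeds with their state and checkpoint:
# cdc cli changefeed list --server=http://<cdc-ip>:8300

Query one changefeed in detail to see its checkpoint TS and any error:
# cdc cli changefeed query --server=http://<cdc-ip>:8300 --changefeed-id="<changefeed-id>"

For a deeper, table-by-table comparison between the two clusters, sync-diff-inspector can be used as well.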

| username: 考试没答案 | Original post link

Yes, you can look into it more deeply. Maybe you’ll find a bug. Improve your skills!

| username: TI表弟 | Original post link

Strongly agree with the expert’s opinion, making progress every day.

| username: TI表弟 | Original post link

If this were going to recover automatically, it would have finished long ago.

| username: vcdog | Original post link

Yes, two TiDB clusters are set up with master-slave data synchronization through TiCDC, achieving read-write separation. OLTP operations are completed in the primary cluster, while OLAP report queries are handled in the secondary cluster.

| username: 考试没答案 | Original post link

Why not consider TiFlash in your architecture? Does your company have any special requirements?