Upgrading TiDB Cluster from v5.4.0 to v6.5.0: Primary Cluster Check Passed, Secondary Cluster Check Error: Regions are not fully healthy: 17061 miss-peer

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb集群从v5.4.0升级到v6.5.0,主库集群check检查正常,从库集群检查报错:Regions are not fully healthy: 17061 miss-peer

| username: vcdog

[TiDB Usage Environment] Pre-production environment
[TiDB Version] Upgraded from v5.4.0 to v6.5.0
[Reproduction Path] Currently, both the master and slave clusters have been upgraded to version v6.5.0. The check command executed on the master cluster returns normal results. However, an error occurs when checking the slave cluster: Regions are not fully healthy: 17061 miss-peer
[Encountered Problem: Phenomenon and Impact]

On the master cluster, execute the following command:
# tiup cluster check pre-tidb-cluster-01 --cluster
Checking region status of the cluster pre-tidb-cluster-01...
All regions are healthy.

On the slave cluster, execute the following command:
# tiup cluster check test-tidb-cluster-01 --cluster
Checking region status of the cluster test-tidb-cluster-01...
Regions are not fully healthy: 17061 miss-peer
Please fix unhealthy regions before other operations.

Grafana monitoring of PD Region health in the master cluster:

Grafana monitoring of PD Region health in the slave cluster:

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

How to solve this problem? Do we need to fix those unhealthy Regions?

| username: TI表弟 | Original post link

Two steps: find the region and remove the region.
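
A minimal sketch of the "find" step with pd-ctl (the version tag and PD address below are placeholders for this environment). Note that for a miss-peer Region the usual fix is to add the missing replica back rather than to remove the Region:

List all Regions that are currently missing a replica:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 region check miss-peer

If PD does not replenish a Region on its own, you can ask it to add the missing replica on a chosen store:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 operator add add-peer <region-id> <store-id>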

| username: vcdog | Original post link

There is another operational scenario that has not been mentioned:

  1. The primary cluster was upgraded last Friday.
  2. The secondary cluster was upgraded this Monday.
  3. So the two upgrades were about 3 days apart.

I don't know whether this gap is related to the missing Region peers.

| username: vcdog | Original post link

Wouldn’t removing the Region lose data?

| username: TI表弟 | Original post link

Here is the manual for finding and removing problematic regions.

| username: TI表弟 | Original post link

What does master-slave mean? Did you implement master-slave in TiDB?

| username: Kongdom | Original post link

Solution from @xfworld

Then don’t upgrade for now.

  1. Check if there are any regions with missing replicas in the existing cluster.
  2. If these regions with missing replicas are not important, consider manually removing them.
  3. Once the cluster status is normal, upgrade to 6.5.0.

Operation reference:
https://docs.pingcap.com/zh/tidb/stable/pd-control#region-check-miss-peer–extra-peer–down-peer–pending-peer–offline-peer–empty-region–hist-size–hist-keys

Removal method reference:

https://docs.pingcap.com/zh/tidb/stable/pd-control#恢复数据时寻找相关-region
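
As a hedged illustration of the "find the related Region" step in the reference above (addresses and the Region ID below are placeholders), you can first inspect a Region and then map it back to the table it belongs to before deciding how to handle it:

Show the peers and state of a single Region:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 region <region-id>

Ask the TiDB status port which table or index the Region belongs to:
# curl http://<tidb-ip>:10080/regions/<region-id>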

| username: TI表弟 | Original post link

A small amount of data may be lost.

| username: 裤衩儿飞上天 | Original post link

It is recommended to observe, as the missing peer will be automatically replenished. Multiple replica losses will need to be handled manually.
PS: Was the missing peer caused by the upgrade? Or was there a missing peer before the upgrade?
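
If you want to check whether PD is actually replenishing the missing peers, a rough sketch with pd-ctl (version tag and PD address are placeholders):

Confirm that no TiKV store is Down, Disconnected, or running out of space:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 store

Check the scheduling limits that control how fast missing replicas are added:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 config show | grep -E "replica-schedule-limit|region-schedule-limit"

See whether replica-adding operators are currently being scheduled:
# tiup ctl:v6.5.0 pd -u http://<pd-ip>:2379 operator show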

| username: TI表弟 | Original post link

How did you upgrade?

| username: 考试没答案 | Original post link

Does the secondary (replica) cluster handle any business traffic? How large is the data volume?

| username: 考试没答案 | Original post link

Master-slave architecture: If the master is fine but the slave database loses data, you can consider rebuilding the slave database. This is feasible if the slave database does not handle any business operations.
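
If rebuilding the secondary cluster is acceptable, one possible outline (a sketch only, assuming BR is used for the copy; all addresses, the storage path, the changefeed ID, and the backup TS are placeholders):

Take a full backup of the primary cluster and note the BackupTS printed by BR:
# tiup br backup full --pd "<primary-pd-ip>:2379" --storage "s3://<bucket>/<path>"

Restore the backup into the emptied secondary cluster:
# tiup br restore full --pd "<secondary-pd-ip>:2379" --storage "s3://<bucket>/<path>"

Re-create the TiCDC changefeed from the primary, starting at the BackupTS so no changes are lost:
# cdc cli changefeed create --server=http://<cdc-ip>:8300 --sink-uri="mysql://<user>:<password>@<secondary-tidb-ip>:4000/" --changefeed-id="<changefeed-id>" --start-ts=<backup-ts>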

| username: 考试没答案 | Original post link

Please provide the details of your upgrade process. Let’s see if the issue lies in the process.

| username: TI表弟 | Original post link

I think the key is to find out why the replica lost data and what the differences are between the primary and replica databases.
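
To see whether the difference comes from the replication link itself, a quick check of the TiCDC changefeed state is a reasonable starting point (a sketch, assuming the cdc binary is available; the server address and changefeed ID are placeholders):

List the changefeeds with their state and checkpoint:
# cdc cli changefeed list --server=http://<cdc-ip>:8300

Query one changefeed in detail to see its checkpoint TS and any error:
# cdc cli changefeed query --server=http://<cdc-ip>:8300 --changefeed-id="<changefeed-id>"

For a deeper, table-by-table comparison between the two clusters, sync-diff-inspector can be used as well.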

| username: 考试没答案 | Original post link

Yes, you can look into it more deeply. Maybe you’ll find a bug. Improve your skills!

| username: TI表弟 | Original post link

Strongly agree with the expert’s opinion, making progress every day.

| username: TI表弟 | Original post link

If this were going to recover automatically, it would have finished long ago.

| username: vcdog | Original post link

Yes, two TiDB clusters are set up with master-slave data synchronization through TiCDC, achieving read-write separation. OLTP operations are completed in the primary cluster, while OLAP report queries are handled in the secondary cluster.

| username: 考试没答案 | Original post link

Why not consider TiFlash in your architecture? Does your company have any special requirements?