After a PD node crashed in the afternoon and was repaired, BR backup is now extremely slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb下午挂掉一个pd,修复后,现在发现br备份非常慢

| username: MartinTsang

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4.2
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Problem Phenomenon and Impact]
A PD node in our TiDB cluster went down in the afternoon. After it was repaired, we found that BR backup is extremely slow: backing up just 1% of a few dozen GB of data takes almost 30 minutes, far slower than before. Restarting the cluster did not help, and server resources on all the nodes look normal. I'm not sure whether the WARN entries in the BR log point to the problem.
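For reference, a rough sketch of how such a backup is typically run with the standalone br binary so the WARN lines end up in a dedicated log file; the PD address, storage path, and rate limit below are placeholders, not the values actually used here:

```
# Hedged sketch: full backup with an explicit log file so WARN/ERROR lines
# can be reviewed afterwards. <pd-host>, the storage path, and the rate
# limit are placeholders.
br backup full \
  --pd "<pd-host>:2379" \
  --storage "local:///data/br-backup" \
  --ratelimit 128 \
  --log-file /tmp/br_backup.log
```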

[Resource Configuration]

[Attachments: Screenshots/Logs/Monitoring]

Latest Supplement:

These two offline nodes belong to another cluster. At the time, a single-node TiDB was set up on 3.88 for testing and was later removed. However, the TiDB instance 10.35.3.88:14000 and the TiKV instance 10.35.3.88:21160 still show up here, yet they are not visible in tiup cluster display and cannot be removed with the scale-in command:

| username: xfworld | Original post link

Check which TiKV node instance Store 17416 belongs to.

It would be best to use Grafana to see if the various metrics of the cluster are normal.
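A minimal way to do the store lookup, assuming pd-ctl is run through tiup and with the PD address as a placeholder:

```
# Hedged sketch: look up store 17416 to see which address it maps to.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store 17416
```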

| username: MartinTsang | Original post link

This store is at 3.88:21160, not in my cluster. My cluster’s TiKV only has:
3.131:20160
3.87:20160
3.88:20160

This 3.88:21160 was a temporary test on 3.88, a single-node TiDB cluster, which has already been removed.

| username: ffeenn | Original post link

This TiKV store hasn’t been properly cleared.

| username: xfworld | Original post link

Stop messing around in production; prioritize fixing the environment issues first…

| username: ffeenn | Original post link

:cold_face::cold_face::cold_face: Check the status of all TiKV stores via SQL queries or pd-ctl, and clean up any that are abnormal. Since this is a production environment, make sure you have a proper backup first, either a cold backup or, if you are on virtual machines, image snapshots.
PD Control User Guide | PingCAP Documentation Center
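A hedged sketch of the kind of check and cleanup being suggested (the PD address and store ID are placeholders; confirm a store really is orphaned before deleting anything):

```
# List every store and its state (Up / Offline / Tombstone).
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store

# The same information is available via SQL:
#   SELECT STORE_ID, ADDRESS, STORE_STATE_NAME
#   FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS;

# If a leftover store is confirmed to be orphaned, mark it for deletion,
# then clear it once it reaches Tombstone state.
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store delete 17416
tiup ctl:v5.4.2 pd -u http://<pd-host>:2379 store remove-tombstone
```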

| username: MartinTsang | Original post link

Those two strange nodes don't show up in tiup cluster display. I only found them after you mentioned monitoring and I went to check it.

| username: MartinTsang | Original post link

Thank you, I will take a look at the document you sent~

| username: MartinTsang | Original post link

However, they are not in the current cluster list. How can I delete those 2 problematic nodes? :joy:

| username: xfworld | Original post link

How did you take these two nodes offline before?

Try using tiup’s cleanup capability.
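A hedged sketch of what that could look like; the cluster name is a placeholder, and prune only removes instances that PD already marks as Tombstone and that exist in the tiup topology:

```
# Remove Tombstone instances from the tiup-managed topology.
tiup cluster prune <cluster-name>

# Then confirm what tiup still knows about.
tiup cluster display <cluster-name>
```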

| username: h5n1 | Original post link

The cluster you tested was probably also using the production PD. First, refer to the following to clean up the invisible TiKV store. Note that the normal TiKV and the abnormal TiKV on 3.88 are on the same IP (only the port differs), which may affect the operation.

You can try adding the 14000 TiDB server to .tiup/storage/cluster/clusters/{cluster-name}/meta.yaml and then scaling it in.
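A hedged sketch of that idea; the field names below follow the usual meta.yaml layout but should be checked against an existing tidb_servers entry in your own file, and the paths and status port are placeholders:

```yaml
# .tiup/storage/cluster/clusters/<cluster-name>/meta.yaml (excerpt)
topology:
  tidb_servers:
    - host: 10.35.3.88
      port: 14000
      status_port: 14080                    # assumption: whatever status port the test node used
      deploy_dir: /tidb-deploy/tidb-14000   # placeholder path
      log_dir: /tidb-deploy/tidb-14000/log  # placeholder path
```

After saving, something like `tiup cluster scale-in <cluster-name> -N 10.35.3.88:14000` could be attempted; since the process no longer exists, `--force` may be required.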

| username: MartinTsang | Original post link

Those two nodes came from a temporary single-node TiDB cluster (call it cluster B) that I deployed on 3.88 for testing. I later destroyed that whole cluster, but through monitoring I found the two nodes were still left behind in the production cluster A.

| username: tidb狂热爱好者 | Original post link

Production and testing use the same subnet.

| username: WalterWj | Original post link

If the data volume is small, I recommend using Dumpling.
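A hedged sketch of a Dumpling export for a small dataset; the connection details, thread count, and output path are placeholders:

```
# Export all databases as SQL files, split into ~256 MiB chunks.
tiup dumpling -u root -p '<password>' -h <tidb-host> -P 4000 \
  --filetype sql -t 8 -F 256MiB -o /data/dumpling-backup
```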

| username: MartinTsang | Original post link

Okay, I will study this document first, thank you.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.