Issues Found During Testing of TiKV Log Backup Feature in Cluster Mode for Version v6.4.0

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: v6.4.0版本的TiKV日志备份功能,在集群模式下测试发现一些问题。

| username: TiDBer_YEqOV3Hn

【TiDB Usage Environment】
Testing
【TiDB Version】
v6.4.0
【Reproduction Path】
【Encountered Issues: Problem Phenomenon and Impact】
Through testing, the following issues were found with the log backup feature in TiDB v6.4.0 (a sketch of how such a log backup task can be started appears at the end of this post):

  1. In cluster mode, log backup to local storage failed: only the backup directory was created and no backup files were produced. Single-node backup to local storage works fine.
  2. Six-node cluster (3 PD, 6 TiKV, 6 TiDB): after stopping one TiKV node, the log backup checkpoint gap grows slightly (to about five minutes), but it catches up within ten minutes and shows no significant growth afterwards. Full + incremental recovery works fine.
  3. Same six-node cluster (3 PD, 6 TiKV, 6 TiDB): after stopping two TiKV nodes, the gap keeps growing and had not caught up even after about two and a half hours (preliminary judgment: it cannot catch up).
  4. With log backup running normally, shutting down the external Ceph storage node produces no errors from the backup task, but the gap keeps growing while backup files are still written to the local cache. After the Ceph node is brought back up, the gap catches up. (Ceph was down for about ten minutes.)
【Resource Configuration】
6 cloud hosts, each 2C4G with a 50 GB disk
【Attachments: Screenshots/Logs/Monitoring】
The environment has since been torn down; only the terminal output of the log backup task from issue 3 was captured:
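
For anyone reproducing this kind of test, the following is a minimal sketch of starting a log backup task with `tiup br log start`, once to local storage and once to a Ceph cluster exposed through an S3-compatible (RGW) endpoint. A cluster runs a single log backup task at a time, so these are alternatives rather than parallel tasks, and the PD address, task names, paths, bucket, endpoint, and credentials are all placeholders, not values from the original test:

```shell
# Alternative 1: log backup to local disk on each TiKV node
# (hypothetical PD address and path).
tiup br log start \
    --task-name=pitr-local \
    --pd "10.0.0.1:2379" \
    --storage "local:///data/log-backup"

# Alternative 2: log backup to a Ceph cluster through its S3-compatible
# gateway; bucket, endpoint, and keys are placeholders.
tiup br log start \
    --task-name=pitr-ceph \
    --pd "10.0.0.1:2379" \
    --storage "s3://backup-bucket/log-backup?endpoint=http://ceph-rgw:7480&access-key=xxx&secret-access-key=yyy"
```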

| username: xfworld | Original post link

I didn’t understand… What issue are you trying to address… :joy:

| username: TiDBer_YEqOV3Hn | Original post link

Haha, there are actually three questions I want to ask:

  1. How do you manage log backup tasks? For example, how do you determine whether a task has gone wrong? (See the status-check sketch after this list.)
  2. If a cluster node is down (a TiKV node) but the cluster’s business reads and writes are normal, should the log backup also become abnormal (the gap not catching up for two and a half hours)?
  3. Does the current version still not support backing up a cluster’s logs to local storage?
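
On question 1, the state of a log backup task, including its checkpoint (from which the gap can be read), can be inspected with `br log status`. A minimal sketch, with a placeholder task name and PD address:

```shell
# Show every log backup task and its checkpoint (hypothetical PD address).
tiup br log status --pd "10.0.0.1:2379"

# Or query a single task by name; a checkpoint that lags far behind the
# current time means the log backup is falling behind.
tiup br log status --task-name=pitr-test --pd "10.0.0.1:2379"
```
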
| username: xfworld | Original post link

  1. Why back up logs?
  2. The cluster anomaly is related to the replica policy: if cluster scheduling cannot satisfy the replica policy, replicas fail. How do you plan to trace that?
  3. I don’t understand, why back up logs? Do you want to back up WAL?

| username: TiDBer_YEqOV3Hn | Original post link

  1. I am using the PITR feature in your official v6.4.0 release to implement full and incremental backups of the database.
  2. What does “the replica strategy cannot meet cluster scheduling, leading to replica failure” mean? I deployed the cluster on six machines: 3 PD, 6 TiKV, and 6 TiDB. If I shut down two of those machines in sequence, will that break the cluster’s replica scheduling strategy? Full backup is not a problem; it just takes a bit longer, and the cluster can still read and write normally during that period, so the replica scheduling strategy appears to be working, right?
  3. Not WAL.

Mainly to verify PITR, :blush:

| username: 裤衩儿飞上天 | Original post link

Provide the deployment details and check what was running on the machine that was shut down.

| username: xfworld | Original post link

  1. BR is the standard backup tool.
  2. The replica strategy is determined when the table is created; you haven’t described it clearly, so I can’t guess what it specifically looks like.
  3. In fact, a snapshot of the SST information is taken… As shown in the figure, the metadata and the actual row data are transferred to the storage (see the sketch below).
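
For reference, that snapshot (full) backup is taken with `br backup full`. A minimal sketch, with a placeholder PD address and a placeholder S3-compatible storage URI rather than the configuration from this thread:

```shell
# Snapshot (full) backup of the whole cluster to S3-compatible storage;
# PD address, bucket, endpoint, and keys are placeholders.
tiup br backup full \
    --pd "10.0.0.1:2379" \
    --storage "s3://backup-bucket/snapshot-20221130?endpoint=http://ceph-rgw:7480&access-key=xxx&secret-access-key=yyy"
```
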
| username: 我是咖啡哥 | Original post link

I’ll just quietly watch the experts chat.

| username: xfworld | Original post link

The key point is, I still don’t understand what issue you’re trying to express :joy:

| username: TiDBer_YEqOV3Hn | Original post link

The deployment was done directly by downloading the binary package from the official website, and the configurations were not modified. The machines that were shut down were running pd+tikv+tidb and tidb+tikv respectively.

| username: TiDBer_YEqOV3Hn | Original post link

What I’m trying to say is that the PITR feature in v6.4.0 is still unstable, right? Should I wait for a more stable version and see whether these issues get fixed? :rofl:
You almost made me forget what my core question was :innocent:

| username: TiDBer_YEqOV3Hn | Original post link

The problem has actually been listed above :joy:

| username: xfworld | Original post link

6.4 is not an LTS release; it is a DMR (Development Milestone Release).

If you want to go to production, currently 6.1 LTS is recommended.

For a POC, 6.1 is also the suggested version. If you just want to try out the new features, 6.4 is fine, but wait a bit longer before taking it to production.

I guess you want to experiment with the backup and restore process. TiKV itself supports scheduling with multiple replicas, which can achieve replica balancing as long as there are enough nodes.
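
As a side note on the replica policy discussed above, the cluster-level replica count and placement labels can be checked with pd-ctl. A minimal sketch, assuming pd-ctl is run through tiup against a placeholder PD address:

```shell
# Show the cluster's replication settings (max-replicas, location-labels);
# the PD address is a placeholder.
tiup ctl:v6.4.0 pd -u http://10.0.0.1:2379 config show replication
```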

The illustration describes the backup and restore process. You can refer to the documentation to simulate it.

| username: TiDBer_YEqOV3Hn | Original post link

I’m a newbie, seeking guidance!

| username: TiDBer_YEqOV3Hn | Original post link

Yes, to be precise, it’s about experimenting with full backups and log backups to achieve PITR.
I should have asked my core question right away :rofl:
How long will it take for PITR to be ready for production? I see that v6.4.0 just came out on the 17th of this month…

| username: xfworld | Original post link

Wait for the next LTS.

The timing might slip. As for which release it lands in, that depends on whether the relevant issues have been fixed.

| username: TiDBer_YEqOV3Hn | Original post link

Could you forward the issues I listed above so they can get fixed, haha~
The delay won’t push it into next year, right?

| username: TiDBer_YEqOV3Hn | Original post link

Oh, to clarify: during my testing, the workload was reads and writes on a single database with a single table.

| username: 裤衩儿飞上天 | Original post link

Wait for PITR GA. It seems that it hasn’t reached GA yet, and there are still quite a few issues.

| username: xfworld | Original post link

There are two modes: snapshot and log replay. I suggest you test it again.

Single database and single table are not the issue here.

The key is whether this process can meet your expectations.
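
To close the loop on the two modes mentioned above: point-in-time recovery restores the snapshot backup first and then replays the backed-up logs up to the chosen timestamp, via `br restore point`. A minimal sketch, where the PD address, storage URIs, and timestamp are placeholders:

```shell
# Restore the cluster to a specific point in time: BR restores the snapshot
# (full) backup, then replays the backed-up logs up to --restored-ts.
# PD address, storage URIs, and the timestamp are placeholders.
tiup br restore point \
    --pd "10.0.0.1:2379" \
    --full-backup-storage "s3://backup-bucket/snapshot-20221130?endpoint=http://ceph-rgw:7480&access-key=xxx&secret-access-key=yyy" \
    --storage "s3://backup-bucket/log-backup?endpoint=http://ceph-rgw:7480&access-key=xxx&secret-access-key=yyy" \
    --restored-ts "2022-11-30 12:00:00+0800"
```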