What is the compression algorithm for BR backup data?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: BR备份数据的压缩算法是怎样的?

| username: Kamner

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Encountered Issues: Problem Phenomenon and Impact】

Table sizes were estimated via SQL (Disk_Size); TiKV has 3 replicas.

For table T1, the estimated Approximate_Size is 1.728 TB, the BR backup log shows total-kv-size=1.626TB, and the actual backup file size is 354G, for a compression ratio of 1.57:1.

[2024/05/22 21:14:36.798 +08:00] [INFO] [collector.go:68] ["Table backup success summary"] [total-ranges=54741] [ranges-succeed=54741] [ranges-failed=0] [backup-checksum=3h16m45.789969052s] [backup-fast-checksum=10.348692ms] [backup-total-ranges=2] [backup-total-regions=29040] [total-take=6h0m15.779606885s] [BackupTS=449934016151814145] [total-kv=5960762744] [total-kv-size=1.626TB] [average-speed=75.23MB/s] [backup-data-size(after-compressed)=379.2GB] [Size=379162607724]

For table T2, the estimated Approximate_Size is 3.421 TB, the BR backup log shows total-kv-size=895.9GB, and the actual backup file size is 56G, for a compression ratio of 5.33:1.

[2024/05/23 14:27:35.566 +08:00] [INFO] [collector.go:68] ["Table backup success summary"] [total-ranges=24922] [ranges-succeed=24922] [ranges-failed=0] [backup-checksum=3h21m58.473698844s] [backup-fast-checksum=4.196718ms] [backup-total-regions=49603] [backup-total-ranges=3] [total-take=5h2m40.229112126s] [BackupTS=449951169439596551] [total-kv=9834368130] [total-kv-size=895.9GB] [average-speed=49.33MB/s] [backup-data-size(after-compressed)=59.02GB] [Size=59023982230]

1. As table T2 shows, the table size estimated via SQL can differ from the actual data volume by several times. For storage planning purposes, is there a more accurate way to estimate table size? (A query sketch follows these questions.)

2. The backup compression ratios of the two tables also differ by several times. What factors affect the compression ratio of BR backups?
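For reference, an Approximate_Size-style estimate is usually derived by summing per-region size estimates, for example with a query along the lines of the sketch below (the database and table names are placeholders; APPROXIMATE_SIZE is reported in MB and counts one copy of each region, not all three replicas):

SELECT DB_NAME, TABLE_NAME,
       SUM(APPROXIMATE_SIZE) / 1024 AS approx_size_gb    -- MB to GB
FROM   information_schema.TIKV_REGION_STATUS
WHERE  DB_NAME = 'mydb' AND TABLE_NAME = 'T1'            -- placeholder names
GROUP  BY DB_NAME, TABLE_NAME;

Because region sizes are statistical estimates maintained by TiKV, they can drift away from the real data volume, especially after heavy updates or deletes.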

| username: Kamner | Original post link

Both tables consist mostly of varchar columns, with no large-object column types.

| username: zhaokede | Original post link

The Brotli compression algorithm. The compression ratio depends on the sampled values: the more similar they are, the higher the compression ratio.

| username: zhaokede | Original post link

This is just speculation, but it is probably similar to ordinary local text compression.

| username: Kamner | Original post link

Is there documentation?

| username: DBAER | Original post link

The backup produces SST files. You can read up on the SST file format; it should use the compression algorithms commonly used by RocksDB.

| username: 像风一样的男子 | Original post link

The backup files are SST files, and the final (bottommost) level should use the zstd algorithm.
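If you want to confirm what compression the cluster's RocksDB instances are configured with, one option is to check the per-level compression settings via SHOW CONFIG (a sketch; the exact entries returned depend on the TiKV version):

SHOW CONFIG WHERE type = 'tikv' AND name LIKE '%compression-per-level%';

On a default installation this typically shows lz4 for the middle levels and zstd for the bottom levels of each column family. Note that this reflects how TiKV compresses data on disk; BR has its own compression settings for the backup SST files it writes.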

| username: yytest | Original post link

A more accurate way to estimate table size: the DATA_LENGTH, INDEX_LENGTH, and DATA_FREE fields in the INFORMATION_SCHEMA.TABLES view provided by TiDB can be used to estimate a table's size. However, these values are calculated from TiDB's logical data structures and do not reflect the actual data size at the storage-engine level; the real size may differ because of factors such as the storage engine's compression algorithm, index structures, and fragmentation.

To estimate table size more accurately, you can consider the following methods (a query sketch follows the list):

  • Use tidb_analyze_version=2 for table analysis, which updates statistics including the number of rows and size of tables and indexes, providing more accurate data.
  • Use EXPLAIN or EXPLAIN ANALYZE to view the execution plan of queries, which can help you understand the actual data distribution of the table.
  • For BR backups, you can directly check the total-kv-size in the backup logs, which is the actual data size read during the backup process and is closer to the actual storage usage.
  • If you are using TiDB Enterprise Edition, you can consider using the Performance Monitoring feature in TiDB Dashboard to monitor the storage usage of the cluster.
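As a rough illustration of the INFORMATION_SCHEMA.TABLES approach mentioned above, the logical estimate can be queried as sketched below (schema and table names are placeholders, and since the values are derived from statistics it helps to run ANALYZE TABLE first):

SELECT TABLE_SCHEMA, TABLE_NAME,
       DATA_LENGTH  / 1024 / 1024 / 1024 AS data_gb,
       INDEX_LENGTH / 1024 / 1024 / 1024 AS index_gb
FROM   INFORMATION_SCHEMA.TABLES
WHERE  TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'T1';      -- placeholder names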

Factors affecting the compression ratio of BR backups: the compression ratio of BR (Backup & Restore) backups is influenced by several factors, including:

  • Data type: Different types of data may have different compression characteristics. For example, text data is usually easier to compress than binary data.
  • Data redundancy: Highly redundant data (such as a large number of repeated strings or identical values) is easier to compress.
  • Data pattern: The organization of data also affects the compression rate. For example, ordered data may be easier to compress than unordered data.
  • Compression algorithm: The compression algorithm used by BR also affects the final compression ratio. TiDB uses the lz4 compression algorithm by default, which provides good compression speed and a moderate compression ratio.
  • Backup options: BR backup options, such as whether compression is enabled, will also directly affect the final backup file size.

| username: yiduoyunQ | Original post link

After a BR restore, the data goes directly to the lowest level, which has the highest compression ratio, so you can then see the table size (after compression).
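To get a TiKV-side view of a table's size after the restore, one option is the information_schema.TABLE_STORAGE_STATS view, whose TABLE_SIZE column is a region-based size estimate in MiB (a sketch; the schema and table names are placeholders):

SELECT TABLE_SCHEMA, TABLE_NAME, REGION_COUNT, TABLE_SIZE
FROM   information_schema.TABLE_STORAGE_STATS
WHERE  TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'T1';      -- placeholder names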

| username: zhanggame1 | Original post link

My guess is that the compression ratio depends mostly on data duplication: if the rows in your table are very similar, the compression ratio will be high; if they are not, it will be low. You can test this by loading the same table with different kinds of data.
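A simple way to test this is to load two tables with the same structure, one with highly repetitive values and one with dissimilar values, and compare the BR backup sizes. The sketch below uses made-up table names and the usual self-insert trick to grow the data:

-- Two hypothetical tables with the same structure.
CREATE TABLE t_repetitive (id BIGINT AUTO_INCREMENT PRIMARY KEY, v VARCHAR(255));
CREATE TABLE t_random     (id BIGINT AUTO_INCREMENT PRIMARY KEY, v VARCHAR(255));

-- Seed one row each, then repeat each self-insert ~20 times to double the row count each time.
INSERT INTO t_repetitive (v) VALUES ('the same long varchar value repeated over and over again');
INSERT INTO t_repetitive (v) SELECT v FROM t_repetitive;

INSERT INTO t_random (v) VALUES (MD5(RAND()));
INSERT INTO t_random (v) SELECT MD5(RAND()) FROM t_random;

-- Then back up each table with BR and compare total-kv-size against
-- backup-data-size(after-compressed) in the backup summary logs.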