Poor Compression Effect of writeCF Index in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIDB writeCF索引压缩效果差

| username: residentevil

[TiDB Usage Environment] Production Environment
[TiDB Version] V7.1.0
[Encountered Problem: Problem Phenomenon and Impact] The writeCF in TiKV stores index information. Although its L6 and L7 levels use the LZ4 and ZSTD compression algorithms, the compression effect is several times to dozens of times worse than that of native RocksDB.

| username: zhanggame1 | Original post link

A difference of several times to dozens of times sounds far too extreme.

| username: h5n1 | Original post link

How did you measure and compare that?

| username: 像风一样的男子 | Original post link

How did you come to that conclusion? Do you have any test data?

| username: ajin0514 | Original post link

Is there a final answer?

| username: residentevil | Original post link

After migrating 2 TB of data stored in the RocksDB engine of MySQL 5.6 to TiDB, the single-replica storage grew to 3 TB. From the monitoring, the defaultCF holds 1.8 TB and the writeCF holds 1.2 TB.

| username: residentevil | Original post link

There is no need to consider the storage occupied by MVCC versions in writeCF, because the data was migrated only once in a single full load. According to the official documentation, writeCF stores MVCC information plus the index data, which means the index portion must have grown significantly.

| username: 有猫万事足 | Original post link

I suggest you take a look at the following two monitoring graphs. With a small amount of data, it may not yet have reached the levels with the highest compression ratio.


These two graphs are under Grafana -> tikv-details -> RocksDB-kv.

If you are not satisfied with the compression algorithm used at each level, it can be configured (compression-per-level under TiKV's rocksdb.defaultcf / rocksdb.writecf settings).

If you have tried everything and it is still large, there are also placement rules that can schedule cold data to cheaper, lower-performance disks. Frequently used data can be placed on high-performance SSDs.

| username: residentevil | Original post link

I suspect the significant data bloat comes from differences between how the native MySQL RocksDB engine stores indexes and how TiDB stores them. I've compared the compression algorithms at each level and they are the same. Since most of the data sits in the last two levels, the compression effect there should be the best.

| username: residentevil | Original post link

From the compression ratio chart: write-level-6 is 7.1 and default-level-6 is 2.4. The difference is too big.

| username: 有猫万事足 | Original post link

It may not necessarily be the index. I recall a course mentioning that when a row's value is within a certain size, it is written to the write CF instead of the default CF. If storage is really tight, it might be better to solve it through hot-cold separation.

Of course, if your original MySQL can support it, you don’t have to switch.

If the boss isn't willing to bear even this cost, why go to all this effort? With 3 replicas, if you don't want storage costs to go up, the only option left is to push compression as hard as possible, isn't it?

| username: residentevil | Original post link

If the data volume is only a few TB, a 50% increase is not a problem; at the PB level, a 50% increase in storage is a different matter. The main questions are: how can we know the proportions of the various kinds of data in writeCF, why its compression ratio differs so much from defaultCF's, and why it differs so much from native RocksDB's. We need to understand the root cause.

| username: residentevil | Original post link

Let me ask again: does the value shown for each level in the compression ratio panel represent the compression ratio? What is its unit?

| username: 有猫万事足 | Original post link

The algorithm is here.

You can also try to get this value from MySQL RocksDB and compare the differences between the two.
Technically, it’s just a matter of comparing the number of keys at each level and the compression ratio at each level.
Other differences, whether larger or smaller, might be due to the different mechanisms of the two, which are difficult for you to change.
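
For reference, the RocksDB property behind that per-level number ("rocksdb.compression-ratio-at-level<N>") defines it as uncompressed data size divided by compressed file size, so it is a unitless ratio. A back-of-the-envelope sketch with made-up sizes:

```python
# Hypothetical sizes, only to show how the unitless ratio is formed.
uncompressed_bytes = 710 << 30   # logical (uncompressed) data at write CF L6
compressed_bytes = 100 << 30     # on-disk (compressed) size of the same level
print(uncompressed_bytes / compressed_bytes)   # 7.1 -> "write-level-6: 7.1"
```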

| username: residentevil | Original post link

Got it. There's another new question: the TiUP tool actually calls PD's interfaces to perform some cluster-related operations (such as adding or removing TiDB, PD, or TiKV nodes). Apart from digging them out of the source code, I couldn't find these interfaces in the official documentation.

| username: 有猫万事足 | Original post link

It is not recommended to directly call the PD HTTP interface, as it may pose risks.

It is better to use the pd-ctl tool for such operations.

| username: residentevil | Original post link

If you want to automate things, it's easier to build on the HTTP protocol. Where can I find the PD interfaces?
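
For what it's worth, PD does expose an HTTP API under /pd/api/v1/ (the same API that pd-ctl talks to), and the read-only endpoints are the safer ones to script against. A minimal sketch, assuming a PD listening on 127.0.0.1:2379 (adjust for your cluster):

```python
# Minimal read-only sketch against PD's v1 HTTP API; the address is an assumption.
# No scheduling or membership changes are performed here.
import json
import urllib.request

PD = "http://127.0.0.1:2379"   # hypothetical PD address

def pd_get(path: str):
    """GET a PD API path and decode the JSON response."""
    with urllib.request.urlopen(f"{PD}{path}") as resp:
        return json.load(resp)

print(json.dumps(pd_get("/pd/api/v1/members"), indent=2))  # PD members
print(json.dumps(pd_get("/pd/api/v1/stores"), indent=2))   # TiKV stores
```

For anything that changes cluster state, pd-ctl (or tiup) remains the safer path, as noted above.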

| username: 有猫万事足 | Original post link

* write CF: used to store the data the user actually writes plus MVCC information (the start time and commit time of the transaction the data belongs to). When a user writes a row, if the row's value is shorter than 255 bytes, it is stored in the write CF; otherwise the row is stored in the default CF. Since the value of a TiDB non-unique index is empty and the value of a unique index is the primary key, secondary indexes only occupy space in the write CF.
* default CF: used to store data longer than 255 bytes.

https://docs.pingcap.com/zh/tidb/dev/rocksdb-overview
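
To make the routing rule quoted above concrete, here is a toy sketch (not TiKV code) of why secondary-index entries always land in the write CF: their values are either empty (non-unique index) or a small primary-key handle (unique index), so they never reach the short-value threshold. The 255-byte limit comes from the linked doc; everything else here is made up for illustration.

```python
SHORT_VALUE_LIMIT = 255  # bytes, per the doc quoted above

def cf_for_value(value: bytes) -> str:
    """Which column family a value would land in under the quoted rule."""
    return "writecf" if len(value) < SHORT_VALUE_LIMIT else "defaultcf"

print(cf_for_value(b""))          # non-unique index entry (empty value) -> writecf
print(cf_for_value(b"\x00" * 8))  # unique index entry (primary-key handle) -> writecf
print(cf_for_value(b"x" * 1024))  # a wide row value -> defaultcf
```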

default-level-6 is only 2.4? This is nearly half the compression ratio shown in my graph.

https://github.com/facebook/zstd

The zstd homepage provides more information. First, in its fast modes zstd can generally achieve a compression ratio of around 2.8, so a ratio of 2.4 is below expectations.
The page also mentions that zstd performs poorly when compressing small data and recommends using a dictionary in that case. This could be why your level-6 compression ratio is lower than expected.
Rocksdb started addressing this issue in 2021.
https://github.com/facebook/rocksdb/blob/269478ee4618283cd6d710fdfea9687157a259c1/docs/_posts/2021-05-31-dictionary-compression.markdown
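
To get a feel for how much a shared dictionary helps on many small, similarly structured entries, here is a small sketch using the python-zstandard package; the sample data and sizes are invented and only illustrate the point from the zstd homepage, not TiKV's actual behavior.

```python
import zstandard as zstd

# Many small, similarly structured entries (think: secondary-index keys).
samples = [b"idx_user_email:%08d:user%05d@example.com" % (i, i) for i in range(10000)]
raw = sum(len(s) for s in samples)

# Compress each entry on its own, without a dictionary.
plain = zstd.ZstdCompressor(level=3)
no_dict = sum(len(plain.compress(s)) for s in samples)

# Train a shared dictionary on the samples, then compress each entry with it.
dictionary = zstd.train_dictionary(16 * 1024, samples)
trained = zstd.ZstdCompressor(level=3, dict_data=dictionary)
with_dict = sum(len(trained.compress(s)) for s in samples)

print(f"raw {raw} B, no dict {no_dict} B (ratio {raw / no_dict:.2f}), "
      f"with dict {with_dict} B (ratio {raw / with_dict:.2f})")
```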

It is reasonable to suspect that the RocksDB version used by TiKV predates this 2021 change. Although it is unclear exactly which version is bundled, the rust-rocksdb project referenced by TiKV shows that the targeted version is RocksDB 6.4.
https://github.com/tikv/rust-rocksdb/

I checked: RocksDB 6.4 should be a release from 2019. Of course, it is also possible that the bundled RocksDB is already newer but the dictionary-related options simply are not set; I am not yet sure which is the case.
As for a workaround: if LZ4 gives a better compression ratio on the other levels, then trying LZ4 is about the only option.
Pushing TiKV to upgrade its RocksDB version or to pass the dictionary-related options seems too difficult.
Of course, if you are able to compile and integrate it yourself, you could consider that route.

| username: residentevil | Original post link

It seems that this won’t be easy to handle in the short term. The increase in storage costs is indeed quite significant. :sweat_smile:

| username: residentevil | Original post link

Let me ask a question: the parameter max-bytes-for-level-multiplier is the default amplification factor between levels. How should it be understood, and does it behave the same as the target_file_size_multiplier configuration in RocksDB? Please help me take a look.
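
As far as RocksDB documents these two options, they control different things: max-bytes-for-level-multiplier scales the total target size of each level (starting from max-bytes-for-level-base at L1), while target_file_size_multiplier scales the size of the individual SST files at each level (starting from target-file-size-base). A quick arithmetic sketch with example values, not necessarily your cluster's settings:

```python
# Example values only; with level_compaction_dynamic_level_bytes enabled,
# RocksDB adjusts per-level targets dynamically and this simple formula
# no longer applies directly.
max_bytes_for_level_base = 512 << 20       # target total size of L1, in bytes
max_bytes_for_level_multiplier = 10        # L(n) total = L(n-1) total * 10

target_file_size_base = 8 << 20            # size of one SST file at L1
target_file_size_multiplier = 1            # per-file size growth across levels

for level in range(1, 7):
    level_total = max_bytes_for_level_base * max_bytes_for_level_multiplier ** (level - 1)
    sst_size = target_file_size_base * target_file_size_multiplier ** (level - 1)
    print(f"L{level}: level target {level_total >> 20} MiB, "
          f"SST file target {sst_size >> 20} MiB")
```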