Discussion on Partitioned Raft KV Features

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Partitioned Raft KV 特性讨论

| username: Jellybean

TiDB v6.6.0 introduced the Partitioned Raft KV storage engine as an experimental feature. This engine uses multiple RocksDB instances to store TiKV’s Region data, with each Region’s data stored independently in a separate RocksDB instance. In TiDB v7.4.0, the Partitioned Raft KV engine has seen further improvements in compatibility and stability.
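For context, switching engines is a TiKV configuration change. A minimal sketch, assuming the storage.engine switch documented since v6.6.0 (verify against the docs for your exact version):

```toml
# tikv.toml: minimal sketch for enabling the experimental engine
[storage]
engine = "partitioned-raft-kv"
```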

I have a few questions to discuss with everyone:

  1. A Region exclusively occupies a RocksDB instance, so a cluster with tens of thousands of Regions will have tens of thousands of RocksDB instances.
  • How is the management of so many instances handled?
  • Each RocksDB instance has its own MemTables and other per-instance memory, so total memory usage may rise sharply. Does this mean a higher risk of OOM?
  2. Previously, all Regions on a TiKV instance were written into a single RocksDB. When the data volume was large (TB level), there was significant write amplification. With Partitioned Raft KV there will no longer be TB-level RocksDB instances, because there won't be TB-level Regions.
    So, will the write amplification issue be greatly improved?

  3. We use the Placement Rules in SQL strategy together with large Regions to achieve massive data storage. Currently we store over 15 TB in a single RocksDB instance (on HDD). If we switch to Partitioned Raft KV, will compression rate and disk space usage improve? How should this be understood?

| username: h5n1 | Original post link

My humble opinion:

  1. After enabling Partitioned Raft KV, the default region-split-size is 10GiB and region-max-size is 15GiB, which will significantly reduce the number of Regions. For memtables, write-buffer-flush-oldest-first controls whether the oldest or the largest memtable gets flushed first, and write-buffer-limit caps the total memory used by all memtables (see the config sketch after this list).
  2. I don't think write amplification will improve much. Different tables trigger compaction at different times depending on their write patterns. For example, tables holding mostly historical data are rarely written to, so they compact less often and generate less IO, whereas in the original single-instance layout each compaction might repeatedly read and rewrite that data. Additionally, enabling Partitioned Raft KV stops writing the RocksDB WAL, saving IO.
  3. The previous approach achieved large Regions by tuning size parameters, but it had limitations; I recall a 4 GB limit related to Region size in TiKV transmission. Partitioned Raft KV implements dynamic Region sizes, which should remove those earlier limitations. The compression rate should stay about the same, since the compression algorithms are still RocksDB's. Disk space may be used more efficiently by compaction; in earlier versions I noticed that the deletes accumulated after GC were not compacted away promptly. I believe there should at least be a per-table compaction feature, which would minimize the impact on other tables.
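A rough tikv.toml sketch of the knobs mentioned in point 1, assuming the parameter names from the v6.6.0+ TiKV configuration; the write-buffer-limit value here is illustrative (its real default is derived from system memory):

```toml
# tikv.toml: illustrative values; check your version's defaults
[coprocessor]
region-split-size = "10GiB"  # default when storage.engine = "partitioned-raft-kv"
region-max-size = "15GiB"    # default when storage.engine = "partitioned-raft-kv"

[rocksdb]
# Cap on the total memory used by all memtables across instances.
write-buffer-limit = "2GiB"
# When over the limit, flush the oldest memtable first (true)
# instead of the largest one (false).
write-buffer-flush-oldest-first = true
```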

| username: TiDBer_小阿飞 | Original post link

My head is buzzing from reading this.

| username: Jellybean | Original post link

Thank you for the response.

  1. The new engine is likely to trigger OOM more easily. The documentation mentions some mitigations, such as reducing the size of each RocksDB memtable. The RFC itself lists this under Drawbacks:
    "This RFC may be easier to OOM compared to existing architecture."

  2. Write amplification is hard to judge, but overall cluster IO consumption is indeed reduced, with far fewer unnecessary data scans and compactions.
    After enabling Partitioned Raft KV, the RocksDB WAL is no longer written. There must be some other mechanism to avoid data loss on a crash (presumably replaying the Raft log, which is persisted for every write anyway); otherwise the risk would be too high.

  3. In terms of compression rate and space usage, not much changes with Partitioned Raft KV. In terms of IO consumption, however, there are significant optimizations, which is quite valuable for hot/cold archival storage solutions. Combining the Placement Rules in SQL strategy with Partitioned Raft KV keeps most data on HDD while also easing IO pressure. This could be a good application approach worth digging into (a sketch follows below).
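A sketch of that approach using Placement Rules in SQL; the disk=hdd label and the table name are hypothetical and assume the HDD-backed TiKV stores were started with that label:

```sql
-- Hypothetical example: pin a cold/archive table to HDD-backed stores.
-- Assumes TiKV stores carry the label disk=hdd (e.g. --labels disk=hdd).
CREATE PLACEMENT POLICY cold_on_hdd CONSTRAINTS = '[+disk=hdd]';
ALTER TABLE archive_orders PLACEMENT POLICY = cold_on_hdd;
```

With per-Region RocksDB instances, compactions of that cold data should then stay confined to the HDD stores holding it.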

| username: h5n1 | Original post link

I still don’t know how the block cache is managed and allocated.

| username: Jellybean | Original post link

Yes, the underlying core engine has changed a lot and is different from before. The official team needs to provide detailed documentation on this; otherwise, it’s hard to confidently adopt this feature.

If the official team is too busy, it would also be a good idea for you, the moderators, to provide relevant documentation in the community. :grinning:

| username: Fly-bird | Original post link

Reading this makes my head buzz. Learning from the experts.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.