How is the TiKV index built?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV索引是怎么建立的

| username: TiDBer_JlY1JCJ5

How are indexes built in TiDB's TiKV, and is it the same way MySQL builds indexes?
TiDB's storage engine is key-value based, while MySQL uses InnoDB. What are the advantages and disadvantages of key-value storage compared to InnoDB, and why didn't TiDB choose InnoDB?

| username: oceanzhang | Original post link

It's different. TiDB indexes are also key-value pairs. Compared to InnoDB, TiDB generally inserts faster but reads slower, though some say its single-machine insert performance is not as good as MySQL's.

| username: 小龙虾爱大龙虾 | Original post link

You can take a look at this article

| username: Kongdom | Original post link

:thinking: I didn’t see how the index is created. I also want to know what the structure of the index is.

| username: zhanggame1 | Original post link

Since both are KV structures, I'd guess the index key is the index ID plus the index value, and the value is the row ID of the data row, with index keys stored in sorted order.

| username: 小龙虾爱大龙虾 | Original post link

Like table data, indexes are also stored in TiKV as key-value pairs.

| username: okenJiang | Original post link

Each row has a rowid

Key-value mapping relationship:

  1. index key (index ID + indexed column value) => rowid
  2. rowid => row
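
The mapping above can be sketched roughly in Python, loosely following the key layouts described in the TiDB documentation. The real encoding is a binary, memcomparable format; `row_key` and the other names here are illustrative, not TiDB APIs:

```python
# Hypothetical sketch of TiDB's key-value mapping. The real encoding is a
# binary, memcomparable format; these f-strings only illustrate the layout.

def row_key(table_id, row_id):
    # Row data: t{tableID}_r{rowID} => [column values]
    return f"t{table_id}_r{row_id}"

def unique_index_key(table_id, index_id, index_value):
    # Unique index: the key stops at the indexed value; the value holds the rowid.
    return f"t{table_id}_i{index_id}_{index_value}"

def secondary_index_key(table_id, index_id, index_value, row_id):
    # Non-unique index: the rowid is appended to keep keys unique; value is null.
    return f"t{table_id}_i{index_id}_{index_value}_{row_id}"

kv_store = {
    row_key(10, 1): ["alice", 30],               # rowid => row
    unique_index_key(10, 1, "alice"): 1,         # unique index => rowid
    secondary_index_key(10, 2, 30, 1): None,     # secondary index => (null)
}
```

Because all keys in a table share the same prefix and are stored in sorted order, a range scan over `t{tableID}_i{indexID}_` visits the index entries in index order.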

| username: Kongdom | Original post link

Yes, that’s it, thank you.

| username: TiDBer_JlY1JCJ5 | Original post link

Why is the value of the secondary index null?

| username: TiDBer_JlY1JCJ5 | Original post link

May I know where this document originates from?

| username: 小龙虾爱大龙虾 | Original post link

In key-value storage, to ensure the uniqueness of the key, the rowid is encoded into the secondary index. As a result, all the information is stored in the key, and there’s nothing left to store in the value, so it becomes null. :rofl:

| username: TiDBer_JlY1JCJ5 | Original post link

Why is it said that for secondary indexes, a key value may correspond to multiple rows, while for unique indexes, a key value will not correspond to multiple rows? I don’t quite understand this.

| username: 小龙虾爱大龙虾 | Original post link

A unique index means that the indexed column is unique, while a secondary index refers to a regular index, which may have rows with the same values.
May I ask what position you hold?

| username: forever | Original post link

TiDB Database Computing | PingCAP Documentation Center

| username: 春风十里 | Original post link

You can check out this article:

  1. TiDB Index Design

  2. Index Design Principles
    TiDB was initially designed to provide horizontal scalability while being compatible with MySQL, addressing the issue of sharding. As it evolved, it needed to support high concurrency reads and writes, consistency with multiple replicas, high availability, and distributed transactions. Based on these requirements, an appropriate index data structure was chosen.

  3. LSM-Tree (Log Structured Merge Tree)


    Data is divided into two parts: one in memory and one on disk. The memory part consists of Memtable and Immutable Memtable. The disk part consists of multiple levels of SSTables, with older data at higher levels.

The WAL (Write-Ahead Log) is an append-only, log-structured file used to replay operations after a crash, ensuring data in the MemTable and Immutable MemTable is not lost.

SSTable:
Key-value storage, ordered, containing a series of configurable-sized blocks (typically 64KB). The index of these blocks is stored at the end of the SSTable for quick lookup. When an SSTable is opened, the index is loaded into memory, allowing binary search within the memory index to find the disk offset of the key, and then reading the corresponding block from the disk. If memory is large enough, the SSTable can be mapped directly into memory using MMap for faster lookup.
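
That block-index lookup could be sketched like this (a toy index with made-up keys and offsets; real SSTables store the index in a binary block format at the end of the file):

```python
import bisect

# Toy model of an SSTable's in-memory block index: sorted (first_key, offset)
# pairs loaded when the file is opened. Binary search picks the one block
# that may contain the key, so only that block is read from disk.
block_index = [("apple", 0), ("mango", 65536), ("tomato", 131072)]
first_keys = [k for k, _ in block_index]

def block_offset(key):
    # Find the last block whose first key is <= the lookup key.
    i = bisect.bisect_right(first_keys, key) - 1
    return block_index[max(i, 0)][1]
```

For example, a lookup for a key between "mango" and "tomato" lands in the second block, so only the 64KB block at that offset needs to be fetched.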

MemTable is often an ordered data structure organized as a skip list.

Writing: when the data volume is very large, B+ trees suffer performance problems from page splits and merges during updates and deletions. An LSM-Tree instead writes data as a series of ordered, immutable SSTables, appended to disk like logs. Modifying existing data writes a new version rather than changing the old one, so write speed stays stable regardless of data volume: a write is just a log append plus an insert into a small MemTable.

Deleting data in an LSM-Tree means writing a delete marker (tombstone) to a new SSTable, so blocks are written to disk sequentially with no random writes. When the MemTable reaches a certain size (typically 64 MB), it is frozen into an Immutable MemTable.

Write process:

  1. Append the write to the WAL for fault recovery.
  2. Apply the same write to the MemTable in memory.
  3. Dump the Immutable Memtable to SSTable on disk (Minor Compaction).
  4. Periodically merge SSTables when their size or number exceeds a threshold (Major Compaction), removing deleted data and merging multiple versions to save space.

Reading: search the MemTable first, then the Immutable MemTable, then the level-0 SSTables, and finally down through the level-N SSTables. If the target key lives in the lowest level, every level has to be checked along the way, which causes read amplification. Compaction counteracts this by merging multiple SSTables into one and discarding overwritten or deleted data, reducing the number of SSTables that must be searched.
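
The read path and major compaction might be sketched as follows (toy structures: `memtable` is a dict and `sstables` a newest-first list of sorted runs; real engines merge with iterators rather than materializing dicts):

```python
# Toy LSM read path and major compaction.

TOMBSTONE = object()   # delete marker

def get(key, memtable, sstables):
    # Newest data wins: memtable first, then SSTables newest to oldest.
    if key in memtable:
        v = memtable[key]
        return None if v is TOMBSTONE else v
    for run in sstables:
        for k, v in run:
            if k == key:
                return None if v is TOMBSTONE else v
    return None

def major_compact(sstables):
    # Merge every run into one, keep only the newest version of each key,
    # and drop tombstones; fewer runs means less read amplification.
    merged = {}
    for run in reversed(sstables):      # oldest first, so newer overwrites
        merged.update(dict(run))
    return [sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)]
```
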

Optimization methods:

  • Compress data blocks.
  • Use Bloom filters to quickly determine if data is not in an SSTable.
  • Cache immutable SSTables in memory.
  • Merge multiple SSTables into one.
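
The Bloom-filter optimization from the list above can be illustrated with a minimal sketch (the size and hash scheme are arbitrary choices for the example, not what RocksDB uses):

```python
import hashlib

# Minimal Bloom filter sketch. k hash functions set k bits per key; if any
# of a key's bits is unset, the key is definitely absent, so the SSTable
# read can be skipped. An all-bits-set answer may be a false positive.

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False => definitely not present; True => probably present.
        return all(self.bits[p] for p in self._positions(key))
```
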

Compaction:
Merging reads SSTables into memory and writes the merged SSTable back to disk, which increases disk I/O and CPU consumption; this is known as write amplification.

Set data size thresholds for each level.
SSTable file structure:
File format:
<beginning_of_file>
[data block 1]
[data block 2]

[data block N]
[meta block 1: filter block]
[meta block 2: index block]
[meta block 3: compression dictionary block]
[meta block 4: range deletion block]
[meta block 5: stats block]

[meta block K: future extended block]
[metaindex block]
[Footer]
<end_of_file>

  4. Differences between TiDB and MySQL Indexes

An LSM-Tree splits data into many small sorted runs written sequentially, so a write is approximately O(1). A B+Tree splits data into fixed-size blocks or pages (typically 4KB, matching the OS page size), with write complexity O(log N). As inserts accumulate, B+Tree nodes split and merge, increasing random disk reads and writes and reducing performance. A B+Tree supports in-place updates and deletions and is more transaction-friendly, since a key appears in only one page. An LSM-Tree only supports append writes, key ranges overlap in L0, and updates and deletions are only applied during compaction, which makes it less transaction-friendly.

RocksDB splits disk-stored files into blocks (default 64KB). When reading a block, it first checks if the block is in the BlockCache in memory, avoiding disk access if present.
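
The BlockCache idea can be sketched as a small LRU cache keyed by (file, offset). This illustrates the concept only; RocksDB's real cache is sharded and charges capacity in bytes rather than block count:

```python
from collections import OrderedDict

# LRU block cache sketch: serve a block from memory on a hit; on a miss,
# read it from disk and evict the least-recently-used block when full.

class BlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get_block(self, file_id, offset, read_from_disk):
        key = (file_id, offset)
        if key in self.blocks:
            self.blocks.move_to_end(key)          # cache hit: refresh recency
            return self.blocks[key]
        block = read_from_disk(file_id, offset)   # cache miss: go to disk
        self.blocks[key] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)       # evict least recently used
        return block
```
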

TiDB supports primary key, unique, and regular (secondary) indexes.
It does not support FULLTEXT, HASH, or SPATIAL indexes.
It does not support descending indexes.
It does not support adding multiple indexes in a single statement.

  1. Differences from MySQL

  2. AUTO_INCREMENT
    Auto-increment values are allocated in batches to each TiDB server (default 30,000 values). Each TiDB node requests a segment of IDs as a cache, requesting the next segment when exhausted, rather than requesting an ID from the storage node each time. The order of values is not globally monotonic.

  3. DDL
    Cannot perform multiple operations in a single ALTER TABLE statement. For example, cannot add multiple columns or indexes in a single statement, otherwise, an “Unsupported multi schema change” error may occur. Does not support modifying column types on partitioned tables.

  4. Partitioned Tables
    Supports Hash and Range partitioning, plus Add/Drop/Truncate/Coalesce partition operations. Other partition operations are ignored and may produce the warning “Unsupported partition type, treat as normal table”. The following partition syntax is not supported:
    PARTITION BY LIST
    PARTITION BY KEY
    SUBPARTITION
    {CHECK|EXCHANGE|TRUNCATE|OPTIMIZE|REPAIR|IMPORT|DISCARD|REBUILD|REORGANIZE} PARTITION

  5. Locks
    TiDB currently does not support gap locking.
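
The batched AUTO_INCREMENT allocation described in item 2 above might look like this in miniature (hypothetical names; a central allocator stands in for the storage layer, and a small batch size is used just to make the interleaving visible, where TiDB's default is 30,000):

```python
# Sketch of batched auto-increment ID allocation across TiDB nodes.

class CentralAllocator:
    def __init__(self):
        self.next_start = 1

    def alloc_range(self, n):
        start = self.next_start
        self.next_start += n
        return start, start + n        # half-open range [start, start + n)

class TiDBNode:
    def __init__(self, central, batch=30000):
        self.central, self.batch = central, batch
        self.cur = self.end = 0

    def next_id(self):
        if self.cur >= self.end:       # cached range exhausted: fetch another
            self.cur, self.end = self.central.alloc_range(self.batch)
        self.cur += 1
        return self.cur - 1
```

With two nodes and a batch of 3, inserts arriving alternately get IDs 1, 4, 2, 5, ...: every ID is unique, but the sequence is not monotonic across nodes.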


Copyright Notice: This article is an original work by CSDN blogger “HT c++”, following the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this statement when reprinting.
Original link: TiDB索引_tidb 索引-CSDN博客

| username: 随缘天空 | Original post link

Master, it seems like everyone is studying the underlying principles.

| username: TiDBer_gxUpi9Ct | Original post link

Got it.

| username: oceanzhang | Original post link

Indexes and table data in TiDB are both converted to key-value pairs, but some keys have empty values, and key generation differs slightly between the two.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.