Questions about How TiKV's Row Storage Implements Columns

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于TiKV行存储实现列的疑问

| username: TiDBer_U161qhfO

In the implementation of Bigtable, there are several concepts such as LG (locality group), CF (column family), and qualifier.
In this model, what is exposed to the application is a 'column'-like concept, but the underlying storage is still organized as rows.
For example:
Users read and write with set(rowkey, cf1, "column1", value)
The storage side actually stores <rowkey_cf_column1, value> in LevelDB
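
To make the example above concrete, here is a rough sketch of the Bigtable-style flattening I am describing; the names (flattenKey, user123, cf1) are purely illustrative and not real Bigtable or LevelDB APIs:

```go
package main

import "fmt"

// flattenKey shows the per-column layout: every (rowkey, column family,
// qualifier) combination becomes its own physical key, so one logical row
// is spread across many KV entries.
func flattenKey(rowkey, cf, qualifier string) string {
	return rowkey + "_" + cf + "_" + qualifier
}

func main() {
	// set(rowkey, cf1, "column1", value) ends up as one KV entry per column:
	fmt.Println(flattenKey("user123", "cf1", "column1"), "=>", "value1")
	fmt.Println(flattenKey("user123", "cf1", "column2"), "=>", "value2")
}
```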

In TiKV, CF is not exposed to users; it is only used by the transaction implementation. Moreover, my understanding is that the underlying storage is truly "row"-based? According to this document (https://tidb.net/blog/4fbd0203),
my understanding is that the data laid out at the TiKV level has the format <rowkey, [value1, value2, …]>, with the value encoded as described in that document. If my understanding is accurate, I have a few questions:

  1. For this organization method, why not store each column as a separate row, as Bigtable does? What are the advantages and disadvantages of each approach?
  2. If a user changes the data of a single 'column', how is that actually implemented in TiKV? (Read-Modify-Write?)
  3. The compute side pushes some tasks down to the storage layer, such as (where name='TiKV'), so the storage side actually knows the meaning of each column, right? The main reason for asking is to understand whether arranging this data on the business side can reduce the complexity of storage, or whether it is indeed SQL that currently manages the schema. For example, to storage the data might be just K-V (storage does not know the format of V), and after the data is read, only the upper-layer SQL computation can parse it.
| username: xingzhenxiang | Original post link

TiFlash uses columnar storage.

| username: TiDBer_U161qhfO | Original post link

What I mean is, for row-based data, how does TiKV provide ‘column’ capabilities, at least from the user’s perspective?

| username: WalterWj | Original post link

In the key-value pairs, the value contains the column data. Changing one column means re-encoding the entire row and appending a new version; MVCC controls the versions.
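
A minimal sketch of that idea, assuming a toy codec (encodeRow/decodeRow and versionedKey are placeholders, not the real TiKV/TiDB implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// encodeRow/decodeRow stand in for the real row codec: the whole row lives
// in a single value.
func encodeRow(cols []string) string { return strings.Join(cols, "|") }
func decodeRow(v string) []string    { return strings.Split(v, "|") }

// versionedKey stands in for the MVCC key, which carries a commit timestamp.
func versionedKey(rowKey string, ts int) string { return fmt.Sprintf("%s@%d", rowKey, ts) }

func main() {
	kv := map[string]string{}

	// Initial write of the whole row at ts=1.
	kv[versionedKey("t_r1", 1)] = encodeRow([]string{"1", "Alice", "30"})

	// Changing one column: read, decode, modify, re-encode the entire row,
	// and append it under a newer timestamp (ts=2).
	cols := decodeRow(kv[versionedKey("t_r1", 1)])
	cols[1] = "Bob"
	kv[versionedKey("t_r1", 2)] = encodeRow(cols)

	fmt.Println(kv) // both versions coexist until GC/compaction removes the old one
}
```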

| username: TiDBer_U161qhfO | Original post link

Where is the work of encoding and decoding entire rows done, in TiKV or on the SQL side?

| username: WalterWj | Original post link

You can check out the official public videos; there should be an introduction there.

| username: Jellybean | Original post link

SQL parsing and processing, execution plan optimization and generation, and data encoding and decoding are done on the tidb-server side, while tikv-server is responsible for MVCC, transactions, Raft replication, and other tasks.

| username: TiDBer_U161qhfO | Original post link

Distributed SQL Computing
“First, the predicate condition name = "TiDB" in SQL should be pushed down to the storage node for computation.”
Based on your statement and this description on the official website: if the SQL layer does the encoding and decoding, how can the computation of name='TiDB' be pushed down to storage? I understand that if it can be pushed down to storage, it must be decoded in storage, right? Also, I saw the TiKV source code analysis documents; is this referring to the coprocessor?

| username: Jellybean | Original post link

The core operation of distributed computing here is data querying and filtering. For a given predicate condition name = "TiDB", the tidb-server layer determines the specific retrieval process (which index to use and what range to scan) through analysis and query plan generation, based on the table's indexes, column conditions, statistics, and so on.

You can refer to the official documentation on computation for more details. In short, it maps a specific SQL query to Key-Value queries, with the Key range determined by the predicate condition; this is part of the execution plan generation process. Which TiKV instances hold a table's data can be learned by querying PD, and then tidb-server's KV interface fetches the corresponding data from the TiKV storage layer to perform the various computations.

For example, if your statement involves operations like sum/count/avg/max/min, the computation can be pushed down to the TiKV layer, which returns only the results. This is what pushing computation down to the storage layer means. If the computation cannot be performed directly in TiKV, the data is returned and computed in tidb-server, which is the non-pushdown scenario.
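
Here is a rough sketch of that pushdown idea under stated assumptions (storageNode and countWhere are made up for illustration; this is not the TiDB/TiKV API):

```go
package main

import "fmt"

type row struct {
	id   int
	name string
}

// storageNode stands in for the data held by one storage-side instance.
type storageNode struct{ rows []row }

// countWhere is the "pushed down" part: the filter and the count run next to
// the data, and only an integer crosses the network.
func (s *storageNode) countWhere(pred func(row) bool) int {
	n := 0
	for _, r := range s.rows {
		if pred(r) {
			n++
		}
	}
	return n
}

func main() {
	nodes := []*storageNode{
		{rows: []row{{1, "TiDB"}, {2, "TiKV"}}},
		{rows: []row{{3, "TiDB"}, {4, "PD"}}},
	}

	// "SELECT COUNT(*) FROM t WHERE name = 'TiDB'": each node returns a
	// partial count, and the SQL layer only has to sum the partial results.
	total := 0
	for _, n := range nodes {
		total += n.countWhere(func(r row) bool { return r.name == "TiDB" })
	}
	fmt.Println(total) // 2
}
```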

| username: TiDBer_U161qhfO | Original post link

Regarding the example you provided, where operations like sum/count/avg/max/min can be calculated in TiKV first and the result returned, i.e. pushing computation down to the storage layer:

I still have some doubts about this. For instance, in the query select count(c) from t where a + b < 5, I see that the coprocessor documentation describes it as follows, with an illustration:

On the TiKV side, there is a calculation of PlusInt(Col(a), Col(b)). The core issue here is: if TiKV does not know each column, how does it perform this operation? Is the example on the official website using columnar storage instead of row storage? Or is there another reason?

| username: Jellybean | Original post link

TiKV is a row storage engine. Whether it is directly queried or accessed through a regular index, each time it reads data, it retrieves the entire row. Therefore, it naturally knows every column.
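
An illustrative sketch of why this works: since each value holds the whole encoded row, the storage side can decode the columns it needs and evaluate an expression tree like LT(PlusInt(Col(a), Col(b)), Int(5)) locally. The names decodedRow and evalPlusIntLT below are made up; this is not the real coprocessor code:

```go
package main

import "fmt"

// decodedRow stands in for the row the storage side reconstructs from the
// encoded value plus the schema information carried in the request.
type decodedRow map[string]int64

// evalPlusIntLT evaluates "colA + colB < limit" against one decoded row,
// the way a coprocessor-style executor would evaluate the expression tree.
func evalPlusIntLT(r decodedRow, colA, colB string, limit int64) bool {
	return r[colA]+r[colB] < limit
}

func main() {
	rows := []decodedRow{
		{"a": 1, "b": 2, "c": 10}, // 1+2 < 5: kept
		{"a": 4, "b": 3, "c": 20}, // 7 >= 5: filtered out
	}

	// "SELECT COUNT(c) FROM t WHERE a + b < 5" evaluated at the storage side:
	count := 0
	for _, r := range rows {
		if evalPlusIntLT(r, "a", "b", 5) {
			count++
		}
	}
	fmt.Println(count) // 1
}
```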

| username: TiDBer_U161qhfO | Original post link

Uh, my understanding is that since encoding and decoding are done at the computation layer, TiKV definitely does not know what the specific columns are; the row data is just binary to TiKV. Conversely, if encoding and decoding were done at the storage layer, then TiKV would know the specific columns. So is there a gap in our communication here? How should "encoding and decoding" be understood?

| username: Jellybean | Original post link

The encoding I mentioned above refers to mapping relational data rows to (Key, Value) pairs in TiKV according to a mapping algorithm. Each row of data is encoded into a (Key, Value) pair according to the following rule:

Key:   tablePrefix{TableID}_recordPrefixSep{RowID}
Value: [col1, col2, col3, col4]

When you read this row of data back, you can recover each field. The encoding here is not encryption; it is a way of organizing data according to fixed rules so that it can be stored conveniently. I suggest reading the documentation and forum articles a few times for the specific rules.

Of course, what TiKV actually stores are binary byte arrays.
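
A minimal sketch of that mapping rule, with assumed helper names (recordKey, recordValue) rather than the real TiDB codec functions:

```go
package main

import (
	"fmt"
	"strings"
)

// recordKey follows the rule Key: tablePrefix{TableID}_recordPrefixSep{RowID}.
func recordKey(tableID, rowID int64) string {
	return fmt.Sprintf("t%d_r%d", tableID, rowID)
}

// recordValue follows the rule Value: [col1, col2, col3, col4], packing all
// columns of one row into a single value.
func recordValue(cols []string) string {
	return strings.Join(cols, ",")
}

func main() {
	key := recordKey(10, 1)
	val := recordValue([]string{"1", "TiDB", "SQL Layer", "10.0.1"})
	// TiKV itself only ever sees the resulting byte strings.
	fmt.Printf("%s => %s\n", key, val) // t10_r1 => 1,TiDB,SQL Layer,10.0.1
}
```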

| username: TiDBer_U161qhfO | Original post link

Let me see if I understand this: the encoding and decoding mentioned here refer to "translating SQL statements into requests that TiKV understands" and "decoding TiKV's responses into SQL statement results."

In reality, the binary format of the row value in TiKV can be parsed by TiKV itself to obtain all columns. The serialization and deserialization of this part of the format are actually unknown to the upper computation layer.

Is this understanding correct?

| username: Jellybean | Original post link

That's roughly the idea.

| username: TiDBer_U161qhfO | Original post link

Let me ask another question about the process. For insert into t(a, b, c) values("1", "2", "3") where id = 111; where the schema of table t is [id, a, b, c, d, e] with a total of 6 columns, what does TiDB-Server do and what does TiKV do?

  1. TiDB uses read-modify-write semantics, gets the entire row data from TiKV, updates each column, and then sends the updated entire row to TiKV.
  2. TiDB sends these three columns to TiKV, and TiKV reads the entire row content and updates it based on these three columns.

Intuitively, it seems like the second approach is used. Is this actually the case?

| username: Jellybean | Original post link

There is an issue with this SQL; what is the intention of the where condition at the end?

| username: TiDBer_U161qhfO | Original post link

Oh, I made a mistake. The correct statement should be insert into t(id, a, b, c) values(111, "1", "2", "3"). The id can be the primary index, and table t also has a column d. From the documentation, it seems that the entire row of data needs to be serialized and then written to TiKV. So my question is: how is the value of column d obtained? Does TiDB read the existing data and then append d before serializing (Read-Modify-Write)? Is the read-write atomicity guaranteed by TiKV's two-phase commit? I didn't see any part of the source code that reads the row back from TiKV; it seems to just add the record. Or is there only write semantics, with column d specially marked and the actual merge done inside TiKV (just guessing)?

| username: Jellybean | Original post link

Roughly speaking, the writing process is as follows:

  • If it is a new row, it directly writes the encoded KV key-value pair, performing a put operation.
  • If it is updating existing data (updating some fields), it first reads the entire row, then uses the updated column values and the old, unchanged column values, along with the latest timestamp to generate a new row, and then performs a put operation. The new and old data have different versions based on the timestamp, which is the basic prototype of MVCC. The old data is cleaned up during GC or Compaction operations.
  • If it is a delete operation, it also uses the old data row + the latest timestamp + delete marker to perform a put operation. The old data is cleaned up during GC or Compaction operations.

Transaction guarantees are achieved through 2PC, and in newer versions of TiDB, 1PC is also implemented for special scenarios.
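
A simplified sketch of those write paths under assumed structures (version/history and the helper methods below are illustrative, not the real TiKV MVCC layout):

```go
package main

import "fmt"

type version struct {
	ts      int64
	cols    map[string]string
	deleted bool
}

// history holds all MVCC versions of one row, newest last; GC/compaction
// would eventually drop the older entries.
type history []version

func (h *history) latest() *version { return &(*h)[len(*h)-1] }

// put appends a brand-new row version (the "new row" case).
func (h *history) put(ts int64, cols map[string]string) {
	*h = append(*h, version{ts: ts, cols: cols})
}

// update reads the latest full row, overlays the changed columns onto the
// old, unchanged ones, and appends the merged row as a new version.
func (h *history) update(ts int64, changed map[string]string) {
	merged := map[string]string{}
	for k, v := range h.latest().cols {
		merged[k] = v
	}
	for k, v := range changed {
		merged[k] = v
	}
	*h = append(*h, version{ts: ts, cols: merged})
}

// del appends a version carrying a delete marker.
func (h *history) del(ts int64) {
	*h = append(*h, version{ts: ts, deleted: true})
}

func main() {
	var h history
	h.put(1, map[string]string{"id": "111", "a": "1", "b": "2", "c": "3", "d": "old"})
	h.update(2, map[string]string{"a": "10"}) // column d is carried over unchanged
	h.del(3)
	fmt.Printf("%+v\n", h) // all versions coexist until GC/compaction
}
```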

| username: TiDBer_U161qhfO | Original post link

Reading the entire row is done on the TiDB-Server side, right? The underlying storage of TiKV always stores all columns, rather than storing partial columns, and the storage does the merge, correct?
Also, I haven't looked at this part of the code yet (https://github.com/pingcap/tidb/blob/master/executor/insert_common.go#L226). Could you please tell me where the reading of the entire row is done?