In TiDB, if some fields are empty, will row storage and column storage occupy space?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb中,如果某些字段为空,行存和列存会占用空间?

| username: cy6301567

In TiDB, if certain fields are empty, will row storage and column storage occupy space?

| username: Kongdom | Original post link

:thinking: I just searched, and it turns out that null values do take up space, occupying one byte. I always thought they didn’t.

| username: tidb菜鸟一只 | Original post link

Is this mentioned on the official website? I couldn’t find it.

| username: Kongdom | Original post link

I saw this somewhere else, I’ll try it out later since I have some time.

| username: Kongdom | Original post link

I verified it, and it does indeed take up storage space. :joy:
Inserting 1 million null values, the table size estimate increased from 1MB to 115MB.

CREATE TABLE `tmp_20230721001` (
  `a` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin

-- test	tmp_20230721001	1.00MB	0.08MB
SELECT
    db_name,
    table_name,
    ROUND(SUM(total_size/cnt), 2) 'Size of a single replica before compression (MB)',
    ROUND(SUM(total_size/cnt/(SELECT ROUND(AVG(value),2) FROM METRICS_SCHEMA.store_size_amplification WHERE value > 0)), 2) 'Size of the table after compression (MB)'
FROM (
	SELECT
        db_name,table_name,region_id,SUM(Approximate_Size) total_size,COUNT(*) cnt
    FROM information_schema.TIKV_REGION_STATUS
    WHERE db_name = 'test'
	AND table_name IN ('tmp_20230721001')
    GROUP BY db_name,table_name,region_id
) tabinfo
GROUP BY db_name,table_name;

INSERT INTO tmp_20230721001
SELECT NULL FROM INFORMATION_SCHEMA.TABLES a,INFORMATION_SCHEMA.TABLES b
LIMIT 1048576

SELECT count(1),1024*1024 FROM tmp_20230721001

-- test	tmp_20230721001	115.00MB	8.98MB
| username: zhanggame1 | Original post link

The test is inaccurate. This test table has a hidden primary key that takes up space. You should create 2 tables for testing.

The first table has one column: id int primary key.

The second table has two columns: id int primary key, a varchar(100). Then insert the same number of rows into each table and compare the space.

| username: Kongdom | Original post link

:astonished: Makes sense

| username: redgame | Original post link

Both row storage and column storage allocate a certain amount of storage space for fields that are empty.

| username: zhanggame1 | Original post link

It may require a larger amount of data. My test case with 1 million records doesn’t show any difference at all.

| username: h5n1 | Original post link

After version 4.0, TiDB uses the following new format for data rows:

new-row

  • version: The first byte is used to indicate the encoding version. In the old format described above, the first byte is used to indicate the type (Type) and only uses values (1-10). Therefore, we need to choose a value greater than 10 as the version number to distinguish it from the old table data encoding format.
  • flag: The next 1 byte is an extensible flag. Currently, only 1 bit is used to indicate whether it is a large row. When there is a ColumnID exceeding 255 or the total length of the Value part exceeds 65535, it will be marked as a large row and enough bytes will be used to store the column id array and offset array.
  • num of not null columns/num of null columns: The next 2 bytes record the number of non-NULL columns in the row, followed by 2 bytes recording the number of NULL columns in the row. These 4 bytes are mainly used to determine the length of the subsequent column id array.
  • not null column id array: Next is the non-NULL column ColumnID array of corresponding length. The array elements are sorted in ascending order to facilitate binary search for locating the index used by the offset array. If it is not a large row, each ColumnID requires 1 byte, otherwise, it requires 2 bytes.
  • null column id array: Next is the NULL column ColumnID array of corresponding length. The array elements are sorted in ascending order to facilitate binary search to determine if there are NULL columns. In the new format, NULL values only need to save the ColumnID without saving ColumnData.
  • not null column offset array: Specifies the end offset of the current column in the not null column data byte array. Assuming the current column is located at index i through the column id array, the data of this column will be data[offset[i-1]:offset[i]]. If it is not a large row, each offset requires 2 bytes, otherwise, it requires 4 bytes.
  • not null column data: The byte array of non-null column data is saved sequentially. It should be noted that Type information is no longer saved. Except for using int to replace varint, the format of a single column data remains consistent with the single column format in the original format.

When encoding, the row data will be organized into the defined format for writing.

When decoding, the caller will provide the ColumnID array and column type information of the required columns in the current data row as input. The decoding process will first determine whether to decode according to the new format based on the first byte version. If it is the new format, it will pre-organize the information of all segments except the last segment not null column data into flag or array in memory. Then, it will perform a binary search on the input ColumnID array in the not null column id array. If found, it will locate the data in the not null column data through the not null column offset array and decode it according to the column type information provided in the input. If not found in the not null column id array segment, it will continue to perform a binary search in the null column id array. If found, it will directly use the NULL value as the column decoding result. If still not found, it will use the default value defined by the schema as the column decoding result.

| username: Kongdom | Original post link

:+1: Professional interpretation

| username: h5n1 | Original post link

Copied

| username: cy6301567 | Original post link

Good

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.