How to Prepare for Evaluating the Size of a Table

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何准备评估某张表的大小

| username: TiDBer_rdb3y58G

As the title says, I tried the following three queries and got vastly different results. How can I correctly evaluate the size of a table?

SELECT
    db_name,
    table_name,
    ROUND(SUM(total_size / cnt), 2) Approximate_Size,
    ROUND(SUM(total_size / cnt / (SELECT
                    ROUND(AVG(value), 2)
                FROM
                    METRICS_SCHEMA.store_size_amplification
                WHERE
                    value > 0)),
            2) Disk_Size
FROM
    (SELECT
        db_name,
            table_name,
            region_id,
            SUM(Approximate_Size) total_size,
            COUNT(*) cnt
    FROM
        information_schema.TIKV_REGION_STATUS
    WHERE
        db_name = 'test'
            AND table_name IN ('example_table2')
    GROUP BY db_name , table_name , region_id) tabinfo
GROUP BY db_name , table_name;

+---------+----------------+------------------+-----------+
| db_name | table_name     | Approximate_Size | Disk_Size |
+---------+----------------+------------------+-----------+
| test    | example_table2 |           647.00 |    599.07 |
+---------+----------------+------------------+-----------+
SELECT
    table_schema,
    table_name,
    table_rows,
    TRUNCATE(data_length / 1024 / 1024, 2) AS 'data(MB)',
    TRUNCATE(index_length / 1024 / 1024, 2) AS 'index(MB)'
FROM
    information_schema.tables
WHERE
    table_schema = 'test'
    AND table_name = 'example_table2'
ORDER BY
    data_length DESC, index_length DESC;

+--------------+----------------+------------+----------+-----------+
| table_schema | table_name     | table_rows | data(MB) | index(MB) |
+--------------+----------------+------------+----------+-----------+
| test         | example_table2 |    8000000 |   256.34 |    122.07 |
+--------------+----------------+------------+----------+-----------+
SELECT A.*
FROM INFORMATION_SCHEMA.TABLE_STORAGE_STATS A
JOIN information_schema.tables B
    ON A.table_id = B.TIDB_TABLE_ID
WHERE A.table_schema = 'test'
    AND A.table_name = 'example_table2';

+--------------+----------------+----------+------------+--------------+--------------------+------------+------------+
| TABLE_SCHEMA | TABLE_NAME     | TABLE_ID | PEER_COUNT | REGION_COUNT | EMPTY_REGION_COUNT | TABLE_SIZE | TABLE_KEYS |
+--------------+----------------+----------+------------+--------------+--------------------+------------+------------+
| test         | example_table2 |      178 |          3 |            1 |                  0 |         36 |     532480 |
+--------------+----------------+----------+------------+--------------+--------------------+------------+------------+
1 row in set (0.03 sec)
| username: zhanggame1 | Original post link

First, consider the purpose of the evaluation. To estimate the key range scanned by SQL, refer to information_schema.TIKV_REGION_STATUS; to estimate the amount of data that would be exported, refer to information_schema.tables. To calculate actual disk usage, you also need to take the compression ratio into account.

| username: Kongdom | Original post link

The official documentation uses the first method.

You can also refer to an earlier summary by an expert on this topic.

| username: zhaokede | Original post link

In addition to system tables, you also need to estimate the size based on experience.

| username: 濱崎悟空 | Original post link

It depends on the actual situation. Sometimes you look at tables, sometimes you look at table_storage_stats, and so on.

| username: tidb菜鸟一只 | Original post link

Use the first method.

| username: Jellybean | Original post link

To calculate the size of a single replica and its disk space usage, the information_schema.TIKV_REGION_STATUS table is usually used, aggregating at the Region level; this is your first method.

To get the table size from the cluster's statistics sampling, information_schema.tables is typically used.

The official documentation estimates the size of a table in TiDB with the first method, which is relatively accurate but requires somewhat complex SQL. If you only want a rough, quick view, information_schema.tables is simpler.

| username: 友利奈绪 | Original post link

Generally, the query provided in the official documentation is used.

| username: ziptoam | Original post link

The first query uses TiDB's information_schema.TIKV_REGION_STATUS view, which provides detailed storage information for each Region. It accounts for how data is distributed across Regions and for the space-amplification factor (via store_size_amplification), so the result reflects amplification caused by replication, indexes, and other internal mechanisms, making it closer to the storage space actually occupied.
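As a small illustration, the amplification factor that the first query divides by can be inspected on its own (a sketch assuming the cluster exposes METRICS_SCHEMA, as recent TiDB versions do):

```sql
-- Average space-amplification ratio across TiKV stores.
-- The first query divides the per-replica Approximate_Size by this
-- value to approximate the compressed size on disk.
SELECT ROUND(AVG(value), 2) AS avg_amplification
FROM METRICS_SCHEMA.store_size_amplification
WHERE value > 0;
```

Comparing this ratio against the gap between Approximate_Size and Disk_Size in the first result helps confirm what the query is actually doing.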

| username: Kamner | Original post link

I have also tried comparing against a BR export before, but I haven't found a very accurate way to estimate the size of a table. The official SQL query should be related to the table's high watermark.

| username: 这里介绍不了我 | Original post link

It seems the first method, recommended in the official documentation, is more accurate, since it aggregates at the Region level and takes the compression ratio into account. The second method is comparatively rougher.

| username: 呢莫不爱吃鱼 | Original post link

The first one, recommended by the official documentation.

| username: 霸王龙的日常 | Original post link

Currently, there is no precise way to estimate the size of the data Dumpling will export in advance. Use method two.
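If an estimate is not enough, the most direct check is to actually export the table and measure the dump. A sketch with hypothetical host, port, credentials, and output path:

```shell
# Export a single table with Dumpling, then measure the actual
# on-disk size of the exported files.
tiup dumpling -h 127.0.0.1 -P 4000 -u root \
  -T test.example_table2 --filetype sql -o /tmp/example_table2_dump

du -sh /tmp/example_table2_dump
```

Note that the exported size is uncompressed SQL/CSV text, so it will generally differ from both the Approximate_Size reported by TIKV_REGION_STATUS and the compressed size on disk.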

| username: 我是吉米哥 | Original post link

This is a very good topic.
Also, I have the same question: Does TiDB need frequent defragmentation?

| username: TiDBer_QKDdYGfz | Original post link

Good question, I've learned something from it. I also want to know whether it needs to be optimized frequently.