How to Prepare for Evaluating the Size of a Table

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 如何准备评估某张表的大小

| username: TiDBer_rdb3y58G

As the title says, I tried the following three queries and got vastly different results. How can I correctly evaluate the size of a table?

SELECT
    db_name,
    table_name,
    ROUND(SUM(total_size / cnt), 2) Approximate_Size,
    ROUND(SUM(total_size / cnt / (SELECT
                    ROUND(AVG(value), 2)
                FROM
                    METRICS_SCHEMA.store_size_amplification
                WHERE
                    value > 0)),
            2) Disk_Size
FROM
    (SELECT
        db_name,
            table_name,
            region_id,
            SUM(Approximate_Size) total_size,
            COUNT(*) cnt
    FROM
        information_schema.TIKV_REGION_STATUS
    WHERE
        db_name = 'test'
            AND table_name IN ('example_table2')
    GROUP BY db_name , table_name , region_id) tabinfo
GROUP BY db_name , table_name;

+---------+----------------+------------------+-----------+
| db_name | table_name     | Approximate_Size | Disk_Size |
+---------+----------------+------------------+-----------+
| test    | example_table2 |           647.00 |    599.07 |
+---------+----------------+------------------+-----------+
SELECT
    table_schema,
    table_name,
    table_rows,
    TRUNCATE(data_length / 1024 / 1024, 2) AS 'data(MB)',
    TRUNCATE(index_length / 1024 / 1024, 2) AS 'index(MB)'
FROM
    information_schema.tables
WHERE
    table_schema = 'test'
    AND table_name = 'example_table2'
ORDER BY
    data_length DESC, index_length DESC;

+--------------+----------------+------------+----------+-----------+
| table_schema | table_name     | table_rows | data(MB) | index(MB) |
+--------------+----------------+------------+----------+-----------+
| test         | example_table2 |    8000000 |   256.34 |    122.07 |
+--------------+----------------+------------+----------+-----------+
SELECT A.*
FROM INFORMATION_SCHEMA.TABLE_STORAGE_STATS A
JOIN information_schema.tables B
    ON A.table_id = B.TIDB_TABLE_ID
WHERE A.table_schema = 'test'
    AND A.table_name = 'example_table2';

+--------------+----------------+----------+------------+--------------+--------------------+------------+------------+
| TABLE_SCHEMA | TABLE_NAME     | TABLE_ID | PEER_COUNT | REGION_COUNT | EMPTY_REGION_COUNT | TABLE_SIZE | TABLE_KEYS |
+--------------+----------------+----------+------------+--------------+--------------------+------------+------------+
| test         | example_table2 |      178 |          3 |            1 |                  0 |         36 |     532480 |
+--------------+----------------+----------+------------+--------------+--------------------+------------+------------+
1 row in set (0.03 sec)
| username: zhanggame1 | Original post link

First, consider the purpose of the evaluation. To estimate the key range scanned by SQL, refer to information_schema.TIKV_REGION_STATUS; to estimate the amount of data that would be exported, refer to information_schema.tables. To calculate actual disk usage, you also need to take the compression ratio into account.

| username: Kongdom | Original post link

The official documentation uses the first method.

You can also refer to an earlier summary by an expert on this topic.

| username: zhaokede | Original post link

In addition to system tables, you also need to estimate the size based on experience.

| username: 濱崎悟空 | Original post link

It depends on the actual situation. Sometimes you look at tables, sometimes you look at table_storage_stats, and so on.

| username: tidb菜鸟一只 | Original post link

Use the first method.

| username: Jellybean | Original post link

To calculate the size of a single replica and its disk space usage, the information_schema.TIKV_REGION_STATUS table is usually used, aggregating at the Region level; this is your first method.

To get the table size from the cluster's statistics sampling, information_schema.tables is typically used.

The official documentation estimates the size of a table in TiDB with the first method, which is relatively accurate but requires somewhat complex SQL. If you only want a rough, quick view, information_schema.tables is simpler.

| username: 友利奈绪 | Original post link

Generally, the query provided in the official documentation is used.

| username: ziptoam | Original post link

The first query uses TiDB's information_schema.TIKV_REGION_STATUS view, which provides detailed storage information for each Region. It accounts for how data is distributed across Regions and for the space-amplification factor (via store_size_amplification), so the result reflects amplification caused by replication, indexes, and other internal mechanisms, making it closer to the storage space actually occupied.
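As a small illustration, the amplification factor that the first query divides by can be inspected on its own (a sketch assuming the cluster exposes METRICS_SCHEMA, as recent TiDB versions do):

```sql
-- Average space-amplification ratio across TiKV stores.
-- The first query divides the per-replica Approximate_Size by this
-- value to approximate the compressed size on disk.
SELECT ROUND(AVG(value), 2) AS avg_amplification
FROM METRICS_SCHEMA.store_size_amplification
WHERE value > 0;
```

Comparing this ratio against the gap between Approximate_Size and Disk_Size in the first result helps confirm what the query is actually doing.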

| username: Kamner | Original post link

I have also tried comparing against a BR export before, but I haven't found a very accurate way to estimate the size of a table. The official SQL query should be related to the table's high watermark.

| username: 这里介绍不了我 | Original post link

It seems the first method, recommended in the official documentation, is more accurate, since it aggregates at the Region level and takes the compression ratio into account. The second method is comparatively rougher.

| username: 呢莫不爱吃鱼 | Original post link

The first one, recommended by the official documentation.

| username: 霸王龙的日常 | Original post link

Currently, there is no precise way to estimate the size of the data Dumpling will export in advance. Use method two.
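If an estimate is not enough, the most direct check is to actually export the table and measure the dump. A sketch with hypothetical host, port, credentials, and output path:

```shell
# Export a single table with Dumpling, then measure the actual
# on-disk size of the exported files.
tiup dumpling -h 127.0.0.1 -P 4000 -u root \
  -T test.example_table2 --filetype sql -o /tmp/example_table2_dump

du -sh /tmp/example_table2_dump
```

Note that the exported size is uncompressed SQL/CSV text, so it will generally differ from both the Approximate_Size reported by TIKV_REGION_STATUS and the compressed size on disk.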

| username: 我是吉米哥 | Original post link

This is a very good topic.
Also, I have the same question: Does TiDB need frequent defragmentation?

| username: TiDBer_QKDdYGfz | Original post link

Good question, I've learned something from it. I also want to know whether it needs to be optimized frequently.