Comparison Test of Storage Space Size Between TiDB and MySQL

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于tidb和mysql的存储空间大小对比测试

| username: terry0219

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5
[Encountered Problem: Phenomenon and Impact]
After using the DM component for data synchronization, it was found that for the same table and the same amount of data, the size of table A in the upstream MySQL is 900M, while the size of table A in the downstream TiDB is 2G. My question is, shouldn’t TiDB’s RocksDB engine take up less space? The size difference for the entire instance is even greater. The upstream MySQL occupies a total of 1T, but the downstream TiDB occupies a total of 2T.
Is there any parameter that needs to be optimized?

| username: tidb菜鸟一只 | Original post link

TiDB has at least three replicas, so it definitely occupies more space.

| username: 裤衩儿飞上天 | Original post link

How is 2G calculated?
How long is TiDB GC set for?

| username: terry0219 | Original post link

2G is connected through the graphical client explicitly.
GC is the default value, not set.

| username: 裤衩儿飞上天 | Original post link

I don’t know what client you are using that can directly display the size of the table.

For calculating the size of a TiDB table, you can refer to: TiDB Cluster Management FAQ | PingCAP Documentation Center

If you find the official documentation inaccurate, you can refer to this post: Calculating Table Size - :ringer_planet: TiDB Technical Issues - TiDB Q&A Community (asktug.com)

| username: terry0219 | Original post link

I used this method for statistics. Is this SQL not applicable in TiDB?

SELECT
	table_name AS 'Table Name',
	table_comment AS 'Table Description',
	table_rows AS 'Record Count',
	TRUNCATE(data_length / 1024 / 1024, 2) AS 'Data Size (MB)',
	TRUNCATE(index_length / 1024 / 1024, 2) AS 'Index Size (MB)'
FROM
	information_schema.TABLES
WHERE
	table_schema = 'xxx'
ORDER BY
	data_length DESC,
	index_length DESC;
| username: terry0219 | Original post link

  1. Obtain the proportion of regions for a specific table from tikv_region_status relative to the total regions.
  2. Multiply this proportion by the total size used in the monitoring, then divide by the number of replicas to get the space occupied by this table.

According to the above method, this is how I calculate it:

select count(*) from TIKV_REGION_STATUS where table_name='xxx';    // Result: 872
select count(*) from TIKV_REGION_STATUS;  // Result: 73813

872 / 73813 = 0.01 // Proportion of regions occupied
642G * 3 = 1926G // Total size of 3 TiKV nodes
1926 * 0.01 = 19.26 // Total size of 3 replicas
19.26 / 3 = 6 // Size of a single replica for the table

Is this the correct understanding?

| username: yiduoyunQ | Original post link

MySQL does not have high availability for a single copy of data. If you set up master-slave replication, the capacity will double. If you don’t need TiDB’s three-replica capability, you can set the replica to 1 (not recommended).

| username: Kongdom | Original post link

MySQL stores one copy of the data, while TiDB stores three copies of the data.

| username: 胡杨树旁 | Original post link

Is the data volume obtained through this method a single data volume or three separate data volumes? If we usually need to count the data volume, how should we do it?

| username: Kongdom | Original post link

What is the significance of doing this storage size comparison test? The more TiDB nodes there are, the smaller the storage space on a single node.

| username: 裤衩儿飞上天 | Original post link

Actually, this method is also an estimate and not very accurate because the size of each region is not necessarily equal. As mentioned above, it doesn’t have much significance; having a rough estimate is sufficient, and there’s no need for extreme accuracy.

| username: terry0219 | Original post link

The intention is to compare the storage space of TiDB with 3 replicas to a single table in MySQL. If TiDB occupies much more space, then the storage cost would be relatively high.

| username: Kongdom | Original post link

:sweat_smile: Three replicas definitely use more storage space than a single replica, this shouldn’t need testing. It’s essentially trading space for time.

| username: xfworld | Original post link

To achieve compaction and start the tiered compression of RocksDB, a bunch of conditions are required, such as data volume, version count, configuration parameters, etc.

The compaction action also takes time, so let the bullet fly for a while…

Comparing like this might not get you the results you want. It’s better to present the scenario you need, so everyone can help you brainstorm…

| username: tidb菜鸟一只 | Original post link

Your method is the calculation method under MySQL, which neither compresses nor has replicas. For TiDB, use the following:

SELECT
    db_name,
    table_name,
    ROUND(SUM(total_size / cnt), 2) Approximate_Size,
    ROUND(SUM(total_size / cnt / (SELECT
                    ROUND(AVG(VALUE), 2)
                FROM
                    METRICS_SCHEMA.store_size_amplification
                WHERE
                    VALUE > 0)),
            2) Disk_Size
FROM
    (SELECT
        db_name,
            table_name,
            region_id,
            SUM(Approximate_Size) total_size,
            COUNT(*) cnt
    FROM
        information_schema.TIKV_REGION_STATUS
    WHERE
        db_name = 'sbtest'
            AND table_name IN ('sbtest1')
    GROUP BY db_name , table_name , region_id) tabinfo
GROUP BY db_name , table_name;
  • store_size_amplification represents the average compression ratio of the cluster. Besides using the SELECT * FROM METRICS_SCHEMA.store_size_amplification; statement to query, you can also check the Size amplification metric under the PD - statistics balance panel in Grafana monitoring for each node to get this information. The average compression ratio of the cluster is the average of the Size amplification of all nodes.
  • Approximate_Size represents the size of a single replica of the table before compression. This value is an estimate and not accurate.
  • Disk_Size represents the size of the table after compression, which can be estimated based on Approximate_Size and store_size_amplification.

In my tests, the size of one replica is not much different from MySQL, and three replicas are basically three times the size of MySQL. TiDB aims for horizontal scalability, which definitely won’t save more space than single-machine MySQL, but it improves performance and security.

| username: terry0219 | Original post link

Thank you, thank you. The SQL I used was for statistics in MySQL, which is not applicable in TiDB. After using the SQL provided by the official source, the statistics are much smaller.

| username: terry0219 | Original post link

Is the Disk_Size value for a single replica or multiple replicas?

| username: tidb菜鸟一只 | Original post link

Single replica

| username: 我是咖啡哥 | Original post link

The data from a database I tested: the size is directly observed from the data directory size on the disk.