Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 关于tidb和mysql的存储空间大小对比测试
[TiDB Usage Environment] Production Environment
[TiDB Version] 6.5
[Encountered Problem: Phenomenon and Impact]
After using the DM component for data synchronization, it was found that for the same table and the same amount of data, the size of table A in the upstream MySQL is 900M, while the size of table A in the downstream TiDB is 2G. My question is, shouldn’t TiDB’s RocksDB engine take up less space? The size difference for the entire instance is even greater. The upstream MySQL occupies a total of 1T, but the downstream TiDB occupies a total of 2T.
Is there any parameter that needs to be optimized?
TiDB has at least three replicas, so it definitely occupies more space.
How is 2G calculated?
How long is TiDB GC set for?
2G is connected through the graphical client explicitly.
GC is the default value, not set.
I don’t know what client you are using that can directly display the size of the table.
For calculating the size of a TiDB table, you can refer to: TiDB Cluster Management FAQ | PingCAP Documentation Center
If you find the official documentation inaccurate, you can refer to this post: Calculating Table Size - TiDB Technical Issues - TiDB Q&A Community (asktug.com)
I used this method for statistics. Is this SQL not applicable in TiDB?
SELECT
table_name AS 'Table Name',
table_comment AS 'Table Description',
table_rows AS 'Record Count',
TRUNCATE(data_length / 1024 / 1024, 2) AS 'Data Size (MB)',
TRUNCATE(index_length / 1024 / 1024, 2) AS 'Index Size (MB)'
FROM
information_schema.TABLES
WHERE
table_schema = 'xxx'
ORDER BY
data_length DESC,
index_length DESC;
- Obtain the proportion of regions for a specific table from tikv_region_status relative to the total regions.
- Multiply this proportion by the total size used in the monitoring, then divide by the number of replicas to get the space occupied by this table.
According to the above method, this is how I calculate it:
select count(*) from TIKV_REGION_STATUS where table_name='xxx'; // Result: 872
select count(*) from TIKV_REGION_STATUS; // Result: 73813
872 / 73813 = 0.01 // Proportion of regions occupied
642G * 3 = 1926G // Total size of 3 TiKV nodes
1926 * 0.01 = 19.26 // Total size of 3 replicas
19.26 / 3 = 6 // Size of a single replica for the table
Is this the correct understanding?
MySQL does not have high availability for a single copy of data. If you set up master-slave replication, the capacity will double. If you don’t need TiDB’s three-replica capability, you can set the replica to 1 (not recommended).
MySQL stores one copy of the data, while TiDB stores three copies of the data.
Is the data volume obtained through this method a single data volume or three separate data volumes? If we usually need to count the data volume, how should we do it?
What is the significance of doing this storage size comparison test? The more TiDB nodes there are, the smaller the storage space on a single node.
Actually, this method is also an estimate and not very accurate because the size of each region is not necessarily equal. As mentioned above, it doesn’t have much significance; having a rough estimate is sufficient, and there’s no need for extreme accuracy.
The intention is to compare the storage space of TiDB with 3 replicas to a single table in MySQL. If TiDB occupies much more space, then the storage cost would be relatively high.
Three replicas definitely use more storage space than a single replica, this shouldn’t need testing. It’s essentially trading space for time.
To achieve compaction and start the tiered compression of RocksDB, a bunch of conditions are required, such as data volume, version count, configuration parameters, etc.
The compaction action also takes time, so let the bullet fly for a while…
Comparing like this might not get you the results you want. It’s better to present the scenario you need, so everyone can help you brainstorm…
Your method is the calculation method under MySQL, which neither compresses nor has replicas. For TiDB, use the following:
SELECT
db_name,
table_name,
ROUND(SUM(total_size / cnt), 2) Approximate_Size,
ROUND(SUM(total_size / cnt / (SELECT
ROUND(AVG(VALUE), 2)
FROM
METRICS_SCHEMA.store_size_amplification
WHERE
VALUE > 0)),
2) Disk_Size
FROM
(SELECT
db_name,
table_name,
region_id,
SUM(Approximate_Size) total_size,
COUNT(*) cnt
FROM
information_schema.TIKV_REGION_STATUS
WHERE
db_name = 'sbtest'
AND table_name IN ('sbtest1')
GROUP BY db_name , table_name , region_id) tabinfo
GROUP BY db_name , table_name;
store_size_amplification
represents the average compression ratio of the cluster. Besides using the SELECT * FROM METRICS_SCHEMA.store_size_amplification;
statement to query, you can also check the Size amplification
metric under the PD - statistics balance panel in Grafana monitoring for each node to get this information. The average compression ratio of the cluster is the average of the Size amplification
of all nodes.
Approximate_Size
represents the size of a single replica of the table before compression. This value is an estimate and not accurate.
Disk_Size
represents the size of the table after compression, which can be estimated based on Approximate_Size
and store_size_amplification
.
In my tests, the size of one replica is not much different from MySQL, and three replicas are basically three times the size of MySQL. TiDB aims for horizontal scalability, which definitely won’t save more space than single-machine MySQL, but it improves performance and security.
Thank you, thank you. The SQL I used was for statistics in MySQL, which is not applicable in TiDB. After using the SQL provided by the official source, the statistics are much smaller.
Is the Disk_Size value for a single replica or multiple replicas?
The data from a database I tested: the size is directly observed from the data directory size on the disk.