The actual usage of TiKV differs significantly from the actual size of all databases discovered through SQL queries

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv实际使用量和通过SQL查询发现的所有库的实际大小相差甚多

| username: ks_ops_ms

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.4.0
[Reproduction Path] Operations performed that led to the issue
[Encountered Issue: Issue Phenomenon and Impact]
[Resource Configuration] Navigate to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: Screenshots/Logs/Monitoring]
Query to view the size of all databases via SQL:

SELECT table_schema AS "Database",
ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS "Size (GB)"
FROM information_schema.TABLES
GROUP BY table_schema;


Monitoring status of each TiKV

The SQL query shows the database size to be around 3.5T, but TiKV disk monitoring indicates approximately 4.8T used.
Upon entering the TiKV container, it was found that the db directory size for each TiKV is 1.4T, and other rocksdb.info files are about 140GB each.
There is a significant discrepancy between actual usage and current storage.

| username: 大飞哥online | Original post link

When querying the size of all databases in SQL, the result is the logical size of the data, which includes indexes, metadata, and other overhead.

| username: 大飞哥online | Original post link

The actual usage of TiKV takes into account the physical storage space occupied by the data. Due to compression and other optimizations, the actual usage will be smaller. There will also be raft logs and other information.

| username: 像风一样的男子 | Original post link

The actual storage of KV is compressed, roughly three times. Additionally, since data storage is in three replicas, this results in the disk statistics being roughly similar to the SQL statistics.

| username: tidb菜鸟一只 | Original post link

Try looking at it this way.

| username: dba远航 | Original post link

The actual storage of TiKV does not correspond to the actual data volume. TiDB uses space to trade for write time, and any DML is implemented by writing new data. Additionally, data starts to compress once it enters level 1. All of these factors make it impossible to correspond to the real data.

| username: zhanggame1 | Original post link

TiDB data is saved in three copies and is stored in a compressed format, which is different from what you see in information_schema.TABLES; that is statistical information. The time consumption can be calculated by using show table regions to get the size of the regions and then dividing by the compression ratio.

| username: ks_ops_ms | Original post link

The unit of disk usage seen in this SQL is not very clear.

| username: tidb菜鸟一只 | Original post link

The unit is M.

| username: ks_ops_ms | Original post link

In other words, the data on each of the three KV nodes is identical, but the data in each KV is compressed. So, for example, the total size of all the databases I see through SQL is over 3TB, but after compression, it is only about 1.6TB on the disk.

| username: 像风一样的男子 | Original post link

The most accurate way to query the amount of data is to use Dumpling to logically export the data into SQL or CSV. This way, there is no multi-level compression or multiple replicas. Other calculations are estimates and are not accurate.