I followed the data testing steps in the official operation documentation at the address:
Regarding the last step, Step 4: Synchronize Columnar Data, I have some questions:
Why doesn’t TiFlash automatically synchronize TiKV data after deployment, and instead requires manually specifying the table to enable it?
With millions of rows, why does synchronization to columnar storage finish so quickly? After executing ALTER TABLE test.customer SET TIFLASH REPLICA 1; it appears to synchronize almost immediately. Is data synchronization from row storage to columnar storage really that fast?
The reason is that not all tables have analytical (AP) requirements, and storing every table in columnar format as well would waste space. TiDB is positioned as an HTAP database, so you choose which tables get a columnar replica.
Synchronization is complete when PROGRESS in information_schema.tiflash_replica reaches 1, not when the ALTER TABLE statement returns.
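For anyone following along, here is a minimal sketch of the enable-then-check sequence described above, using the test.customer table from the question. The selected columns are the ones I know from information_schema.tiflash_replica; the values you see will depend on your cluster and version.

```sql
-- Request one TiFlash (columnar) replica for this table only.
ALTER TABLE test.customer SET TIFLASH REPLICA 1;

-- The ALTER returns right away; re-run this until PROGRESS reaches 1.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'customer';

-- Tables without AP needs can stay row-only. To drop the replica later:
-- ALTER TABLE test.customer SET TIFLASH REPLICA 0;
```

A table that has never had a replica requested simply does not appear in the result.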
Thank you for your explanation. I tested the example from the documentation, where the lineitem table has 6 million rows. After executing ALTER TABLE, I ran SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' and TABLE_NAME = 'lineitem'; to check PROGRESS. It took about 5 seconds to reach 1, so the synchronization really is fast. Inserting the 6 million rows into the table took me almost half an hour, but synchronizing them from row storage to column storage took only 5 seconds. Is it really that fast? Can you briefly explain the reason?
You can run ANALYZE on the table and then check information_schema to see how big it is, or look at the size of the table's Regions. Inserting 6 million records shouldn't take half an hour, right? Your test data generation is too slow; you can investigate it through the slow SQL statements. Building a TiFlash replica involves scanning the table's Regions in TiKV and sending them to TiFlash as snapshots. 6 million records is not that large.
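A rough sketch of the "analyze, then check information_schema" suggestion, assuming the test.lineitem table from this thread. As far as I know, the sizes TiDB reports here are derived from statistics, so treat them as estimates rather than exact on-disk numbers.

```sql
-- Refresh statistics so the row count and size estimates are up to date.
ANALYZE TABLE test.lineitem;

-- Estimated row count and data/index size (in bytes) for the table.
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, DATA_LENGTH, INDEX_LENGTH
FROM information_schema.tables
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'lineitem';
```

If DATA_LENGTH comes back in the range of a few hundred MB or less, a snapshot transfer finishing within seconds is plausible.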
Because TiFlash only needs to read the idx value of the Raft log to synchronize, whereas writing table data into TiKV requires scheduling and balancing across nodes through the TiDB server, PD, and TiKV, and it also needs to retain multiple versions.
Thank you. You mentioned that storing all tables in columnar format wastes space. How can we see how much space that is, and how much space is occupied by columnar storage and by row storage respectively?
Thank you. You said TiFlash only needs to read the idx value of the Raft log to synchronize. Where is this synchronization process documented? I would like to understand it in detail.
It can't really be checked precisely, because the underlying layer is SST files and there is no way to confirm which table a given file belongs to. The data size in tikv_region_status is also only an estimate, and compression is involved.
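If an estimate is good enough, something like the following sketch sums the per-Region figures from tikv_region_status for the row data of test.lineitem. The column names are from my recollection of that table, and I believe APPROXIMATE_SIZE is reported in MB; as noted above, the result is approximate and does not reflect compression precisely.

```sql
-- Approximate row-store footprint: sum the estimated size of the table's record Regions.
SELECT DB_NAME,
       TABLE_NAME,
       COUNT(*)              AS region_count,
       SUM(APPROXIMATE_SIZE) AS approx_size_mb,  -- estimated, assumed to be in MB
       SUM(APPROXIMATE_KEYS) AS approx_keys      -- estimated key count
FROM information_schema.tikv_region_status
WHERE DB_NAME = 'test'
  AND TABLE_NAME = 'lineitem'
  AND IS_INDEX = 0          -- record data only, skip index entries
GROUP BY DB_NAME, TABLE_NAME;
```

A Region shared with another table is counted in full here, so small tables in particular can look larger than they are.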
It is specifically optimized for high-frequency data writing issues. If you’re interested, you can check out the source code reading series. If reading the article is tiring, there are also dedicated source code interpretation videos on Bilibili.
What I understand is that for row storage, 1000 identical records need to be stored 1000 times, while for column storage, 1000 identical records only need to be stored once. From this perspective, it is definitely faster and takes up less space.