Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 关于TiDB写入性能 (About TiDB write performance)
This is a question based purely on principle analysis; I don't have an actual environment at the moment. What I'm wondering is: if data in the same table is written in primary-key order, will each region effectively be written by a single machine for a stretch of time? All the writes are routed to the same node, and because the data is ordered, the keys are sequential too. The region splits once it reaches the split threshold (96 MiB by default), and after the split, writes still continue sequentially on a single machine. How is distributed write performance ensured in this case?
Without any special settings, incremental writes only go to the last region, so TiDB provides several ways to handle this. If the primary key is a BIGINT, you can use AUTO_RANDOM instead of AUTO_INCREMENT. If the primary key is not a BIGINT, you can use SHARD_ROW_ID_BITS to scatter the data. If neither is suitable, you can consider hash partitioning. The methods above apply to the table data itself, not to indexes: there is currently no good way to scatter secondary indexes, and the shard index feature has many limitations. You can search for the related keywords for more information.
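A minimal sketch of the three options mentioned above, assuming recent TiDB versions; the table and column names are illustrative and the shard/partition counts are placeholders to tune for your workload:

```sql
-- Option 1: BIGINT primary key -> AUTO_RANDOM instead of AUTO_INCREMENT,
-- so the high bits of the key are randomized and inserts land on many regions.
CREATE TABLE t_random (
    id   BIGINT PRIMARY KEY AUTO_RANDOM,
    name VARCHAR(64)
);

-- Option 2: non-BIGINT primary key -> keep the PK non-clustered and
-- scatter the hidden _tidb_rowid with SHARD_ROW_ID_BITS.
CREATE TABLE t_shard (
    code VARCHAR(32),
    name VARCHAR(64),
    PRIMARY KEY (code) NONCLUSTERED
) SHARD_ROW_ID_BITS = 4;

-- Option 3: hash partitioning, so sequential ids round-robin across partitions.
CREATE TABLE t_hash (
    id   BIGINT NOT NULL,
    name VARCHAR(64),
    PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 8;
```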
Tables are divided into clustered and non-clustered tables. For a non-clustered table whose primary key is an auto-increment field, use AUTO_RANDOM; for a clustered table, look at the primary key type and scatter the data according to the corresponding rule when creating the table. The official documentation and courses describe the methods in detail.
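Since the right method depends on whether the table is clustered, one quick way to check is below; the TIDB_PK_TYPE column is TiDB-specific, and the schema and table names are placeholders:

```sql
-- Reports CLUSTERED or NONCLUSTERED for the table's primary key
SELECT TABLE_NAME, TIDB_PK_TYPE
FROM information_schema.tables
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 't1';
```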
I think what I'm describing is the clustered-table scenario, for example when a batch of rows like this is written:
[1, "aaa", 18] → key = t1_row1
[2, "aaa2", 10] → key = t1_row2
…
[10000, "aaa10000", 1000] → key = t1_row10000
If the data is scattered, range queries, such as fetching rows 500-600, become less efficient; if it is not scattered, then, as described above, writes keep going to a single region. That is the main concern.
I understand your point. The biggest advantage of TiDB over single-node relational databases is concurrency, so distributing data by hash generally performs better than by range. Query pushdown to TiKV is concurrent: retrieving results from multiple TiKV nodes is faster than querying a single one. For example, if the histogram shows the 500-600 range holds a large proportion of the data and a single value occupies one region, then 100 regions participate in this range scan, which means (ideally) 100 TiKV nodes scan in parallel, which is faster than a single TiKV node. This should be distinguished from partitioned tables in traditional relational databases.
Is my understanding correct?
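To make the range-scan argument concrete, a rough sketch against the hash-partitioned table from the earlier example (table and column names are still illustrative):

```sql
-- With hash partitioning, a range predicate cannot be pruned to a single
-- partition, so regions in each partition scan their own slice of [500, 600]
-- in the TiKV coprocessor concurrently; TiDB merges the partial results.
SELECT id, name
FROM t_hash
WHERE id BETWEEN 500 AND 600;

-- EXPLAIN shows the per-partition TableReader / TableRangeScan tasks.
EXPLAIN SELECT id, name FROM t_hash WHERE id BETWEEN 500 AND 600;
```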
Actually, there is no perfect solution for this.
- Either the sequential data is left unscattered and written to one region at a time: write throughput is limited by that single region (a write hotspot), but range queries benefit because the data stays contiguous.
- Or the data is scattered to ensure write performance; for batch and range queries, the advantage of distributed concurrency compensates for the data not being stored together and having to be fetched from multiple regions.