Issues with Primary Key Prefix Matching in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于tidb主键前缀匹配的问题

| username: SnowFlakeLeaf

[TiDB Usage Environment] Production Environment
[TiDB Version] 5.4

Hi all, if we use the business's unique key as the primary key, how does prefix matching on it perform in queries? Does it have any advantage over exact-match queries on a regular secondary index?

Background: there is a field unique_id, and the order_no field is a fixed prefix of it. For example, unique_id = 'test-123' and order_no = 'test'. [unique_ids are not written in sequential order; they are unordered. A large number of unique_ids starting with the same order_no is almost never written within a short period, so there is no hotspot issue. One order_no usually maps to fewer than 10 unique_ids.] order_no is a high-frequency query field in the business, but it is not unique, so it cannot be used directly as the primary key and clustered index.

Scenario 1: The unique_id field is used as the primary key, and queries use prefix matching on unique_id.
Scenario 2: A randomly generated ID is used as the primary key, an index is created on order_no, and queries use exact matching on order_no.
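To make the two scenarios concrete, here is a minimal sketch of the candidate schemas. The table names, column types, and the AUTO_RANDOM choice for the surrogate key are illustrative assumptions, not from the original post:

```sql
-- Scenario 1: the business unique key is the clustered primary key,
-- so the rows themselves are sorted and stored by unique_id.
CREATE TABLE orders_s1 (
    unique_id VARCHAR(64) NOT NULL,
    order_no  VARCHAR(32) NOT NULL,
    payload   VARCHAR(255),
    PRIMARY KEY (unique_id) CLUSTERED
);
-- Prefix-match query against the clustered key:
SELECT * FROM orders_s1 WHERE unique_id LIKE 'test-%';

-- Scenario 2: a random surrogate key plus a secondary index on order_no.
CREATE TABLE orders_s2 (
    id        BIGINT AUTO_RANDOM,
    unique_id VARCHAR(64) NOT NULL,
    order_no  VARCHAR(32) NOT NULL,
    payload   VARCHAR(255),
    PRIMARY KEY (id),
    UNIQUE KEY uk_unique_id (unique_id),
    KEY idx_order_no (order_no)
);
-- Exact-match query against the secondary index:
SELECT * FROM orders_s2 WHERE order_no = 'test';
```

In scenario 1 the prefix LIKE can become a range scan over the table's own key space; in scenario 2 the query reads index entries first and then fetches the rows by handle.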

Which of the two scenarios theoretically performs better? With tens of billions of rows stored, will there be a significant difference in query performance? Thank you.

| username: weixiaobing | Original post link

If scenario one relies on prefix matching, its performance should not be as good as scenario two's.

| username: SnowFlakeLeaf | Original post link

Our prefix match retrieves only a small amount of data, at most 10 rows. The storage section of the official documentation says: "Keys are divided into Ranges, and a continuous segment of Keys is stored on a single storage node." The clustered index section adds: "Equality or range condition queries involving only the prefix of the primary key will reduce multiple network data reads." That is why I have doubts here: if in most cases the results of a prefix match fall within a single Region, it seems the performance would be higher than in scenario two, where a query may depend on multiple Regions.

| username: tidb菜鸟一只 | Original post link

Is your unique_id in string format or numeric format?

| username: SnowFlakeLeaf | Original post link

It is in string format.
Example (the top-to-bottom order may also be the actual write order; no overall pattern in the unique key can be guaranteed, except that its prefix is always this row's order number, test):

| unique_id   | order_no |
| ----------- | -------- |
| test-sh-100 | test     |
| test-bj-010 | test     |
| test-sh-90  | test     |
| test-sh-110 | test     |

| username: tidb菜鸟一只 | Original post link

However, in your scenario, even though unique_id is used as the primary key, it cannot serve as the row key; TiDB will automatically generate _tidb_rowid as the key. If you don't use the shard_row_id_bits feature, these rows will indeed sit in one region, but won't that cause a write hotspot?

| username: SnowFlakeLeaf | Original post link

Yes, we do not intend to use shard_row_id_bits. We plan to use the primary key as the clustered index. The write hotspot issue should not exist because, in our business scenario, multiple pieces of data with the same order_no will not be generated in a short period. There will definitely be a significant time interval between their generation.

| username: tidb菜鸟一只 | Original post link

In that case, scenario one might be better, but scenario two has its own advantage: the data is likely spread across multiple regions and nodes, so it can be read in parallel from multiple TiKV nodes and shouldn't be particularly slow either.

| username: SnowFlakeLeaf | Original post link

But some details are still unclear to me. For example, when a LIKE predicate provides only a prefix, can TiDB quickly locate the regions that store the data? I'm not sure how the routing logic for prefix matching on a clustered index differs from that for a regular index.

| username: tidb菜鸟一只 | Original post link

This usage should be similar to a Point_Get: the data is read from TiKV directly by key and returned. A normal index first obtains the corresponding row key through the index, then reads the row from TiKV by that key, and then returns it.
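A hedged sketch of what the two access paths can look like in EXPLAIN, assuming hypothetical tables orders_s1 (unique_id as a string clustered primary key) and orders_s2 (a surrogate primary key plus an index idx_order_no on order_no). The operator trees are abbreviated, and the exact output depends on the TiDB version and statistics:

```sql
-- Scenario 1 (clustered PK, prefix LIKE): a single range scan on the
-- row keys themselves, with no separate lookup step.
EXPLAIN SELECT * FROM orders_s1 WHERE unique_id LIKE 'test-%';
-- TableReader
-- └─TableRangeScan   range: ["test-","test."), scans the PK key range directly

-- Scenario 2 (secondary index, exact match): the index returns row
-- handles first, then the rows are fetched by handle in a second
-- step, possibly from other regions.
EXPLAIN SELECT * FROM orders_s2 WHERE order_no = 'test';
-- IndexLookUp
-- ├─IndexRangeScan on idx_order_no   range: ["test","test"]
-- └─TableRowIDScan                   fetches full rows by handle
```

The visible difference is the extra lookup stage in scenario 2: IndexLookUp issues a second round of reads keyed by the handles the index scan produced.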

| username: SnowFlakeLeaf | Original post link

I still don't quite understand how a prefix-match lookup on the primary key works... When the primary key is not fully specified, how is the region located? I haven't found anything like a prefix Bloom filter structure in the TiDB documentation.
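As far as I know, no Bloom filter is needed: the optimizer rewrites a prefix LIKE into a closed-open range over the key space, and region metadata (regions are sorted by start key) is used to find the owning regions. A sketch of the rewrite, assuming a hypothetical table t with a string clustered primary key and a binary collation ('.' is the ASCII character immediately after '-'):

```sql
-- The prefix predicate ...
SELECT * FROM t WHERE unique_id LIKE 'test-%';
-- ... is equivalent to a range predicate in key order:
SELECT * FROM t WHERE unique_id >= 'test-' AND unique_id < 'test.';
-- Region routing operates on exactly such key ranges: TiDB asks PD
-- which region(s) cover ["test-", "test.") and sends the scan only
-- to the TiKV nodes serving those regions.
```

Under a case-insensitive collation the rewritten boundaries would differ, but the principle of turning a prefix into a key range stays the same.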

| username: tidb菜鸟一只 | Original post link

You can create a table and test it. In my test, with a string primary key as the clustered index, a prefix LIKE query did not need a full table scan; it retrieved the rows directly through the key.

| username: SnowFlakeLeaf | Original post link

Okay. Can the execution plan show how many regions were actually scanned? Thanks for testing this. We have now decided to create two tables and import data at the billion-row scale to compare the metrics. Besides performance, we may also need to evaluate cluster stability when a business unique key is used as the primary key. The actual differences may involve many aspects, so we need to run some realistic cases.

| username: tidb菜鸟一只 | Original post link

The execution plan reports the number of keys scanned, but not the number of regions. Still, hands-on experimentation is the best way to gain true knowledge and make an accurate assessment.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.