Choosing Primary Keys in TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于tidb的主键选择

| username: 突破边界

There are four choices for the primary key in TiDB:

  1. AUTO_INCREMENT
  2. UUID
  3. AUTO_RANDOM
  4. No primary key set; TiDB generates a hidden primary key by default
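
For reference, the four options correspond roughly to the following DDL (a sketch; table and column names are made up):

```sql
-- 1. AUTO_INCREMENT: sequential IDs, written to the end of one Region
CREATE TABLE t1 (id BIGINT AUTO_INCREMENT PRIMARY KEY, v VARCHAR(64));

-- 2. UUID: generated by the application (or via UUID()), stored as a string
CREATE TABLE t2 (id CHAR(36) PRIMARY KEY, v VARCHAR(64));

-- 3. AUTO_RANDOM: TiDB-specific, scatters writes across Regions
CREATE TABLE t3 (id BIGINT AUTO_RANDOM PRIMARY KEY, v VARCHAR(64));

-- 4. No primary key: TiDB adds a hidden _tidb_rowid column internally
CREATE TABLE t4 (v VARCHAR(64));
```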

I have two questions:

  1. According to the documentation, the latter three options help avoid read-write hotspot issues. Which one is better to use? Which one is more effective in avoiding hotspot issues, and which one has higher performance? Has anyone done performance testing?
  2. The AUTO_RANDOM attribute is unique to TiDB. When I use common database tools to export the table creation SQL, the SQL statement does not include AUTO_RANDOM. Are there any database client tools on the market that support this attribute?

I hope to get your help, thank you!

| username: xfworld | Original post link

  • Question 1:

    • Scenario characteristics: business scenarios vary greatly in scale, and many systems must also consider compatibility. To keep data consistent before and after migration, primary key handling has to be planned.
    • Conclusion: Please make the appropriate choice based on your business scenario. If compatibility issues are not a concern, you can adopt the best-performing solution. Otherwise, a lot of compatibility handling is required, even if done manually…
  • Question 2:
    Most tools on the market support this. TiDB-specific features are expressed inside executable comments, which take effect as long as the client keeps comments enabled when connecting. The officially recommended tool is mycli; please consider it.

| username: 突破边界 | Original post link

Thank you for your answer. Continuing with my inquiry, in a newly developed scenario without historical data compatibility issues, which solution offers the best performance?

| username: xfworld | Original post link

Performance comes down to one thing: whether data can be distributed evenly across the nodes for both reads and writes, avoiding read skew and write skew. In simple terms, don't let all requests hit a single node instance; otherwise the advantage of distributed processing collapses into a single machine, and that is the biggest problem.

So what is the core associated with this problem point?
Answer: Hotspot

Actually, the official documentation has already provided the answer, but the answer still depends on the scenario. You need to know the approximate data scale to shard the data well. Even if you are unsure at first and define an insufficient scale, it doesn't matter much: you can adjust manually to get the same automatic sharding effect. Please refer to AUTO_RANDOM | PingCAP 文档中心
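
The manual adjustment mentioned above can be done with TiDB's `SPLIT TABLE` statement (a sketch; the table name and range bounds here are hypothetical, assuming a BIGINT primary key):

```sql
-- Pre-split the table's key range into 16 Regions so writes
-- spread out before the data grows large enough to split naturally
SPLIT TABLE t BETWEEN (0) AND (9223372036854775807) REGIONS 16;
```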

Earlier versions did not have AUTO_RANDOM; the feature evolved gradually so that it covers more scenarios while staying simple to use. :+1:

| username: TIDB-Learner | Original post link

Purely theoretical advice: avoid UUID and AUTO_INCREMENT; AUTO_RANDOM is recommended. That said, many systems still use UUID as the primary key.

| username: zhanggame1 | Original post link

First, you need to determine if your IO is particularly high and if there will be a bottleneck. If not, use AUTO_INCREMENT.
Secondly, check the number of TiKV instances. If there are only 3 replicas and 3 TiKV instances, then using AUTO_RANDOM doesn’t make much sense.

| username: TiDBer_CkS2lbTx | Original post link

In the past, the tables migrated from MySQL were manually scattered after being migrated with AUTO_INCREMENT. The newly created ones later all used AUTO_RANDOM, although they no longer have the auto-increment attribute.

| username: 有猫万事足 | Original post link

  1. The most recommended is AUTO_RANDOM. Using this setting for the primary key in a clustered table will reduce one back-and-forth operation per read/write based on the primary key, resulting in better performance. However, this is not very friendly for upstream and downstream migration scenarios.
  2. AUTO_INCREMENT will definitely have hotspots if the table is not modified. For large tables, you must use a non-clustered table with SHARD_ROW_ID_BITS modification. Reads/writes based on the primary key will have one more back-and-forth operation compared to AUTO_RANDOM, but it is more migration-friendly.
  3. For using UUID as the primary key, refer to the best practices at UUID 最佳实践 | PingCAP 文档中心.
  4. If not set, TiDB will generate a hidden primary key by default.
    The claim that not setting it will eliminate hotspots is unfounded.
    Overall, to eliminate hotspots, use clustered tables with AUTO_RANDOM, or non-clustered tables with SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS.
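
The two hotspot-free patterns described above might look like this in DDL (a sketch; names are made up, and the clustered/non-clustered syntax assumes a recent TiDB version, v5.0 or later):

```sql
-- Pattern A: clustered table with AUTO_RANDOM
-- PK lookups hit the row directly, no extra index hop
CREATE TABLE orders_a (
    id      BIGINT AUTO_RANDOM PRIMARY KEY,
    payload VARCHAR(255)
);

-- Pattern B: non-clustered table scattered via SHARD_ROW_ID_BITS,
-- pre-split into 2^3 = 8 Regions at creation time
CREATE TABLE orders_b (
    id      BIGINT AUTO_INCREMENT PRIMARY KEY NONCLUSTERED,
    payload VARCHAR(255)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS = 3;
```

Note that SHARD_ROW_ID_BITS scatters the hidden `_tidb_rowid`, which only exists for non-clustered tables, so the two attributes cannot be combined with a clustered primary key.
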

| username: 小龙虾爱大龙虾 | Original post link

  1. AUTO_INCREMENT
    In concurrent write scenarios, row data will be concentrated in a single region, causing hotspots, so it is not suitable for high-concurrency write scenarios.

  2. UUID
    It’s a string and quite long at 36 characters. Personally, I feel the performance is not as good as numbers, but you can test it.

  3. AUTO_RANDOM
    This feature was introduced to solve the hotspot issue in high-concurrency write scenarios. It can only solve row data hotspots, not index hotspots.

  4. No setting, TiDB generates a hidden primary key by default
    This default is also auto-increment. You can use the SHARD_ROW_ID_BITS feature to solve the hotspot issue. The principle is similar to AUTO_RANDOM, but it likewise cannot solve index hotspots.

  5. The documentation mentions that the latter three methods help avoid read/write hotspot issues. So, which one is better? Which one is more effective in avoiding hotspot issues, and which one has higher performance? Has anyone done stress testing?
    Personally, I feel the best should be AUTO_RANDOM: a clustered table writes one less key per row, and the randomness avoids hotspot issues.

  6. The AUTO_RANDOM attribute is unique to TiDB. When I use common database tools to export the create table SQL, the create table SQL statement does not include AUTO_RANDOM. Are there any database client tools on the market that support this attribute?
    I use DBeaver to view the table DDL, and the attribute is visible there. It should also show up via the SHOW CREATE TABLE statement.
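
Tools that retrieve DDL via `SHOW CREATE TABLE` should see the attribute: TiDB wraps its extensions in executable comments so MySQL-compatible clients don't choke on them. A sketch of what the relevant fragment looks like (the exact shard-bit count and version tag may vary):

```sql
SHOW CREATE TABLE t3;
-- ... `id` bigint(20) NOT NULL /*T![auto_rand] AUTO_RANDOM(5) */ ...
```

A tool that strips comments when displaying or exporting DDL will silently drop the attribute, which would explain the missing AUTO_RANDOM in the exported SQL.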

| username: zhaokede | Original post link

Determine based on the business requirements.
For example, for small configuration tables where data is almost static, you can choose AUTO_INCREMENT or UUID;
For tables with large amounts of data and frequent writes, using AUTO_RANDOM is more suitable as it can solve write hotspots.

| username: wluckdog | Original post link

AUTO_INCREMENT: Recommended for small tables with low data volume and infrequent access, where hotspot Regions are unlikely to occur.
UUID: As a varchar type, it is not recommended for large tables because the primary key is long, takes up a lot of space, and is slower to search.
AUTO_RANDOM: Can be used to avoid hotspots, but has the drawback that inserting explicit values into the column requires changing a parameter first.
If no primary key is set, TiDB generates a hidden one by default. In general it is recommended to set a primary key. Note that a varchar primary key behaves like a MySQL secondary index: lookups go through the hidden, invisible internal row ID.
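
The "parameter modification" mentioned for explicit AUTO_RANDOM inserts refers to a system variable; a sketch of what this looks like in TiDB (table and values are hypothetical):

```sql
-- By default, an INSERT that supplies an explicit value for an
-- AUTO_RANDOM column is rejected; this variable permits it
SET @@allow_auto_random_explicit_insert = true;

INSERT INTO t3 (id, v) VALUES (12345, 'manual');
```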

For varchar primary keys with large data volumes, scattering a non-clustered table with SHARD_ROW_ID_BITS is sufficient.
For bigint primary keys in clustered tables, it is recommended to use random numbers only: modify the application to generate random values for insertion.

| username: mono | Original post link

If you have high-performance SSDs, in most cases, using auto_increment is sufficient. If you expect that a single SSD won’t handle the write load, then use auto_random.

| username: Kongdom | Original post link

:flushed: Not considering the business primary key at all?

| username: mono | Original post link

It is generally not recommended to use business-related fields as primary keys.

| username: zhanggame1 | Original post link

Fields with business significance are also fine, don’t be too rigid. Using clustered tables with business-significant fields can reduce the need for a secondary index, making both writes and queries faster.

| username: TiDBer_小阿飞 | Original post link

Business primary keys cannot replace surrogate (non-business) primary keys.
A business key must carry business meaning, so it can only be of varchar type, and it is generated according to certain rules, which makes it inevitably unordered. These two points make it unsuitable as the true primary key of a data table.

| username: zhanggame1 | Original post link

Where does it say it can only be of varchar type?
And "generated according to certain rules, so inevitably unordered"? If there are rules, it is ordered; the rules can't be a random algorithm, right?

We use a bunch of fields as a composite primary key, and there's no problem with that.
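
A composite business key as the clustered primary key might look like this (a sketch; the table and column names are hypothetical, and the explicit CLUSTERED keyword assumes TiDB v5.0 or later):

```sql
-- Clustered on the business key: queries that filter on a prefix of
-- (order_no, line_no) hit the row directly, no secondary index needed
CREATE TABLE order_items (
    order_no VARCHAR(32) NOT NULL,
    line_no  INT NOT NULL,
    sku      VARCHAR(32),
    qty      INT,
    PRIMARY KEY (order_no, line_no) CLUSTERED
);
```

The trade-off discussed in this thread still applies: if `order_no` is generated in roughly increasing order, writes will concentrate at the end of one Region.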

| username: Kongdom | Original post link

:joy: That might really be different. Our development standard here is to use business keys whenever possible because the speedup from a primary key query makes it worth using business keys. :yum:

| username: Kongdom | Original post link

:thinking: Why? For example, using the most commonly used business order number as the primary key, querying it should definitely be faster than using it as an index. When I used SQL Server before, the requirement was to use a clustered unique index for business primary keys and a non-clustered primary key for GUIDs.

| username: Kongdom | Original post link

:thinking: Here is a question, I remember that the choice of primary key affects read and write performance. Business primary keys have better read performance, while random primary keys have better write performance?