Significant query time difference between sorting by ID and non-ID fields in TiDB tables created based on MySQL methods. Is it a table creation SQL issue or something else? (Both ID and non-ID fields have indexes; occurs in both row and column storage)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 通过id排序查询和非id字段查询时间有很大差异,测试tidb中表都有这种情况,一直基于MySQL方式创建表,是创建表SQL问题还是其他问题?(id和非id字段都有索引)(走行存和列存都有这种情况)

| username: xiaoxiaozuofang

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: There is a significant difference in query time between sorting by id and querying by non-id fields. This situation is observed in all tables tested in TiDB. Tables were always created based on MySQL methods. Is it an issue with the table creation SQL or something else? (Both id and non-id fields have indexes)
The query SQL and table creation statement are as follows:
EXPLAIN ANALYZE SELECT
id,
bid_id,
UNIX_TIMESTAMP(publish_time) AS publish_stamp,
title,
views,
bid_summary
FROM
Call_Accounting.news
WHERE
STATUS = 1
AND 1 = 1
AND city = 500200
AND s_subscopeclass = 0701
ORDER BY
publish_time DESC
LIMIT
0, 4;

Table creation statement:

CREATE TABLE news (
id bigint(20) NOT NULL AUTO_INCREMENT COMMENT ‘Identifier’,
title text DEFAULT NULL COMMENT ‘News Title’,
detail text DEFAULT NULL COMMENT ‘News Details’,
href varchar(200) DEFAULT NULL COMMENT ‘Source URL’,
bid_id varchar(50) DEFAULT NULL COMMENT ‘Bid Identifier ID’,
publish_time datetime DEFAULT NULL COMMENT ‘Publish Time’,
toptype varchar(50) DEFAULT NULL COMMENT ‘Primary Information Type’,
subtype varchar(50) DEFAULT NULL COMMENT ‘Secondary Information Type’,
s_topscopeclass varchar(50) DEFAULT NULL COMMENT ‘Primary Industry Classification’,
s_subscopeclass varchar(50) DEFAULT NULL COMMENT ‘Secondary Industry Classification’,
budget double DEFAULT NULL COMMENT ‘Budget Amount’,
bidamount double DEFAULT NULL COMMENT ‘Bid Amount’,
area varchar(30) DEFAULT NULL COMMENT ‘Province’,
city varchar(50) DEFAULT NULL COMMENT ‘City’,
views int(11) DEFAULT NULL COMMENT ‘Views’,
page_title varchar(300) DEFAULT NULL COMMENT ‘Page Title’,
bid_keywords varchar(100) DEFAULT NULL COMMENT ‘Bid Keywords’,
bid_summary text DEFAULT NULL COMMENT ‘Bid Summary’,
status tinyint(4) DEFAULT ‘0’ COMMENT ‘0 Not Usable\n1 Usable\n2 Deleted Information’,
district varchar(50) DEFAULT NULL COMMENT ‘District’,
copeclass varchar(20) DEFAULT NULL COMMENT ‘Goods, Services, Engineering’,
buyer varchar(100) DEFAULT NULL COMMENT ‘Purchasing Unit’,
comeintime datetime DEFAULT NULL COMMENT ‘Storage Time’,
winner text DEFAULT NULL COMMENT ‘Winning Unit’,
PRIMARY KEY (id) /*T![clustered_index] CLUSTERED */
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=4530001;

]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: Jasper | Original post link

Since the id is the primary key and the primary key itself is ordered, sorting based on id utilizes the orderliness of the index. From the execution plan execution info, you should be able to see keep order: true. Sorting with other columns involves additional sorting operations, so it will be slower.

| username: tidb菜鸟一只 | Original post link

It looks like this table should be a clustered index table. Its key itself is composed of id, and adjacent ids basically exist in the same region. Querying by id in order will definitely be faster.

| username: zhanggame1 | Original post link

There is a significant difference between sorting queries by ID and non-ID fields if it is a clustered table. The data in a clustered table is ordered, so when sorted by ID, the database knows where the data is stored on the disk as soon as it gets the ID. For non-ID fields with indexes, it needs to first look up the index, find the address where the data is stored from the index, and then search the table. From an I/O perspective, this adds at least one more I/O operation.

| username: redgame | Original post link

Is the data distribution uneven?

| username: ealam_小羽 | Original post link

I think it would be helpful to post the specific execution plan.

| username: cy6301567 | Original post link

In TiDB, the significant difference in query time between sorting by ID and querying by non-ID fields may be due to various factors. Here are some possible explanations:

  1. Primary Key Index: Typically, the primary key of a table is used as the primary key index, which means sorting queries by ID. If your table’s primary key index is well-configured, then sorting queries by ID may be very fast because the primary key index is an efficient data structure.
  2. Other Indexes: Queries on non-ID fields may involve other types of indexes. If these indexes are not well-maintained or the query involves unevenly distributed data, it may lead to decreased query performance.
  3. Uneven Data Distribution: If the data in the table is unevenly distributed, queries on non-ID fields may require reading data across multiple nodes, affecting performance. This could be related to the sharding method in a distributed architecture.
  4. Data Type and Index Size: Non-ID fields may contain more data, and the index size may be larger, which can also affect query performance. Larger indexes require more memory for query optimization and processing.
  5. Query Plan: The query optimizer may choose different strategies when generating the query plan, depending on the specific structure and conditions of the query. Different query plans can lead to performance differences.
| username: TiDBer_vfJBUcxl | Original post link

Primary key indexes are faster. Analysis: The InnoDB search engine’s indexes are implemented using a B+Tree data structure. When using a unique index (or other index queries), the unique index (or other index) is first used to find the leaf node (the primary key is stored on the leaf node), then the primary key index is used to find the data page on the leaf node (this operation is called a table lookup), and then the data is returned to the Server layer. However, when querying with a primary key index, the data page on the leaf node is directly found and returned to the Server layer, eliminating the table lookup step. Therefore, querying with a primary key index is faster.

| username: TiDBer_vfJBUcxl | Original post link

In MySQL, there are multiple levels of indexes, and different levels of indexes have a significant impact on query efficiency. The following is the efficiency ranking of MySQL index levels.

Clustered Index: In the case of a primary key, the physical order of the records is arranged according to the order of the primary key. Because the clustered index defines the physical order of the data, it is the most efficient when querying using the primary key.

Secondary Index: Also known as a non-clustered index, it is independent of the physical order of the table. It stores key values and row pointers to the corresponding records. Therefore, when querying using a secondary index, MySQL needs to first look up the secondary index and then use the pointer to find the table, making it slightly less efficient than a clustered index.

Full-Text Index: Full-text indexes based on the MyISAM engine can improve the efficiency of retrieving data in large text fields, but they do not support prefix queries with wildcards and are not suitable for the InnoDB storage engine.

Spatial Index: MySQL supports the use of spatial data types and spatial functions, as well as spatial indexes. Spatial indexes can include queries for fields with spatial data types.

The index of the ID segment is a clustered index, while non-ID segments are secondary indexes. Each table has only one clustered index, and even if multiple fields are used as the primary key, only one field will be selected as the clustered index.

| username: zhanggame1 | Original post link

This is talking about TiDB.