Secondary Index (Help)

translator_bot · June 21, 2024, 4:02am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 二级索引（求助）

| username: Trouble

I am a TiDB beginner and have some questions about secondary indexes. I hope to get some answers.

CREATE TABLE User (
    ID int,
    Name varchar(20),
    Role varchar(20),
    Age int,
    PRIMARY KEY (ID),
    KEY idxAge (Age)
);

1, "TiDB", "SQL Layer", 10
2, "TiKV", "KV Engine", 20
3, "PD", "Manager", 30

t10_r1 --> ["TiDB", "SQL Layer", 10]
t10_r2 --> ["TiKV", "KV Engine", 20]
t10_r3 --> ["PD", "Manager", 30]

t10_i1_10_1 --> null
t10_i1_20_2 --> null
t10_i1_30_3 --> null

At this point, if I query age=10, is the SQL execution process as follows:

tidbserver parses the SQL
Queries the index content about age on PD, locating t10_i1_10 on a certain region
Sends the execution plan to that region, gets the corresponding rowid as 1, and returns rowid=1 to the tidbserver layer
tidbserver requests PD again to get the region where rowid=1 is located
Queries the data of rowid=1 in that region

Can you please check if my understanding of steps 3, 4, and 5 in the execution process is correct? Thank you very much!!!

translator_bot · June 21, 2024, 4:02am

| username: forever | Original post link

Yes, looking at the execution plan, the task of IndexLookUp_10 is on TiKV. After retrieving the ID, the IndexLookUp_10 operator on the root, which is the TiDB server, uses TableRowIDScan_9 under IndexLookUp to get the result data.

translator_bot · June 21, 2024, 4:02am

| username: Trouble | Original post link

For the current query, tidbserver will query pd twice:
The first query to pd retrieves the region where the index value is located and reads the rowid, then returns it to tidbserver;
The second query to pd retrieves the region where the rowid is located, then reads the data and returns it to tidbserver;
Is this description correct?

translator_bot · June 21, 2024, 4:02am

| username: changpeng75 | Original post link

After obtaining the rowid from the first query, there is no need to return it to TiDB. The data can be queried directly on TiKV because it is impossible for the data to be on other regions. In fact, this can be seen from the execution plan, as the coprocessor process does not need to interact with TiDB.

translator_bot · June 21, 2024, 4:02am

| username: zhanggame1 | Original post link

For a single table query, TiKV finds the rowid through the secondary index, then TiKV retrieves the data and returns it to TiDB.

translator_bot · June 21, 2024, 4:02am

| username: Jellybean | Original post link

To return to TiDB, and in TiDB-server, it is not necessary to request PD every time to get the location of the region. To ensure efficiency, there is a cache for region and other metadata. TiDB-server first retrieves the region location from its own cache, and after obtaining the location, it directly accesses the data. This is the case in the vast majority of situations. In rare cases, if it finds that the location is incorrect after going to TiKV (due to region merge or split), it will request the latest location from PD again and update the cache.

Accessing a normal index will trigger an indexLookUp to access the table. By obtaining the row data’s _tidb_row_id through the normal index, it then goes back to the table to fetch the entire row of data based on _tidb_row_id. This process is somewhat similar to MySQL, where it first searches the secondary index tree, then searches the clustered index tree, and accesses the leaf nodes to retrieve the data.

translator_bot · June 21, 2024, 4:02am

| username: Trouble | Original post link

Thank you for the explanation regarding PD access and caching knowledge.
In MySQL, searching the index first and then searching for data based on the primary key is all done on the storage engine side.
My confusion is whether TiDB needs to first return the result to the TiDB server layer after obtaining the rowid from the TiKV index, and then the TiDB server layer fetches the data row from TiKV again based on the rowid.
Is the process as follows:
client->tidbserver->pd(cache)->tikv(index)->tidbserver->pd(cache)->tikv(table)

translator_bot · June 21, 2024, 4:02am

| username: dba远航 | Original post link

Kudos to your learning attitude.

translator_bot · June 21, 2024, 4:02am

| username: 有猫万事足 | Original post link

It needs to be read twice.
But the somewhat vague part is pd(cache), to be precise, this is a region cache.
The implementation of this region cache is in

TiDB accesses TiKV through this component. You can even use this component alone to use TiKV as a KV store rather than a database.
Within this component, it maintains a separate region cache.
The maintenance method of this region cache is as @Jellybean mentioned. This cache will only fetch from PD once during initialization, and subsequently, it will only invalidate the corresponding region in the cache and fetch the region information from PD again when the region cannot be accessed or after a certain period.
This avoids the overhead of frequently accessing PD to fetch region information.

For more details, you can check the following link:

translator_bot · June 21, 2024, 4:02am

| username: Jellybean | Original post link

Yes. The result needs to be returned to the TiDB server layer first, and then the TiDB server layer will fetch the data row from TiKV based on the rowid.

You can see in the execution plan under the task column, during IndexLookUp it is in root, which indicates that it has returned to the TiDB server layer.

translator_bot · June 21, 2024, 4:02am

| username: TIDB-Learner | Original post link

The TiDB instance will cache some metadata locally. If the result data cache exists, there is no need to access PD. If it does not exist, it will access PD and cache the latest data in TiDB. Is this the correct understanding?

translator_bot · June 21, 2024, 4:02am

| username: Trouble | Original post link

Okay, thank you very much!

translator_bot · June 21, 2024, 4:02am

| username: 有猫万事足 | Original post link

Yes, that’s roughly it. For details, see

translator_bot · June 21, 2024, 4:02am

| username: zhanggame1 | Original post link

TiDB caches the region partition information. When SQL is executed, it is parsed into region access and then sent to the corresponding TiKV. If TiKV cannot find it, TiDB will query PD.

translator_bot · June 21, 2024, 4:02am

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.