Querying Table Metadata is Particularly Slow

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 查询表元信息特别慢

| username: GreenGuan

Why does show table status from db take 10 seconds to return a result, and show table status like '%t1%' also take 10 seconds? Does anyone have ideas for optimizing this?

| username: Billmay表妹 | Original post link

This issue may be caused by the large amount of data in the table. Consider the following optimization measures:

  1. Increase the resources of the TiDB cluster, for example by adding TiDB, PD, or TiKV nodes, to raise the processing capacity of the whole cluster.

  2. Tune cluster parameters, such as the TiDB connection limits and the TiKV RaftStore and RocksDB settings, to improve overall performance.

  3. Partition or shard the table to reduce the amount of data in any single table, for example using TiDB’s partitioned tables, so that queries scan less data.

  4. Archive the table by moving historical data to other storage, again reducing the amount of data in a single table and improving query performance.

  5. Adjust the query statements and optimize the query plan, for example by using indexes and avoiding full table scans.

Which measure helps depends on the specific situation. It is recommended to first examine the execution plan of the query with EXPLAIN, identify the bottleneck, and then optimize accordingly.
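
For example, since a SHOW statement itself generally cannot be passed to EXPLAIN, a rough equivalent is to run EXPLAIN ANALYZE on the underlying information_schema.tables query (a minimal sketch; 'db' and 't1' are placeholders taken from the question):

	-- Roughly equivalent to: SHOW TABLE STATUS FROM db LIKE '%t1%';
	-- EXPLAIN ANALYZE reports the actual execution time of each operator,
	-- which helps locate where the 10 seconds are spent.
	EXPLAIN ANALYZE
	SELECT table_name, table_rows, data_length
	FROM information_schema.tables
	WHERE table_schema = 'db' AND table_name LIKE '%t1%';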

| username: GreenGuan | Original post link

I traced the SQL and found that when querying information_schema.tables, a very large number of regions have to be accessed. Below are screenshots of normal and abnormal access.

I have a question: is i_s.tables stored in memory?

  1. If so, why do we still need to send RPC requests to the regions?
  2. If not, does accessing the i_s.tables table require scanning the metadata of all tables under the instance? Or what is its access logic?

Screenshot of normal access to i_s.tables

Screenshot of slow access to i_s.tables

| username: Billmay表妹 | Original post link

In TiDB, the information_schema.tables table is a system table used to store metadata of all tables in TiDB. When querying the information_schema.tables table, TiDB generates a query plan based on the table’s metadata and sends RPC requests to the corresponding regions to fetch data.

Regarding the normal and abnormal access you described, it might be that certain regions in the cluster have become hotspots. A hotspot region receives a disproportionately high volume of requests, which affects the overall performance of the TiDB cluster, so hotspot regions should be identified and handled promptly to keep the cluster stable and performant.

As for the access logic of information_schema.tables, TiDB keeps the metadata for this table in PD and fetches it from PD at query time. Specifically, when the table is queried, TiDB asks PD for the metadata of all tables and builds a query plan based on the query conditions. It then sends RPC requests to the corresponding regions to fetch the matching data and returns it to the client.

It is important to note that since the metadata of the information_schema.tables table is stored in PD, querying this table might impose some load on PD. To reduce the load on PD, you can consider using a cache to store the metadata of the information_schema.tables table, thereby reducing the number of accesses to PD.
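
If you want to check whether hotspot regions are involved, one option (a sketch using TiDB's built-in system views; 'db' and 't1' remain placeholders) is:

	-- Regions currently reported as read/write hotspots, with the tables they belong to.
	SELECT * FROM information_schema.tidb_hot_regions;

	-- Narrowing the metadata query to one schema and table also keeps TiDB from
	-- assembling metadata for every table in the instance.
	SELECT table_name, table_rows, data_length
	FROM information_schema.tables
	WHERE table_schema = 'db' AND table_name = 't1';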

| username: GreenGuan | Original post link

Could you please confirm whether this code needs to traverse the cache on all TiDB nodes? If there are many TiDB compute nodes, would that make metadata queries slow?

	if cache, tbl := e.ctx.GetSessionVars().InspectionTableCache, e.table.Name.L; cache != nil &&
		e.isInspectionCacheableTable(tbl) {
		// TODO: cached rows will be returned fully, we should refactor this part.
		if !e.cacheRetrieved {
			// Obtain data from cache first.
			cached, found := cache[tbl]
			if !found {
				rows, err := e.retriever.retrieve(ctx, e.ctx)
				cached = variable.TableSnapshot{Rows: rows, Err: err}
				cache[tbl] = cached
			}
			e.cacheRetrieved = true
			rows, err = cached.Rows, cached.Err
		}
	} else {
		rows, err = e.retriever.retrieve(ctx, e.ctx)
	}

| username: Billmay表妹 | Original post link

This code is used to check whether the table metadata can be retrieved from the cache. If it can, the data is fetched from the cache. If there is no data in the cache, the data is retrieved from storage and stored in the cache for future use.

This code does not need to traverse the cache information on all TiDB nodes because the cache is stored on the current TiDB node. Therefore, the number of TiDB nodes does not affect the speed of querying metadata.

It should be noted that if there is no data in the cache, data needs to be retrieved from storage, which may cause some delay. However, this delay is usually acceptable and can be reduced by increasing the cache size.

| username: GreenGuan | Original post link

From the code, if the cache is not hit, it goes to this logic: rows, err := e.retriever.retrieve(ctx, e.ctx), and then the data retrieved from the disk is cached by TiDB (e.cacheRetrieved = true). What parameter can be used to extend the cache duration of the metadata?

| username: GreenGuan | Original post link

Is there any update on this?

| username: knull | Original post link

Looking at this code, it seems the query doesn't go through the cache at all? Is it mainly that there are too many tables, which is why there are so many matches and requests?

| username: GreenGuan | Original post link

Not really, just over 300 tables, which is not particularly extreme.

From what I observed, the time in the screenshot above is also around 10 seconds. I used BR to back up the entire database from the production cluster and restore it to a validation TiDB cluster. Both clusters are version 5.4.1, and the data volume and number of tables are the same on both, yet the validation cluster does not have this issue.

| username: zhanggame1 | Original post link

The data distribution is different.
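
One way to compare that (a sketch; 'db' and 't1' are placeholder names) is to look at how many regions the same tables are split into on each cluster:

	-- Regions of a single table, including which store holds each leader.
	SHOW TABLE t1 REGIONS;

	-- Approximate region count per table in a schema, for a cluster-to-cluster comparison.
	SELECT table_name, COUNT(*) AS region_count
	FROM information_schema.tikv_region_status
	WHERE db_name = 'db'
	GROUP BY table_name
	ORDER BY region_count DESC;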

| username: h5n1 | Original post link

Check what tikv-ctl --host tikv_ip:20160 region-properties -r 671883302 outputs for that region.