Is This Analysis Correct? Why Excessive Historical Versions Degrade TiDB Query Performance

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 历史版本过多导致tidb 查询性能下降到原因分析不知道是否正确

| username: Raymond

We know that when RocksDB stores data under MVCC, it keeps multiple versions of a single key. When a key accumulates many MVCC versions, range queries on secondary indexes slow down because they must scan large amounts of expired data. A possible explanation is as follows:

  1. For a range query on a secondary index, RocksDB uses seek to locate the first key. Although the newest version is stored at the front and can be found quickly, scanning the remaining keys requires repeatedly calling next to traverse every historical version of the current key before reaching the next key, which means scanning a large amount of expired version data. The reason for not calling seek again for each subsequent key is that seek is expensive: it searches the in-memory memtable first and then the SST files on disk, so seeking per key would hurt performance badly, whereas next is comparatively cheap.
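The cost asymmetry described above can be sketched with a minimal in-memory model. The key layout, timestamps, and counts below are illustrative, not TiKV’s actual encoding; the point is that one seek is followed by a next per entry, including every stale version:

```python
import bisect

# Hypothetical model of RocksDB's sorted key space. Each user key stores
# versions newest-first, encoded here as (key, -ts), mirroring the idea of
# appending a descending timestamp to the user key.
entries = sorted(
    [("k1", -30), ("k1", -20), ("k1", -10),   # 3 versions of k1
     ("k2", -25), ("k2", -5),                 # 2 versions of k2
     ("k3", -40)]                             # 1 version of k3
)

def seek(target_key):
    """Binary-search jump to the first entry >= target_key (costly in real
    RocksDB: a memtable probe plus a probe per SST level)."""
    return bisect.bisect_left(entries, (target_key, float("-inf")))

def scan_range(start_key, end_key):
    """One seek, then pure next() calls; counts every entry iterated."""
    i = seek(start_key)
    newest, next_calls = {}, 0
    while i < len(entries) and entries[i][0] < end_key:
        key, neg_ts = entries[i]
        if key not in newest:          # first hit is the newest version
            newest[key] = -neg_ts
        next_calls += 1                # stale versions still cost a next()
        i += 1
    return newest, next_calls

newest, next_calls = scan_range("k1", "k3")
print(newest)       # {'k1': 30, 'k2': 25}
print(next_calls)   # 5 entries iterated to produce only 2 live keys
```

Only two live rows come back, but five entries are iterated; with thousands of expired versions per key the ratio gets much worse, which is exactly the degradation the post describes.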
| username: xfworld | Original post link

When there are many versions, the only way to locate the needed data among them is to seek.

This is closely tied to RocksDB’s characteristics: the number of levels directly affects retrieval speed.

It is not advisable to retain too many historical versions; trigger a Compaction promptly to clean them up.

| username: 人如其名 | Original post link

Personal understanding + speculation:
I don’t think that’s the whole story. If there is a large amount of expired data and it could be skipped via seek, then in many scenarios seek would be more efficient than next: seek is akin to an index lookup (by logical key), while next is like a sequential scan (following the next key’s physical position). The problem is that you cannot know what the next piece of data is without actually calling next.

For example, the KV structure of a secondary index is index_prefix_index_field_rowID_version -> null. Suppose the index field is a date, the query is an equality condition with the value ‘20220517’, and there are many duplicate records. When retrieving in TiKV, you can only locate the first position via prefix_seek on index_prefix_index_field (although the full index key is index_prefix_index_field_rowID, you cannot construct it because the rowID is unknown), and then keep calling next, returning the data to TiKV for MVCC filtering, until the iterator becomes invalid (is_valid = false). If the query is an IN condition with multiple values, such as in(‘20220517’, ‘20220518’), then after iterating through ‘20220517’ with next, it will probe ahead with next a bounded number of times (reportedly about 10; this needs to be tested) to see whether it can reach the first key of ‘20220518’. If it can, the two keys are close together and a seek is avoided. A range condition behaves the same way as the equality case.
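The “probe with next before paying for a seek” behavior can be sketched as follows. The probe threshold of 10 is the post’s guess, not a confirmed TiKV constant, and the flat string keys are a stand-in for the real index encoding:

```python
import bisect

# Assumed near-seek threshold (the post says "seems like 10 times").
NEAR_SEEK_LIMIT = 10

# Index keys for three date values; the 20 keys for '20220519' stand in
# for entries the query does not want.
keys = (["20220517_r%03d" % i for i in range(3)] +
        ["20220518_r%03d" % i for i in range(2)] +
        ["20220519_r%03d" % i for i in range(20)] +
        ["20220601_r000"])

def near_seek(pos, target_prefix):
    """Advance with next() up to NEAR_SEEK_LIMIT steps; fall back to a real
    seek (binary search here) if the target prefix is still not reached.
    Returns (new_pos, used_fallback_seek)."""
    for _ in range(NEAR_SEEK_LIMIT):
        if pos >= len(keys) or keys[pos] >= target_prefix:
            return pos, False
        pos += 1
    return bisect.bisect_left(keys, target_prefix), True

pos1, fb1 = near_seek(3, "20220518")   # neighbour is adjacent: no seek
pos2, fb2 = near_seek(5, "20220601")   # 20 keys away: falls back to seek
print(pos1, fb1, pos2, fb2)            # 3 False 25 True
```

When the next wanted key is nearby, a few next calls are strictly cheaper; only when the gap exceeds the threshold does the iterator pay for a full seek.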

After obtaining the rowID through the secondary index, you can construct the key as rowID_version (with the ts passed by TiDB) and then directly locate the first position greater than or equal to it through RocksDB’s SeekPrefix API.
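This point-lookup path can be sketched like this; the (rowID, -ts) tuple encoding is an illustrative stand-in for the real descending-timestamp key layout:

```python
import bisect

# Versions of each row, encoded (rowID, -ts) so newer versions sort first:
# row 101 has versions at ts 40, 30, 15; row 102 has one at ts 25.
rows = sorted([(101, -40), (101, -30), (101, -15),
               (102, -25)])

def point_get(row_id, read_ts):
    """Seek straight to the newest version of row_id visible at read_ts.
    One seek suffices no matter how many older versions exist below it."""
    i = bisect.bisect_left(rows, (row_id, -read_ts))
    if i < len(rows) and rows[i][0] == row_id:
        return -rows[i][1]          # the version actually read
    return None

print(point_get(101, 35))   # -> 30: newest version <= ts 35, one seek
print(point_get(101, 10))   # -> None: no version visible at ts 10
```

Because the rowID pins down a unique row, the lookup never has to next through that row’s historical versions, which is exactly why point gets tolerate version buildup far better than index range scans.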

Therefore, when a secondary index carries many historical versions, there is no way around it: you can only use next to fetch all the data for TiKV to filter. For multiple index values, if the keys for those values are far apart, more seeks are needed. For unique indexes and primary keys (rowID), a batch of equality lookups (common when going back to the table from an index) only needs to find the first entry that satisfies the MVCC version check, because it is known that there is only one live row; even with many historical versions, they never need to be read.

So there is a typical scenario: delete from t where index_date < ‘xx’ limit 10000;. Generating historical versions while scanning the secondary index amplifies this problem: the later the batch, the more historical versions accumulate at the front of the index, so each batch must start from the first index key that meets the condition and next through everything already deleted, getting slower and slower while CPU usage climbs. Several levels of optimization are possible:

  1. Optimization: before deleting, set upper and lower bounds for each batch with between ... and ... plus limit. This removes most of the invalid scans: only the locally expired data inside each bounded range is touched, greatly reducing the scan volume.
  2. Better optimization: first find all the rowIDs of the matching records and sort them (sorting makes the rowIDs more contiguous, so next can fetch them quickly), then delete in batches joined against the sorted rowIDs; this way no expired data is scanned at all.
  3. Product-level optimization: if the amount of data to delete is very large, materializing all the rowIDs plus batch numbers may itself be time-consuming, so splitting the work into multiple upper/lower bounds via rowID + batch number + the where-condition range greatly reduces the per-batch data volume and lets each batch be deleted in memory. The batch DML currently provided officially follows this idea, and it will be even stronger once parallel execution is implemented.
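The gap between the naive limit-based delete and the bounded-batch approach can be illustrated with a toy cost model. All numbers are made up for illustration; this is not a measurement of TiKV:

```python
# Deleting N rows in batches of B via "delete ... limit B" re-scans the
# tombstones left by earlier batches every time, while precomputed
# [lower, upper) bounds visit each region of the index only once.
N, B = 10000, 1000

def naive_limit_delete_scans():
    """Entries touched when every batch restarts from the index head."""
    scanned = 0
    for batch in range(N // B):
        tombstones = batch * B        # versions deleted so far, still on disk
        scanned += tombstones + B     # must next() past all of them again
    return scanned

def bounded_delete_scans():
    """Entries touched when each batch scans only its own bounded range."""
    return N                          # each key range is visited exactly once

print(naive_limit_delete_scans())     # 55000 entries touched
print(bounded_delete_scans())         # 10000 entries touched
```

Even in this small example the naive loop touches 5.5x as many entries, and the overhead grows quadratically with the number of batches, matching the “gets slower and slower” symptom described above.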

If RocksDB could provide a next_key(offset, limit) interface (i.e., ask RocksDB in advance for up to limit consecutive keys starting offset entries ahead of the current position), the efficiency of handling such expired data might improve greatly. For example, next_key(5, 1), next_key(10, 1), next_key(max_offset=1024, 1), and so on could skip some expired data, reduce the data transferred from RocksDB to TiKV, and return multiple keys to TiKV at once, cutting the number of interactions.
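A sketch of what such a hypothetical next_key(offset, limit) interface might look like. This API does not exist in RocksDB; the class and key names below are purely illustrative:

```python
class Iter:
    """Toy iterator over keys assumed to already be in RocksDB iteration
    order (versions of one user key stored newest-first)."""
    def __init__(self, keys):
        self.keys, self.pos = keys, 0

    def next_key(self, offset, limit):
        """Skip `offset` entries, then return up to `limit` keys in one
        call, instead of one next() round trip per entry."""
        self.pos += offset
        out = self.keys[self.pos:self.pos + limit]
        self.pos += len(out)
        return out

it = Iter(["a_v3", "a_v2", "a_v1", "b_v5"])
first = it.next_key(0, 1)    # ['a_v3']: newest version of 'a'
nxt = it.next_key(2, 1)      # ['b_v5']: skip 2 stale versions in one call
print(first, nxt)
```

The caller still decides how far to jump (it cannot know a key’s version count in advance), but batching the skip and the fetch into one call would cut the per-key interaction cost the post complains about.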

Similarly, imagine providing a next(offset, limit) interface for full table scans as well; it would likely have a similar benefit. :laughing:

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.