Personal understanding + speculation:
I don’t think that’s the case. If there is a large amount of expired data, and assuming it can be skipped via seek, then in many scenarios seek is more efficient than next. After all, seek is akin to an index lookup (locating a logical key), while next is akin to a sequential scan (following the physical pointer to the next key). The catch is that you don’t know what data comes after until you read it, so you cannot build a seek target for it in advance; you can only discover it by calling next.
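To make that intuition concrete, here is a minimal runnable Rust sketch that uses a BTreeMap as a stand-in for the sorted key space (the key names and counts are made up; this is not the real TiKV/RocksDB layout or API): one seek is a single logarithmic descent, while next has to touch every expired entry in between.

```rust
use std::collections::BTreeMap;

fn main() {
    let mut kv = BTreeMap::new();
    // 10_000 expired versions under prefix "a_", followed by one live key "b_...".
    for v in 0..10_000u32 {
        kv.insert(format!("a_{v:08}"), "expired");
    }
    kv.insert("b_00000000".to_string(), "live");

    // next: walk entry by entry until the run of expired "a_" entries ends.
    let mut steps = 0;
    for (k, _) in kv.range("a_".to_string()..) {
        if !k.starts_with("a_") {
            break; // reached the first live key
        }
        steps += 1;
    }
    println!("next had to touch {steps} expired entries");

    // seek: a single logarithmic lookup lands directly on the first key >= "b".
    let (k, _) = kv.range("b".to_string()..).next().unwrap();
    println!("seek landed directly on {k}");
}
```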
For example, the KV structure of a secondary index is index_prefix_index_field_rowID_version -> null. Suppose the index field is a date, the query is an equality condition on the value ‘20220517’, and there are many duplicate records. When retrieving in TiKV, you can only locate the first position with a prefix seek on index_prefix_index_field (although the full index key is index_prefix_index_field_rowID, you cannot construct that key because the rowID is unknown), and then keep calling next, handing each entry back to TiKV for MVCC judgment, until the iterator is no longer valid for that prefix (is_valid = false). If the query is an IN condition with multiple values, e.g. in(‘20220517’, ‘20220518’), then after iterating through ‘20220517’ with next, it will try to continue with next (specifically, it peeks a few more times before giving up; the exact count needs testing, it seems to be 10) to see whether it can reach the first key of ‘20220518’. If it can, the two keys are close together and a seek is avoided. If the query is a range condition, it behaves the same as the equality case.
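A rough sketch of that scan pattern, again with a BTreeMap standing in for the engine; the key layout, the names, and the near-seek threshold of 10 are illustrative assumptions, not TiKV's actual encoding or constant.

```rust
use std::collections::BTreeMap;

fn index_key(value: &str, row_id: u64) -> String {
    // index_prefix _ index_field _ rowID (the MVCC version suffix is omitted here)
    format!("idx1_{value}_{row_id:010}")
}

fn main() {
    let mut kv: BTreeMap<String, ()> = BTreeMap::new();
    for row_id in 0..1000 {
        kv.insert(index_key("20220517", row_id), ());
        kv.insert(index_key("20220518", row_id), ());
    }

    // Equality on '20220517': one prefix seek (the rowID is unknown, so the seek
    // target stops at the value), then next until the prefix no longer matches.
    let mut iter = kv.range("idx1_20220517_".to_string()..);
    let mut rows = 0;
    let mut carry_over = None; // first key past the '20220517' run
    for (k, _) in iter.by_ref() {
        if !k.starts_with("idx1_20220517_") {
            carry_over = Some(k.clone());
            break;
        }
        rows += 1;
    }
    println!("'20220517': {rows} matching entries read with next after one seek");

    // IN ('20220517','20220518'): before seeking again, peek a bounded number of
    // next calls to see whether the '20220518' run is close enough to reach.
    const NEAR_SEEK_STEPS: usize = 10; // assumed threshold, not TiKV's constant
    let mut found_nearby = carry_over
        .as_deref()
        .map_or(false, |k| k.starts_with("idx1_20220518_"));
    for (k, _) in iter.take(NEAR_SEEK_STEPS) {
        if k.starts_with("idx1_20220518_") {
            found_nearby = true;
            break;
        }
    }
    if found_nearby {
        println!("'20220518' reached with next alone, seek avoided");
    } else {
        let _ = kv.range("idx1_20220518_".to_string()..); // fall back to a real seek
        println!("'20220518' too far, fell back to a seek");
    }
}
```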
After obtaining the rowID from the secondary index, you can construct the versioned key Key_Version as {rowID}_{version} (with the ts passed down by TiDB) and then jump directly to the first position greater than or equal to it via RocksDB’s SeekPrefix(Key_Version) interface.
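Roughly like the following sketch; the version encoding used here (u64::MAX - ts in fixed-width hex, so that newer versions sort first) and the key names are assumptions made only for illustration, not TiKV's real format.

```rust
use std::collections::BTreeMap;

// Encode rowID + version so that newer versions sort first within one rowID.
fn record_key(row_id: u64, ts: u64) -> String {
    let encoded_ts = u64::MAX - ts;
    format!("r_{row_id:010}_{encoded_ts:016x}")
}

fn main() {
    let mut kv: BTreeMap<String, String> = BTreeMap::new();
    // rowID 42 has many historical versions, committed at ts = 10, 20, ..., 100.
    for ts in (10..=100u64).step_by(10) {
        kv.insert(record_key(42, ts), format!("value@{ts}"));
    }

    // Read at ts = 65: seek once to {rowID}_{encoded 65}. Because newer versions
    // sort first, the first key at or after the target within this rowID's prefix
    // is the newest version with commit_ts <= 65. For a unique key / rowID, that
    // single entry is the whole answer; older versions are never touched.
    let read_ts = 65;
    let seek_target = record_key(42, read_ts);
    if let Some((k, v)) = kv.range(seek_target..).next() {
        if k.starts_with("r_0000000042_") {
            println!("visible version at ts {read_ts}: {v}"); // prints value@60
        }
    }
}
```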
Therefore, when the secondary index carries many historical versions there is really no way around it: you can only next through all of them and hand them to TiKV for MVCC judgment. When there are multiple index values, if their keys are far apart, more seeks are needed. For unique indexes and primary keys (rowID), a batch of equality lookups (common when going back to the table from an index) only needs to find the first version that satisfies the MVCC read for each key: since it is certain that only one row matches, the remaining historical versions never need to be read, however many there are.
So there is a typical scenario: delete from t where index_date < ‘xx’ limit 10000. Generating historical versions on the very index being scanned easily amplifies the problem: the further the deletion progresses, the more historical versions accumulate at the front of the range, so each batch has to start from the first index key that satisfies the condition and next through all of those already-deleted entries before reaching live rows, which makes it slower and slower while CPU usage keeps climbing.
Optimization: before deleting, compute upper and lower bounds and batch the deletes with between … and … plus limit, so each batch only scans a local slice (including only the expired data inside that slice), greatly reducing the scan volume.
Better optimization: first find all the rowIDs of the matching records, sort them (the point of sorting is to make the rowIDs as contiguous as possible so they can be fetched quickly with next), and then delete in batches by joining against the sorted rowIDs; this way no expired data is scanned at all.
Product optimization: if the amount of data to delete is very large, you have to store all the rowIDs and join them batch by batch, and persisting rowIDs, batch numbers, and so on can itself be time-consuming; splitting the work into multiple upper and lower bounds via rowID + batch + the where-condition range keeps each slice small enough to be deleted batch by batch in memory. The batch DML currently provided officially follows exactly this idea, and it will be even stronger once parallel execution is implemented.
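A small sketch of the "collect rowIDs first, sort, then delete in bounded batches" idea; the table name, column names, and rowID set are made up, and the per-batch statements are only built as strings to show their shape.

```rust
// Cut sorted rowIDs into disjoint batches and build one bounded DELETE per batch.
fn batched_deletes(mut row_ids: Vec<u64>, batch_size: usize) -> Vec<String> {
    // Sorting keeps each batch's rowIDs adjacent, so the delete's scan can walk
    // them with next instead of seeking around.
    row_ids.sort_unstable();
    row_ids
        .chunks(batch_size)
        .map(|chunk| {
            let (lo, hi) = (chunk[0], chunk[chunk.len() - 1]);
            // The BETWEEN bounds keep each statement's scan local, so later
            // batches never re-read the delete marks left by earlier ones.
            format!("DELETE FROM t WHERE id BETWEEN {lo} AND {hi} AND index_date < '20220517'")
        })
        .collect()
}

fn main() {
    // Pretend these rowIDs came from a prior SELECT of the matching records.
    let row_ids: Vec<u64> = (1..=25_000).filter(|id| id % 3 != 0).collect();
    for stmt in batched_deletes(row_ids, 10_000).iter().take(3) {
        println!("{stmt}");
    }
}
```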
If RocksDB could provide a next_key(offset, limit) interface (that is, asking RocksDB up front to return the next limit consecutive keys starting offset keys ahead of the current position), the efficiency of handling this kind of expired data might improve a lot. For example, next_key(5,1), next_key(10,1), next_key(max_offset=1024,1) and so on could be used to try to hop over some expired data, reducing the data shipped from RocksDB to TiKV and also returning several keys to TiKV in a single call, cutting down the number of interactions.
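A sketch of what such a hypothetical next_key(offset, limit) could look like, emulated here on top of an ordinary iterator over a BTreeMap; in a real engine the skipping would happen internally without handing every entry back to the caller, which is exactly the point of the proposal.

```rust
use std::collections::BTreeMap;

// Hypothetical cursor exposing the proposed next_key(offset, limit) call.
struct KeyCursor<'a> {
    iter: std::collections::btree_map::Range<'a, String, String>,
}

impl<'a> KeyCursor<'a> {
    /// Skip `offset` entries, then return up to `limit` keys in one call.
    fn next_key(&mut self, offset: usize, limit: usize) -> Vec<String> {
        self.iter
            .by_ref()
            .skip(offset)
            .take(limit)
            .map(|(k, _)| k.clone())
            .collect()
    }
}

fn main() {
    let mut kv = BTreeMap::new();
    for i in 0..100u32 {
        kv.insert(format!("k{i:04}"), String::from("v"));
    }
    let mut cursor = KeyCursor { iter: kv.range("k0000".to_string()..) };

    // Hop over runs of (presumably expired) entries with single calls instead of
    // one next per entry and one reply per key.
    println!("{:?}", cursor.next_key(5, 1));  // skips 5 keys, returns the 6th
    println!("{:?}", cursor.next_key(10, 1)); // skips another 10
    println!("{:?}", cursor.next_key(0, 3));  // returns the next 3 keys at once
}
```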
Similarly, one can imagine a next(offset, limit) interface for full table scans, which should have a similarly good effect.