Several Issues with GC and Compaction

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: GC和compact的几个问题

| username: mydb

【TiDB Usage Environment】Production Environment / Test / Poc
Production Environment

【TiDB Version】
v5.1.2

【Reproduction Path】What operations were performed when the issue occurred
【Encountered Issues: Issue Phenomenon and Impact】
【Issue 1】
After a key has been processed by GC but before compaction, does a SQL query still need to scan these keys, or does it skip them directly? If it skips them, how is that achieved and what is the principle behind it?

【Issue 2】
You can use the following interface to find the multi-version data of a row, but from its output you cannot tell whether the versions have been GC'd or not. Only after GC + compaction does the multi-version data returned by this interface disappear completely.
curl http://192.168.1.1:10080/mvcc/key/dbname/table_name/343930187735040

Is there an interface to check whether the multi-version data has been GC’d?
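
For context, the GC safepoint that decides which versions are eligible for cleanup can be read from the mysql.tidb table (the same query used in the test later in this thread). This is a minimal sketch, not a per-key check: versions older than tikv_gc_safe_point are candidates for GC, but as discussed below they may still show up in the MVCC API until compaction actually removes them.

-- Sketch: check the cluster's GC progress and safepoint
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_life_time', 'tikv_gc_last_run_time', 'tikv_gc_safe_point');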

【Issue 3】
bug-11217: Multi-key GC call causes GC to not work, leaving a large number of historical versions

Regarding this issue, we adopted the method of setting gc.enable-compaction-filter: false to disable TiKV’s compaction filter GC and use the old GC mode for multi-version GC.
However, the effect is not very good. Here is a specific example:
The following SQL was executed on November 29th. It uses a covering index to find 100 rows of data, and create_time<=1669267277441 translates to create_time<='2022-11-24 13:21:17'.
Note that we had already disabled the GC compaction filter and switched to the old GC method on November 23rd.
desc analyze SELECT id FROM table_xxx FORCE INDEX(idx_essyncstate_bizproduct) WHERE (biz_id=300 AND es_sync_state=1 AND product_id=300 AND create_time<=1669267277441) ORDER BY create_time DESC LIMIT 100\G
*************************** 4. row ***************************
id: └─Limit_22
estRows: 4.12
actRows: 189
task: cop[tikv]
access object:
execution info: tikv_task:{proc max:485ms, min:70ms, p80:412ms, p95:456ms, iters:46, tasks:43}, scan_detail: {total_process_keys: 189, total_keys: 34711190, rocksdb: {delete_skipped_count: 610, key_skipped_count: 34712398, block: {cache_hit_count: 29983, read_count: 1866, read_byte: 38.9 MB}}}
operator info: offset:0, count:100
memory: N/A
disk: N/A

【Issue 4】
show config where type='tikv' and name like '%enable-compaction-filter%';
set config tikv gc.enable-compaction-filter=false;
In version 5.1.2, can turning off this parameter completely work around the GC bug? Are there other risks (if a large number of production clusters use the old GC method)?
The official recommendation is not to modify parameters online, so is it recommended to use tikv-ctl to disable the new GC method instead?
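
A side note on the two statements above (a sketch, not version-specific advice): SET CONFIG only changes the value on the running TiKV instances and is not persisted, so an instance falls back to its configuration file after a restart. It is worth verifying the running value on every instance after changing it, and also setting the same option in the TiKV configuration file if the change should survive restarts.

-- Returns one row per TiKV instance; confirm the running value is false everywhere.
SHOW CONFIG WHERE type = 'tikv' AND name = 'gc.enable-compaction-filter';
-- For persistence, the equivalent entry in the TiKV config file is:
--   [gc] enable-compaction-filter = false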

【Issue 5】
What do delete_skipped_count and key_skipped_count in the execution plan mean?
There are multiple different explanations for these two fields in various places, but I still don’t understand what they mean. Can you help explain them for better issue analysis?
Version 1
delete_skipped_count: Indicates that the key has been GC’d, status is tombstone, but not yet compacted.
key_skipped_count: Indicates that the same key has multiple MVCC versions, and the GC time has not passed.
Version 2
Rocksdb_delete_skipped_count: The number of deleted keys scanned during RocksDB data reading.
Rocksdb_key_skipped_count: The number of deleted (tombstone) keys encountered during RocksDB data scanning.
Version 3
In a post on asktug

Here is a test:
create table t123 (id int primary key, name varchar(100), age int, city varchar(100));
alter table t123 add index idx_name(name);
insert into t123 values (1,'user1',1,'bj');
insert into t123 values (2,'user2',2,'bj');
insert into t123 values (3,'user3',3,'bj');
insert into t123 values (4,'user4',4,'bj');
insert into t123 values (5,'user5',5,'bj');
insert into t123 values (6,'user6',6,'bj');
insert into t123 values (7,'user7',7,'bj');
insert into t123 values (8,'user8',8,'bj');
insert into t123 values (9,'user9',9,'bj');
insert into t123 values (10,'user10',10,'bj');

desc analyze select name from t123 where name='user1'\G

There is 1 row of name='user1' data in the table, no updates or deletions have been performed, and part of the execution plan is as follows:
scan_detail: {total_process_keys: 1, total_keys: 2, rocksdb: {delete_skipped_count: 1, key_skipped_count: 2

There are 3 rows of name='user1' data in the table, no updates or deletions have been performed, and part of the execution plan is as follows:
scan_detail: {total_process_keys: 3, total_keys: 4, rocksdb: {delete_skipped_count: 3, key_skipped_count: 6
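
If it helps with the analysis, the same counters are also recorded per statement in the slow query log. Here is a minimal sketch for pulling them out of INFORMATION_SCHEMA.SLOW_QUERY (assuming your version exposes the Rocksdb_* columns, which were added around v5.1; the threshold of 1,000,000 is arbitrary):

-- Recent slow statements that skipped a large number of RocksDB tombstones/versions
SELECT Time, Query_time, Total_keys, Process_keys,
       Rocksdb_delete_skipped_count, Rocksdb_key_skipped_count
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE Rocksdb_key_skipped_count > 1000000
ORDER BY Time DESC
LIMIT 20;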

Thank you all

【Resource Configuration】
【Attachments: Screenshots / Logs / Monitoring】

| username: Lucien-卢西恩 | Original post link

It depends on how the key is deleted. If it was removed by a DELETE operation, the key will still be scanned before compaction; a DROP/TRUNCATE TABLE operation, on the other hand, does not need to wait for compaction.
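
To make the distinction concrete, here is a small illustration (a sketch based on the behavior described above, reusing the t123 table from the original post, not an exact internal trace):

-- Row-level DELETE: writes an MVCC delete mark; the old versions of the key stay
-- in RocksDB and are still iterated over until GC + compaction remove them.
DELETE FROM t123 WHERE id = 1;

-- TRUNCATE / DROP TABLE: the table switches to a new key range, and the old range
-- is removed wholesale via a delete-ranges task after the GC lifetime, so queries
-- do not have to scan the dropped keys one by one.
TRUNCATE TABLE t123;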

This query shows the current version status. If it has been GC’d, it will not be displayed here.

It is recommended to upgrade the version, as the latest branch has already resolved this issue.

Disabling compaction-filter is equivalent to reverting to the initial GC mode, which practically affects the efficiency of GC cleanup. It is recommended to upgrade to a stable version with compaction-filter.

Refer to the explanation in Version 2; this is internal RocksDB logic. Seeing it in the execution plan can help estimate the impact on actual queries.

| username: mydb | Original post link

Thank you for the reply.

Question 1
Section 2 of the article "The 'Blood Case' Triggered by Tombstone Key" says that after GC, before compaction, an optimization is applied so that these keys do not need to be scanned.

Question 2
I have tested this, and it behaves this way in the current version: the multi-version data only disappears after compaction. Tests were done on v5.1.4 and v5.4.
But how can I confirm whether the MVCC data has been GC'd? For example, if I insert a row and update it 10 times, how can I confirm whether those 9 old MVCC versions have been GC'd?
If that can be confirmed, it becomes easier to check whether those versions still exist before compaction.

Question 5
The cluster enabled the old GC on November 23, and below are the recorded execution plans for the same SQL.

November 30, took 13.23 seconds
total_keys: 31732179, rocksdb: {delete_skipped_count: 135, key_skipped_count: 31735728

December 1, took 3.28 seconds
total_keys: 7873311, rocksdb: {delete_skipped_count: 230, key_skipped_count: 7874853

December 2, took 12.59 seconds
total_keys: 13972030, rocksdb: {delete_skipped_count: 14220527, key_skipped_count: 41315019

December 5, took 0.93 seconds
total_keys: 2166513, rocksdb: {delete_skipped_count: 703, key_skipped_count: 2179817

| username: Lucien-卢西恩 | Original post link

This is determined by gc_life_time: GC cleans up MVCC version data older than the GC safepoint, which is roughly the current time minus gc_life_time. At the moment, the way to view MVCC versions is through the TiDB API query.
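
For reference, the retention window itself can be inspected and adjusted from SQL. A minimal sketch, assuming v5.0 or later where tidb_gc_life_time is exposed as a system variable:

-- How long historical versions are kept before they become eligible for GC
SHOW GLOBAL VARIABLES LIKE 'tidb_gc_life_time';
-- Versions newer than (now - tidb_gc_life_time) are always kept; to retain
-- more MVCC history, the window can be enlarged, e.g.:
-- SET GLOBAL tidb_gc_life_time = '24h';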

Apart from the execution time of this step, are all the other parts the same? Looking at these 5 records, the total_keys actually scanned differs each time. What about total_process_keys? And how many rows is this SQL expected to return?

| username: mydb | Original post link

Apart from this step, the other parts of the plan took very little time, so I did not record them.
Since total_process_keys was basically around 100 each time, I did not record it either. Below is one of the records:
December 6th, time taken: 0.79 seconds
total_process_keys: 173, total_keys: 1835642, rocksdb: {delete_skipped_count: 417, key_skipped_count: 1837387

| username: h5n1 | Original post link

My understanding:

  1. RocksDB uses an iterator to scan the memtable and SST files, doing a prefix_seek on the key; the iterator then uses Next() to range-scan from the located start key. During this scan it encounters MVCC versions, tombstone data, and so on.

  2. The explanation of the skipped counts is quoted below: delete_skipped_count counts tombstone keys left by deletion (in TiKV, keys that have been GC'd), while key_skipped_count additionally covers other skipped keys such as MVCC versions.

When deleting a key, RocksDB simply puts a marker, called tombstone, into the memtable. The original value of the key will not be removed until we compact the files containing the keys with the tombstone. The tombstone may even live longer even after the original value is removed. So if we have lots of consecutive keys deleted, a user may experience slowness when iterating across these tombstones. Counters internal_delete_skipped_count tell us how many tombstones we skipped. internal_key_skipped_count covers some other keys we skip.

  3. After turning off the compaction-filter, GC returns to the traditional mode, scanning the data and then marking it for deletion. This process also goes through Raft, and the GC efficiency might be lower compared to the compaction-filter mode.

| username: mydb | Original post link

Thank you for your reply.

| username: mydb | Original post link

v5.1.4 MVCC Test

create table t123 (id int primary key, name varchar(100), age int, city varchar(100));
insert into t123 values (1, 'user1', 1, 'bj');
update t123 set name='user2' where id=1;
update t123 set name='user3' where id=1;

# curl http://192.168.1.13:10080/mvcc/key/test/t123/1
{
  "key": "748000000000000C0B5F728000000000000001",
  "region_id": 75055,
  "value": {
    "info": {
      "writes": [
        {
          "start_ts": 437878836284620809,
          "commit_ts": 437878836284620812,
          "short_value": "gAADAAAAAgMEBQAGAAgAdXNlcjMBYmo="
        },
        {
          "start_ts": 437878822036832288,
          "commit_ts": 437878822049939459,
          "short_value": "gAADAAAAAgMEBQAGAAgAdXNlcjIBYmo="
        },
        {
          "start_ts": 437878822036832283,
          "commit_ts": 437878822036832284,
          "short_value": "gAADAAAAAgMEBQAGAAgAdXNlcjEBYmo="
        }
      ]
    }
  }
}

Current GC information
select VARIABLE_NAME, VARIABLE_VALUE from mysql.tidb;
| tikv_gc_last_run_time | 20221207-08:56:56 +0800
| tikv_gc_safe_point | 20221207-08:46:56 +0800

After waiting for GC, check the MVCC data again as follows
select VARIABLE_NAME, VARIABLE_VALUE from mysql.tidb;
| tikv_gc_last_run_time | 20221207-09:16:56 +0800
| tikv_gc_safe_point | 20221207-09:06:56 +0800

# curl http://192.168.1.13:10080/mvcc/key/test/t123/1
{
  "key": "748000000000000C0B5F728000000000000001",
  "region_id": 75055,
  "value": {
    "info": {
      "writes": [
        {
          "start_ts": 437878836284620809,
          "commit_ts": 437878836284620812,
          "short_value": "gAADAAAAAgMEBQAGAAgAdXNlcjMBYmo="
        },
        {
          "start_ts": 437878822036832288,
          "commit_ts": 437878822049939459,
          "short_value": "gAADAAAAAgMEBQAGAAgAdXNlcjIBYmo="
        },
        {
          "start_ts": 437878822036832283,
          "commit_ts": 437878822036832284,
          "short_value": "gAADAAAAAgMEBQAGAAgAdXNlcjEBYmo="
        }
      ]
    }
  }
}

Test result:
This is not what I expected: why is the MVCC data of this row still present after GC has run?

| username: Lucien-卢西恩 | Original post link

It is expected that the versions are still visible through the TiDB MVCC API. What TiDB's GC actually does is compute the safepoint and store it on the PD side; from TiDB's perspective, GC is then considered complete. In reality, however, TiKV requests the safepoint from the PD server according to its own internal task-triggering logic, and only then starts deleting old MVCC versions through compaction tasks based on that safepoint. The removal of old versions is only truly complete once both the TiDB side and the TiKV side have finished, so it is expected that the MVCC versions seen through the API query have not been cleaned up yet.

In addition, because the peers of each Region are on different TiKV instances and each TiKV instance runs its GC tasks asynchronously, some peers may have already cleaned up the old MVCC versions while others have not, so it cannot be guaranteed that the data older than the safepoint is cleaned up consistently across peers. In other words, the old MVCC versions you see here are not guaranteed to disappear at any particular moment.

With the Compaction Filter enabled, GC efficiency is higher than before, because the compaction filter takes over part of the work that the old single-threaded GC worker used to do and handles it asynchronously on the TiKV side. This lets more TiKV instances participate in deleting keys, which speeds up the processing.

| username: h5n1 | Original post link

  • Optimize write operations and Batch Get in scenarios with many Tombstones #9729

You can take a look at this.

| username: mydb | Original post link

Thank you.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.