Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: A question: why does TiDB's memory also grow when deleting data? Shouldn't this basically just be an operation on KV?
[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version] 6.5
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issues: Issue Phenomenon and Impact]
[Resource Configuration]
[Attachments: Screenshots / Logs / Monitoring]
As the title asks. There are also many questions like this in the group about inserting large amounts of data.
Deleting data also generates data with a deletion mark inside TiDB, which is then written to TiKV on commit.
You have to find the rows first. How can you delete them without finding them? After all, you are executing the statement in TiDB.
If you delete by primary key, it should not increase memory. If you delete by a non-primary-key condition, TiDB has to read the KV pairs back and decode them to check whether they meet the deletion condition. In fact, the deletion process is similar to a query: the rows that meet the condition are queried first and then deleted.
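To make the contrast concrete, here is a minimal sketch, assuming a hypothetical table t with an integer primary key a and a non-indexed column b (none of these names come from the thread); EXPLAIN can be used to see which plan each delete gets, and the exact operator names depend on the TiDB version and statistics:

-- Hypothetical table, for illustration only.
CREATE TABLE t (a BIGINT PRIMARY KEY, b INT);

-- Delete by primary key: TiDB can locate the row directly (a point get),
-- so very little data has to be pulled back and decoded in tidb-server.
EXPLAIN DELETE FROM t WHERE a = 100;

-- Delete by a non-indexed column: TiDB has to scan the table, read the
-- matching rows back, and build delete mutations for them, so tidb-server
-- memory grows with the number of rows involved.
EXPLAIN DELETE FROM t WHERE b = 5;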
After deleting a large amount of data, there will be many useless keys remaining, which will affect query efficiency, right?
This is the best answer. In fact, for TiDB a deletion is essentially also an insertion (of a deletion mark).
Why wouldn't deleting by primary key increase memory? A delete also has to query the data, convert it into key-value pairs, and go through operations similar to an insert. Besides, aren't deletion and insertion essentially the same apart from the marker?
I roughly mean this:
If you delete by the primary key, it is a point-get operation to fetch the data; there is no need to scan the entire table, and TiDB knows directly which row it is.
If you delete by something other than the primary key, you have to scan the table, fetch the rows one by one, and decode them to find out what the value of column b is. At that point, tidb-server's memory usage grows quickly.
In other words, KV does not know what "column b" in the SQL is; KV only records key:value. For this table it is roughly {tableid}_a: encode(a,b), which means the value has to be fetched back to TiDB and decoded before the value of column b is known. That is why such non-primary-key deletions occupy a lot of memory.
If we are only talking about memory growth, everything grows: parsing a SQL statement or creating a new connection also takes memory. But a non-primary-key delete has to fetch and scan everything, which definitely takes a lot of memory.
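A rough sketch of the key-value mapping being described, again using the hypothetical table t(a, b) with a as the primary key; the real TiDB row-key format is t{tableID}_r{rowID}, written out here only informally:

-- Informal sketch of how rows of t (a BIGINT PRIMARY KEY, b INT) are stored:
--   key:   t{tableID}_r{a}
--   value: encode(b, ...)
-- TiKV only sees opaque key/value pairs like these. To evaluate a condition
-- such as b = 5 and to build the delete mutations, the matching rows are read
-- back to tidb-server and the mutations are buffered in the transaction until
-- commit, which is where the memory growth during a large delete comes from.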
Isn't that statement inaccurate? If it is not the primary key, say it is a customer table and you look up and delete by ID card number (the ID card number is an index, not the primary key). Then you first look the record up through the index, lock it, and then delete it, which means you only need to fetch that record's primary key; you do not need to scan the entire table.
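A hedged sketch of that case, with a hypothetical customer table and index (all identifiers and the sample value below are assumptions, not from the thread):

-- Hypothetical table and secondary index, for illustration only.
CREATE TABLE customer (
    id      BIGINT PRIMARY KEY,
    id_card VARCHAR(18),
    name    VARCHAR(64),
    KEY idx_id_card (id_card)
);

-- With the secondary index on id_card, TiDB should be able to locate the
-- matching handles through an index range scan instead of a full table scan;
-- EXPLAIN shows which plan is actually chosen.
EXPLAIN DELETE FROM customer WHERE id_card = '110101199001010000';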
begin;
many INSERT statements  -- all rows are held at the tidb-server layer and written to TiKV during the prewrite phase
commit;

begin;
many UPDATE statements  -- read all rows that meet the condition; rows that need no change are dropped while the data is organized, rows that do need changes are organized into new rows (inserts) and held at the tidb-server layer, then written to TiKV during the prewrite phase
commit;

begin;
many DELETE statements  -- read the rowids of all rows that meet the condition and write the deletion marks to TiKV during the prewrite phase
commit;
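Since all of those mutations are held at the tidb-server layer until prewrite, one common way to keep memory bounded for a large delete is to split it into smaller transactions; a minimal sketch, assuming the same hypothetical table t and autocommit:

-- Delete in batches: each statement is its own transaction, so only up to
-- 10000 rows' worth of mutations are buffered in tidb-server at a time.
-- Repeat (from a script or application) until it affects 0 rows.
DELETE FROM t WHERE b = 5 LIMIT 10000;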
You’re right, with an index, a full table scan is not necessary.