In general scenarios, TiKV's copTask scans data in memory too slowly, slower than MySQL, PostgreSQL, etc

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 通用场景下tikv的copTask在内存中扫描数据太慢,比mysql、pg等都慢

| username: 人如其名

[TiDB Usage Environment] PoC
[TiDB Version] 6.6
[Reproduction Path] Perform a full table scan to observe the behavior of copTask in TiKV
[Encountered Problem: Phenomenon and Impact] copTask scans data too slowly
[Resource Configuration] All operations are single-threaded and run entirely in memory, so resource configuration is not a factor

In general scenarios, scanning data in TiKV turns out to be slower than in MySQL or PostgreSQL. If the LSM tree were the cause, that alone would not explain it: other LSM-tree-based databases do not scan any worse than MySQL. In actual tests, the full table scan of a single copTask in TiKV is 1-2 times slower than single-threaded MySQL and about 5 times slower than single-threaded PostgreSQL. Why are TiKV scans so slow, and is there room for optimization in the future?

Since all reads here come from memory, the actual block size setting does not matter much, but it is listed for reference. The test statement is select sum(C_NATIONKEY) from customer;, which performs a full table scan. The table structure is the same in all databases:

mysql> show create table customer \G
*************************** 1. row ***************************
       Table: customer
Create Table: CREATE TABLE `customer` (
  `C_CUSTKEY` bigint NOT NULL,
  `C_NAME` varchar(25) NOT NULL,
  `C_ADDRESS` varchar(40) NOT NULL,
  `C_NATIONKEY` bigint NOT NULL,
  `C_PHONE` char(15) NOT NULL,
  `C_ACCTBAL` decimal(15,2) NOT NULL,
  `C_MKTSEGMENT` char(10) NOT NULL,
  `C_COMMENT` varchar(117) NOT NULL,
  PRIMARY KEY (`C_CUSTKEY`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

TiDB related tests are as follows:

mysql> select version();
+--------------------+
| version()          |
+--------------------+
| 5.7.25-TiDB-v6.6.0 |
+--------------------+
1 row in set (0.00 sec)

mysql> show config where name like 'rocksdb%block-size%';
+------+----------------------+------------------------------+-------+
| Type | Instance             | Name                         | Value |
+------+----------------------+------------------------------+-------+
| tikv | 192.168.31.201:20160 | rocksdb.defaultcf.block-size | 32KiB |
| tikv | 192.168.31.201:20160 | rocksdb.lockcf.block-size    | 16KiB |
| tikv | 192.168.31.201:20160 | rocksdb.raftcf.block-size    | 16KiB |
| tikv | 192.168.31.201:20160 | rocksdb.writecf.block-size   | 32KiB |
+------+----------------------+------------------------------+-------+
4 rows in set (0.01 sec)

-- Setting this to 1 makes it easy to time a single copTask; a single copTask consumes 100% of one CPU
mysql> set tidb_distsql_scan_concurrency=1;
Query OK, 0 rows affected (0.00 sec)

mysql> explain analyze select sum(C_NATIONKEY) from customer;
+----------------------------+------------+---------+-----------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------+-----------+------+
| id                         | estRows    | actRows | task      | access object  | execution info                                                                                                                                                                                                                                                                                          | operator info                                   | memory    | disk |
+----------------------------+------------+---------+-----------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------+-----------+------+
| StreamAgg_16               | 1.00       | 1       | root      |                | time:754.7ms, loops:2                                                                                                                                                                                                                                                                                   | funcs:sum(Column#11)->Column#9                  | 1.45 KB   | N/A  |
| └─TableReader_17           | 1.00       | 3       | root      |                | time:754.7ms, loops:2, cop_task: {num: 3, max: 306.3ms, min: 169ms, avg: 251.4ms, p95: 306.3ms, max_proc_keys: 603265, p95_proc_keys: 603265, tot_proc: 751ms, rpc_num: 3, rpc_time: 754.2ms, copr_cache: disabled, build_task_duration: 12.4µs, max_distsql_concurrency: 1}                            | data:StreamAgg_8                                | 442 Bytes | N/A  |
|   └─StreamAgg_8            | 1.00       | 3       | cop[tikv] |                | tikv_task:{proc max:305ms, min:168ms, avg: 250.3ms, p80:305ms, p95:305ms, iters:1466, tasks:3}, scan_detail: {total_process_keys: 1500000, total_process_keys_size: 305225771, total_keys: 1500003, get_snapshot_time: 76.4µs, rocksdb: {key_skipped_count: 1500000, block: {cache_hit_count: 10004}}}  | funcs:sum(tpch.customer.c_nationkey)->Column#11 | N/A       | N/A  |
|     └─TableFullScan_15     | 1500000.00 | 1500000 | cop[tikv] | table:customer | tikv_task:{proc max:246ms, min:132ms, avg: 204ms, p80:246ms, p95:246ms, iters:1466, tasks:3}                                                                                                                                                                                                            | keep order:false                                | N/A       | N/A  |
+----------------------------+------------+---------+-----------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------+-----------+------+
4 rows in set (0.76 sec)

From the execution plan, total_process_keys: 1500000, total_process_keys_size: 305225771, total_keys: 1500003 shows that the scanned data contains almost no not-yet-GC'd versions; everything read is valid data. From rocksdb: {key_skipped_count: 1500000, block: {cache_hit_count: 10004}} we can see that all reads are logical (in-memory) reads, 10004 block cache hits in total. The table scan operator is TableFullScan_15; it ran as 3 tasks averaging avg: 204ms each, so the table scan took about 612 milliseconds in total.

MySQL related tests are as follows:

mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.32    |
+-----------+
1 row in set (0.00 sec)

mysql> show variables like '%innodb_page_size%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_page_size | 16384 |
+------------------+-------+
1 row in set (0.00 sec)

mysql> show status like 'Innodb_data_reads';
+-------------------+---------+
| Variable_name     | Value   |
+-------------------+---------+
| Innodb_data_reads | 3710851 |
+-------------------+---------+
1 row in set (0.01 sec)

mysql> show status like 'Innodb_buffer_pool_read_requests';
+----------------------------------+-----------+
| Variable_name                    | Value     |
+----------------------------------+-----------+
| Innodb_buffer_pool_read_requests | 414272227 |
+----------------------------------+-----------+
1 row in set (0.00 sec)

mysql> explain  analyze select sum(C_NATIONKEY) from customer;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Aggregate: sum(customer.C_NATIONKEY)  (cost=234307.26 rows=1) (actual time=334.921..334.922 rows=1 loops=1)
    -> Table scan on customer  (cost=120530.06 rows=1137772) (actual time=0.097..237.381 rows=1500000 loops=1)
 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.33 sec)

mysql> show status like 'Innodb_data_reads';
+-------------------+---------+
| Variable_name     | Value   |
+-------------------+---------+
| Innodb_data_reads | 3710851 |
+-------------------+---------+
1 row in set (0.00 sec)

mysql> show status like 'Innodb_buffer_pool_read_requests';
+----------------------------------+-----------+
| Variable_name                    | Value     |
+----------------------------------+-----------+
| Innodb_buffer_pool_read_requests | 414310612 |
+----------------------------------+-----------+
1 row in set (0.00 sec)

The Innodb_data_reads counter is still 3710851 before and after the statement, so no physical reads occurred. From Innodb_buffer_pool_read_requests before and after the statement, the number of logical reads is 414310612 - 414272227 = **38385**. From the execution information, Table scan on customer (cost=120530.06 rows=1137772) (actual time=0.097..237.381 rows=1500000 loops=1), the actual table scan time is 237.381 milliseconds.

PostgreSQL related tests are as follows:

tpch=# select version();
                                                 version                                                 
---------------------------------------------------------------------------------------------------------
 PostgreSQL 15.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
(1 row)

tpch=# set max_parallel_workers_per_gather=0;
SET

tpch=# set track_io_timing=on;
SET

tpch=# SELECT current_setting('block_size');
 current_setting 
-----------------
 8192
(1 row)

tpch=# explain (analyze true,buffers true,verbose true,timing) select sum(C_NATIONKEY) from customer;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=56484.90..56484.91 rows=1 width=32) (actual time=234.525..234.526 rows=1 loops=1)
   Output: sum(c_nationkey)
   Buffers: shared hit=37734
   ->  Seq Scan on public.customer  (cost=0.00..52734.72 rows=1500072 width=8) (actual time=0.012..111.046 rows=1500000 loops=1)
         Output: c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
         Buffers: shared hit=37734
 Planning Time: 0.060 ms
 Execution Time: 234.556 ms
(8 rows)

From the execution information, all reads are logical reads, 37734 shared buffer hits in total. From the table scan line, Seq Scan on public.customer (cost=0.00..52734.72 rows=1500072 width=8) (actual time=0.012..111.046 rows=1500000 loops=1), the single-threaded table scan took 111.046 milliseconds.

When parallelism is enabled (e.g., TiDB's default tidb_distsql_scan_concurrency=15), the other databases also accelerate well, almost linearly given enough CPU, and some hybrid row-column databases do even better. A longer in-memory scan time means more CPU time, so more resources are needed for the same amount of processing. So why does a TiKV copTask spend so much more time on logical reads for the same data volume, and is there room for optimization in the future?
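
For reference, a minimal sketch of the session settings used above to force single-threaded scans, and of how to restore parallel scans afterwards (the PostgreSQL value of 2 is simply its default; adjust as needed):

-- TiDB: one copTask at a time for timing, then back to the default of 15
set tidb_distsql_scan_concurrency = 1;
set tidb_distsql_scan_concurrency = 15;

-- PostgreSQL: disable parallel workers for the Seq Scan, then allow them again
set max_parallel_workers_per_gather = 0;
set max_parallel_workers_per_gather = 2;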

| username: xfworld | Original post link

First, what are the background and configuration of the test?
Then, what are the objectives and conditions involved?
Next, what are your expectations?
Finally, what kind of answer do you want?

| username: h5n1 | Original post link

Highlight the key points.

| username: 人如其名 | Original post link

Big bro, this key point hits the nail on the head.

| username: tonyxuqqi | Original post link

@人如其名
“total_process_keys: 1500000, total_process_keys_size: 305225771, total_keys: 1500003. We can see that the scanned data contains almost no not-yet-GC'd versions; everything read is valid data. From rocksdb: {key_skipped_count: 1500000, block: {cache_hit_count: 10004}}”
The block cache hit count is only 10004, while there are 1500000 keys in total. Physical reads should have occurred here, right? If convenient, please upload the monitoring data to PingCAP Clinic (PingCAP Clinic Overview | PingCAP Docs) for further analysis.

| username: zhanggame1 | Original post link

I tested a similar statement; why is there no value under rocksdb: in my output?

| username: 人如其名 | Original post link

There should be a physical read happening here, right?

mysql> explain analyze select sum(L_EXTENDEDPRICE) from lineitem;
+----------------------------+------------+---------+-----------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+-----------+------+
| id                         | estRows    | actRows | task      | access object  | execution info                                                                                                                                                                                                                                                                                                                                                    | operator info                                        | memory    | disk |
+----------------------------+------------+---------+-----------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+-----------+------+
| HashAgg_11                 | 1.00       | 1       | root      |                | time:1.37s, loops:2, partial_worker:{wall_time:1.366558591s, concurrency:5, task_num:1, tot_wait:6.832190204s, tot_exec:31.755µs, tot_time:6.832224842s, max:1.366448056s, p95:1.366448056s}, final_worker:{wall_time:0s, concurrency:5, task_num:1, tot_wait:6.832286057s, tot_exec:16.792µs, tot_time:6.832305524s, max:1.366465061s, p95:1.366465061s}         | funcs:sum(Column#18)->Column#17                      | 20.2 KB   | N/A  |
| └─TableReader_12           | 1.00       | 13      | root      |                | time:1.37s, loops:2, cop_task: {num: 13, max: 1.37s, min: 708.7ms, avg: 998.8ms, p95: 1.37s, max_proc_keys: 467326, p95_proc_keys: 467326, tot_proc: 12.9s, tot_wait: 55ms, rpc_num: 13, rpc_time: 13s, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                                                      | data:HashAgg_5                                       | 436 Bytes | N/A  |
|   └─HashAgg_5              | 1.00       | 13      | cop[tikv] |                | tikv_task:{proc max:1.36s, min:687ms, avg: 985.2ms, p80:1.07s, p95:1.36s, iters:5869, tasks:13}, scan_detail: {total_process_keys: 6001215, total_process_keys_size: 1166860203, total_keys: 6001228, get_snapshot_time: 59.8ms, rocksdb: {key_skipped_count: 6001215, block: {cache_hit_count: 43, read_count: 18391, read_byte: 554.2 MB, read_time: 877.6ms}}} | funcs:sum(tpch1.lineitem.l_extendedprice)->Column#18 | N/A       | N/A  |
|     └─TableFullScan_10     | 6622334.00 | 6001215 | cop[tikv] | table:lineitem | tikv_task:{proc max:1.28s, min:630ms, avg: 924.2ms, p80:1.02s, p95:1.28s, iters:5869, tasks:13}                                                                                                                                                                                                                                                                   | keep order:false                                     | N/A       | N/A  |
+----------------------------+------------+---------+-----------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------+-----------+------+
4 rows in set (1.37 sec)

block: {cache_hit_count: 43, read_count: 18391, read_byte: 554.2 MB, read_time: 877.6ms}: this read_count indicates physical reads, right? The query in my original post shows only logical reads.
Moreover, this slowness is not an isolated case; it is common, so you can easily reproduce it yourself in any environment.
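
For anyone who wants to reproduce it, a minimal sketch of the steps used above on the TPC-H customer table; run the query twice so that the second run is served entirely from the block cache:

set tidb_distsql_scan_concurrency = 1;                   -- time a single copTask
explain analyze select sum(C_NATIONKEY) from customer;   -- warm-up run; may show read_count (physical reads) under rocksdb
explain analyze select sum(C_NATIONKEY) from customer;   -- second run should show only cache_hit_count, i.e. pure logical reads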

| username: 人如其名 | Original post link

| username: redgame | Original post link

Try configuring TiKV’s readwrite-splitting and region merge/split settings to optimize TiKV cluster performance.

| username: tonyxuqqi | Original post link

Qualitatively speaking, the read performance of LSM trees is somewhat worse than that of B-trees. There are some mitigations you can try:

  1. storage.block-cache.num-shard-bits, default 6. If the block cache is larger than 8 GB, you can increase num-shard-bits to log2(block cache capacity / 128 MB); a larger value improves block cache concurrency.
  2. rocksdb.writecf/defaultcf.num-levels - reducing the number of levels can help reads but increases write amplification, so it has to be balanced against the specific requirements and workload. (Current values can be inspected as in the sketch after this list.)
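
A minimal sketch of inspecting the current values of these two parameters, following the show config pattern used earlier in this thread (changing them is a TiKV configuration change and is not shown here):

show config where name like '%block-cache.num-shard-bits%';
show config where name like 'rocksdb%num-levels%';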

From a product perspective, there are some read optimizations that can be done for LSM trees, but these changes are still under exploration and cannot be implemented in the short term.
Could you provide some details about your hardware configuration, data size, and your specific requirements?

Additionally, have you enabled TiFlash? TiDB will automatically choose between TiKV or TiFlash, which can result in significant performance improvements in certain scenarios.
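
If TiFlash is an option, a minimal sketch of trying it on the test table (the schema name tpch comes from the plan above; the replica count of 1 is illustrative):

alter table tpch.customer set tiflash replica 1;
-- wait until the replica is available
select available, progress from information_schema.tiflash_replica
 where table_schema = 'tpch' and table_name = 'customer';
-- the optimizer may now choose TiFlash for this aggregation
explain analyze select sum(C_NATIONKEY) from tpch.customer;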

| username: 人如其名 | Original post link

Yes, that is acceptable. However, another domestic distributed database that is also based on an LSM tree has a much faster pure in-memory scan speed, several times to ten times faster than ours, which is hard to accept… So I still want to see exactly where it is slow. I installed TiDB and the other database in the same environment and tested them as described above. Personally, I feel it has nothing to do with configuration, since I reduced the test to single-threaded, pure in-memory operation, and both TiDB and the other database were doing single-node, pure full table scans.

Additionally, we will use TiFlash for key complex scenarios, but TiKV is the foundation, and not every system will necessarily use TiFlash, so I hope TiKV's data scanning can be optimized further. Right now I want to find out exactly where it is slow: for example, is RocksDB's next/seek iteration slow, is the get operation slow, is the MVCC visibility check inside the TiKV task slow, or is it something else?
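
As a first, partial look from the SQL side, TiDB's TRACE statement shows span timings for the request; it will not break down RocksDB iteration versus MVCC checks inside TiKV (that would need a TiKV CPU profile), but it is a quick way to see where the time goes between TiDB and the coprocessor. A minimal sketch:

set tidb_distsql_scan_concurrency = 1;
trace format='row' select sum(C_NATIONKEY) from customer;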

| username: zhanggame1 | Original post link

Is that "certain database" OceanBase (OB)?

| username: tonyxuqqi | Original post link

OB and TiKV use different engines. TiKV uses RocksDB, which currently does not have special optimizations for reads. We will also work on this optimization later.

| username: kuiper | Original post link

This block cache hit count refers to the number of blocks, not the number of keys. At the 32 KiB block size shown above, 10004 blocks covers a lot of data: 305225771 bytes / 10004 blocks ≈ 30 KB per block, which roughly matches the block size, so the whole table was indeed read from the block cache.