Questions about the number of scan operators (is tidb_distsql_scan_concurrency ineffective?)

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 关于scan算子个数的疑问(tidb_distsql_scan_concurrency 无效?)

| username: 海石花47

[TiDB Usage Environment] Production Environment
[TiDB Version] v6.1.1
[Reproduction Path] In the same TiDB environment with the same parameter settings, there is an SQL query where the actual scan operator is just 2, and it immediately goes OOM and then kills the SQL???
[Resource Configuration]
Memory Settings:
Single SQL memory limit: tidb_mem_quota_query is the default 1G
OOM action: tidb_mem_oom_action is the default CANCEL
Concurrency Settings:
set session tidb_distsql_scan_concurrency = 15; all are default values
SET session tidb_executor_concurrency = 5;
set session tidb_index_serial_scan_concurrency = 1;

[Encountered Problem: Problem Phenomenon and Impact]
Under normal circumstances, a single SQL:
SELECT * FROM ods.tp_pay_order LIMIT 3000000;
The result is like this: single memory exceeds the limit, then 15 operators are closed one by one, until the last operator, still exceeds, then kills the SQL.

However, another SQL:
image
The result is like this: single memory exceeds the limit, then 2 operators are closed instantly, until the last operator, still exceeds, then kills (at the time of the screenshot, I had already changed the OOM action to log, otherwise it would affect the business)

My question is, with the same parameter configuration, why does the second SQL behave like this, with only 2 operators??

Welcome experts to discuss~

| username: 人如其名 | Original post link

This sentence is not very clear, does it mean only 2 operators or 2 tokens?

Can it be understood that your question is:
Why not cancel the statement when the remaining token count is the highest (after all, memory usage might be the highest at this time), but cancel the statement when there is only one token left?
If this is the question, then my understanding is as follows:
First of all, the oomAction behavior has priorities: cancel, log, rateLimit in increasing order. In your setting with both tidb_mem_oom_action=CANCEL and tidb_enable_rate_limit_action=ON, the higher priority behavior will be triggered first, which is the rateLimit behavior. This behavior is for flow control of distsql (a distsql only acts on one table). When the remaining token count is only 1, it will fallback to the next priority oomAction, which is cancel in your setting. Therefore, if the memory usage of the entire statement is still larger than tidb_mem_quota_query when the remaining token limit is 1, the current statement will be canceled. Although there is only 1 token left at this time, meaning the concurrent execution number of copTask is only 1, if the previously obtained copTask has not been consumed, the memory will not be released, and the memory usage will exceed the tidb_mem_quota_query setting value.

Regarding why the second SQL behaves this way, my understanding is:
Under normal circumstances (without large regions), with chunk_rpc enabled, the size of a region on the TiDB side is about 300MB. If there are only 2 copTask responses, it usually will not cause the statement to OOM, but your SQL statement involves a join, so it depends on the behavior of the join, which might consume more memory. Also, the total token count here is only for one table, the other table is not reflected. It is best to check the explain analyze to see how the memory is consumed.

Additionally, in actual execution, the concurrent scanning of tables does not necessarily rely entirely on the tidb_distsql_scan_concurrency parameter. This parameter value is the maximum concurrent scan number for table scans under general circumstances. If the number of regions to be scanned is less than the value set by tidb_distsql_scan_concurrency, then this many will not be used. Also, for ordered data requests, the scan concurrency will not be this high (fixed at 2 to avoid TiDB side caching too much data waiting for ordered execution, which increases the probability of OOM).

| username: 海石花47 | Original post link

Thank you for answering my question again, expert~
Actually, my question is, why is the total token count of the second SQL equal to 2, instead of the default 15 (tidb_distsql_scan_concurrency)? In my understanding, these two things are the same thing?

| username: 人如其名 | Original post link

The total token count = 2 indicates that your request involves 2 copTask requests.
Here are a few concepts related to distsql concurrent scanning as I understand them:

  1. tidb_distsql_scan_concurrency represents the maximum scan concurrency (simply understood as the number of concurrent scans on regions). If the table does not have that many regions (or the number of regions requested to be scanned is less than tidb_distsql_scan_concurrency), then the maximum scan concurrency will be as many as the number of requests. Suppose the actual concurrency is: real_tidb_distsql_scan_concurrency (used later).
  2. Token limit is the main method used for flow control. The total token count is the number of tokens initialized in the pool at the beginning (equal to real_tidb_distsql_scan_concurrency). The remain token count is the number of tokens currently available (initially equal to the total token count). When flow control is triggered, the remain token count will be destroyed one by one until the statement memory is less than tidb_mem_quota_query, with a minimum of 1 remaining. However, oomAction will fallback to the next priority oomAction.
  3. The actual execution information of the statement will include copTask requests. The copTask requests here are the results counted after the tikv response is returned and applied by tidb. For example: select * from table limit 1, assuming the table is very large, the number of copTasks seen in the execution information will only be 1. However, the actual copTask requests issued will be as many as tidb_distsql_scan_concurrency (this should be reflected in the logs), which means that in some scenarios, tikv resources are slightly wasted.
| username: 海石花47 | Original post link

I seem to understand the first two.

I checked, and the table corresponding to my first SQL has thousands of regions, so the real concurrency is the default 15.

The second SQL involves two tables, Table A has 6 regions, and Table B has only 2 regions, so is the real maximum parallelism calculated according to the smallest one?

| username: 海石花47 | Original post link

This seems a bit problematic and has me a bit confused. I just tested it,

explain analyze SELECT * FROM ods.tp_pay_order LIMIT 1000000,

and saw the result: “cop_task: {num: 7”, but in the logs: “total token count”=15.

It looks like copTask and total token count are not equivalent?

| username: 人如其名 | Original post link

Parallelism is at the distsql level, and at the distsql level, it is a single table request. That is, each table has a distsql request, and the concurrency of a distsql request is determined by tidb_distsql_scan_concurrency. The total token=2 here is for a single table, either table A or table B.

| username: 人如其名 | Original post link

Refer to this.

Your cop_task=7 indicates that TiDB received the 7th response just in time to meet the limit 1000000 condition. After the limit operator meets the requirement, it closes subsequent actions.

| username: 海石花47 | Original post link

Where can I learn about this structure and stuff?

| username: 人如其名 | Original post link

The official blog guides you to take a look at the source code.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.