After upgrading TiDB to 7.5.1, the execution plan prioritizes TiKV over TiFlash, causing TiKV CPU to be fully utilized and crash

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb升级7.5.1之后执行计划优先走tikv而不是tiflash 导致tikvCPU打满崩溃

| username: 何明_亿玛

[TiDB Usage Environment] Production Environment
[TiDB Version]
7.5.1
[Reproduction Path] What operations were performed when the issue occurred
There were no issues when tikv and tiflash were upgraded to 7.5.1 while tidb was still at 6.1.5 (some at 6.1.7).
After upgrading tidb to 7.5.1 at night, tikv pressure increased sharply.

[Encountered Issue: Problem Phenomenon and Impact]
Abnormal statistics for a large number of tables, with many showing 0 in show stats_healthy. Some still show 0 even after analyze.

[Resource Configuration] Enter TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachment: Screenshot/Log/Monitoring]
Load changes for tikv and tiflash after upgrading tidb around 22:30 on April 22

| username: 何明_亿玛 | Original post link

On April 25th, there were frequent occurrences of TiKV using over 2000 CPUs. Around 18:00, there was an instance where the CPU was fully utilized for 16 minutes. In the slow query log, a query SQL was found to have executed for 16 minutes. The execution plan showed that TiKV was used instead of TiFlash.

Many execution plans default to using TiKV. After setting tidb_enforce_mpp=ON, TiFlash can be used. However, this parameter can only be set at the session level and not globally.

| username: h5n1 | Original post link

It’s better to first recollect and gather statistics for those that don’t use TiFlash.

| username: tidb狂热爱好者 | Original post link

First, disconnect the business and collect the statistics. Continuous querying won’t even allow the collection of statistics.

| username: DBAER | Original post link

Mark it and check the solution approach.

| username: mono | Original post link

Force those slow queries that take more than 10 minutes to use TiFlash first. For major version upgrades, make sure to perform thorough validation and testing beforehand!

| username: mono | Original post link

Based on years of production practice experience, try not to use the latest version in the production environment. There are fewer people encountering issues in actual applications. :grinning:

| username: zhang_2023 | Original post link

What is the mechanism of automatic coordination?

| username: h5n1 | Original post link

Check the value of the tidb_allow_mpp variable, and upload the SQL that doesn’t use flash, the table structure, and the execution plan.

| username: zhanggame1 | Original post link

First, perform table analysis and take a look.

| username: foxchan | Original post link

The image is not available for translation. Please provide the text content directly.

| username: foxchan | Original post link

Attempted operations:

  1. Drop stats and rerun the execution plan, health is 99
  2. Re-add and delete TiFlash replicas
  3. Global tidb_allow_mpp=on is enabled

Take this SQL as an example:

EXPLAIN
SELECT
a.account_id,
a.campaign_id,
a.ad_id,
IFNULL(b.amount, a.amount) AS amount
FROM
(
SELECT
account_id,
campaign_id,
adgroup_id AS ad_id,
0 AS amount
FROM
Synads_Gdt_adgroup
WHERE
account_id IN (
SELECT
DISTINCT platform_account_id
FROM
mbg_core.emarbox_project ep
INNER JOIN mbg_core.yxt_finance_customer_entity yfce ON ep.pig_advertiser_id = yfce.id
WHERE
ep.media_id = 2
AND yfce.customer_id = 2250
AND (
(
ep.begin_date IS NULL
AND ep.end_date IS NULL
)
OR (
ep.begin_date IS NULL
AND ep.end_date >= CURDATE()
)
OR (
ep.begin_date <= CURDATE()
AND ep.end_date IS NULL
)
OR (
ep.begin_date <= CURDATE()
AND ep.end_date >= CURDATE()
)
)
)
AND is_deleted = 0
) a
LEFT JOIN (
SELECT
account_id,
campaign_id,
ad_id,
ROUND(IFNULL(SUM(amount), 0), 2) AS amount
FROM
(
SELECT
account_id,
campaign_id,
adgroup_id AS ad_id,
IFNULL(cost, 0) / 100 AS amount
FROM
synrpt_gdt_adgroup_daily_reporting_part
WHERE
DATE >= DATE_FORMAT(DATE_SUB(CURDATE(), INTERVAL 0 DAY), ‘%Y-%m-%d’)
AND DATE < DATE_FORMAT(DATE_ADD(CURDATE(), INTERVAL 1 DAY), ‘%Y-%m-%d’)
AND account_id IN (
SELECT
DISTINCT platform_account_id
FROM
mbg_core.emarbox_project ep
INNER JOIN mbg_core.yxt_finance_customer_entity yfce ON ep.pig_advertiser_id = yfce.id
WHERE
ep.media_id = 2
AND yfce.customer_id = 2250
AND (
(
ep.begin_date IS NULL
AND ep.end_date IS NULL
)
OR (
ep.begin_date IS NULL
AND ep.end_date >= CURDATE()
)
OR (
ep.begin_date <= CURDATE()
AND ep.end_date IS NULL
)
OR (
ep.begin_date <= CURDATE()
AND ep.end_date >= CURDATE()
)
)
)
) d
GROUP BY
ad_id,
account_id
) b ON a.account_id = b.account_id
AND a.campaign_id = b.campaign_id
AND a.ad_id = b.ad_id
WHERE
IFNULL(b.amount, a.amount) > 1000;

synads_gdt_adgroup health
image

Number of TiFlash replicas
image

Table structure
CREATE TABLE synads_gdt_adgroup (
adgroup_id bigint(60) NOT NULL COMMENT ‘Ad group id’,
adgroup_name varchar(255) DEFAULT NULL COMMENT ‘Ad group name’,
account_id bigint(60) NOT NULL COMMENT ‘Advertiser account id’,
campaign_id bigint(60) DEFAULT NULL COMMENT ‘Campaign id’,
site_set varchar(1024) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Placement set’,
automatic_site_enabled tinyint(1) DEFAULT NULL COMMENT ‘Is automatic placement enabled’,
optimization_goal varchar(100) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Ad optimization goal type’,
billing_event varchar(100) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Billing type’,
bid_amount int(11) DEFAULT NULL COMMENT ‘Bid amount, in cents’,
daily_budget bigint(20) DEFAULT NULL COMMENT ‘Daily budget of the ad group, in cents, set to 0 means no budget’,
promoted_object_type varchar(500) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Promotion target type’,
promoted_object_id varchar(512) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Promotion target id’,
app_android_channel_package_id varchar(100) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Android app channel package id’,
mini_game_program_id varchar(500) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘WeChat mini-game id’,
targeting_id bigint(100) DEFAULT NULL COMMENT ‘Targeting id, only valid for non-circle of friends ads’,
targeting varchar(16384) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Detailed targeting settings, storing all targeting conditions’,
targeting_translation text CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Description of selected targeting conditions’,
is_include_unsupported_targeting tinyint(1) DEFAULT NULL COMMENT ‘Whether to include unsupported targeting’,
scene_spec varchar(2048) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Scene targeting’,
begin_date varchar(25) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Start date of delivery, date format: YYYY-MM-DD, and the date is less than or equal to end_date’,
first_day_begin_time varchar(25) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Start time of delivery on the first day: HH:ii:ss’,
end_date varchar(25) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘End date of delivery, date format: YYYY-MM-DD, greater than or equal to today, and greater than or equal to begin_date; if you want to deliver for a long time, pass an empty string’,
time_series varchar(1024) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Delivery time period’,
configured_status int(5) NOT NULL COMMENT ‘Customer set status 1 represents AD_STATUS_NORMAL 0 represents AD_STATUS_SUSPEND’,
customized_category varchar(100) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Custom category’,
created_time bigint(20) NOT NULL DEFAULT ‘1585670400’ COMMENT ‘Creation time (timestamp)’,
last_modified_time bigint(20) DEFAULT NULL COMMENT ‘Last modification time (timestamp)’,
ad_count bigint(20) DEFAULT NULL COMMENT ‘Number of ads, number of creatives query’,
dynamic_ad_spec varchar(2048) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Dynamic product ad attributes’,
user_action_sets varchar(2048) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘User behavior data source’,
additional_user_action_sets varchar(2048) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Supplementary app user behavior data source, field to be deprecated’,
is_deleted tinyint(1) NOT NULL COMMENT ‘Is deleted, true: yes, false: no’,
dynamic_creative_id bigint(20) DEFAULT NULL COMMENT ‘DC dynamic creative id’,
is_rewarded_video_ad tinyint(1) DEFAULT NULL COMMENT ‘Is it a rewarded video ad’,
cost_guarantee_message varchar(1024) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Cost guarantee status information description’,
cost_guarantee_status varchar(1024) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Cost guarantee status, cost guarantee status definition’,
cold_start_audience varchar(1024) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Cold start seed audience’,
auto_audience tinyint(1) DEFAULT NULL COMMENT ‘Is system optimization used’,
expand_enabled tinyint(1) DEFAULT NULL COMMENT ‘Is automatic expansion used’,
expand_targeting varchar(100) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL COMMENT ‘Expansion cannot break through targeting’,
deep_conversion_spec varchar(2048) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘oCPC/oCPM deep optimization content’,
deep_optimization_action_type varchar(2048) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘CPC/oCPM deep optimization method configuration’,
poi_list varchar(2048) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Store id list’,
conversion_id bigint(20) DEFAULT NULL COMMENT ‘Conversion id’,
deep_conversion_behavior_bid bigint(20) DEFAULT NULL COMMENT ‘Bid for deep optimization behavior’,
deep_conversion_worth_rate float DEFAULT NULL COMMENT ‘Expected ROI of deep optimization value’,
android_channel_package_audit_message varchar(100) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Ad package channel package audit message’,
system_status varchar(100) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Ad group status in the system’,
bid_mode varchar(100) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Bidding method’,
status varchar(100) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Ad status’,
bid_adjustment text CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Bid coefficient setting content’,
auto_acquisition_enabled tinyint(1) DEFAULT NULL COMMENT ‘Is one-click volume enabled’,
auto_acquisition_budget int(11) DEFAULT NULL COMMENT ‘One-click volume exploration budget, in cents, once set, cannot be modified’,
creative_display_type varchar(1024) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Creative display type’,
auto_derived_creative_enabled tinyint(1) DEFAULT NULL COMMENT ‘Is automatic derived video creative enabled’,
smart_bid_type varchar(100) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Bid type’,
smart_cost_cap int(11) DEFAULT NULL COMMENT ‘Under automatic bidding’,
marketing_scene varchar(256) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL COMMENT ‘Marketing scene’,
synch_time varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL COMMENT ‘Format: yyyy-mm-dd HH:MM:SS, time when synchronization is called’,
synch_create_date varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL COMMENT ‘Format: yyyy-mm-dd’,
synch_create_hour int(11) NOT NULL COMMENT ‘Hour value range: 0~23’,
PRIMARY KEY (adgroup_id,configured_status) /*T![clustered_index] NONCLUSTERED */,
KEY idx_created_time (created_time),
KEY account_id_status (account_id,status),
KEY unq_agid (adgroup_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin /*T! SHARD_ROW_ID_BITS=4 PRE_SPLIT_REGIONS=2 */ COMMENT=‘Sync Guangdiantong to get ad group data’
PARTITION BY LIST (configured_status)
(PARTITION p0 VALUES IN (0),
PARTITION p1 VALUES IN (1))

Execution plan
image

| username: h5n1 | Original post link

Do all tables have flash replicas?

| username: foxchan | Original post link

There are replicas, and after reanalyzing all the tables and re-adding TiFlash, it still always uses TiKV.

| username: mono | Original post link

Have you tried using TiFlash? Let’s see what happens. Could it be that the optimizer determines that using TiKV has a lower overhead?

| username: foxchan | Original post link

Hint will use TiFlash. The problem now is that there are hundreds of different WHERE clauses in SQL, and I can’t hint each one individually.

| username: tidb狂热爱好者 | Original post link

I personally feel that it is caused by incorrect parameter configuration.

| username: TiDBer_JUi6UvZm | Original post link

Pay attention, currently evaluating the possibility of upgrading to version 7.

| username: shigp_TIDBER | Original post link

After the upgrade, it is best to update the statistics.

| username: h5n1 | Original post link

  1. Accurate statistics, why is reading from TiFlash faster, but the optimizer chose TiKV?
    Currently, the cost model distinguishing TiFlash and TiKV is still relatively rough. You can try reducing the value of tidb_opt_seek_factor to make the optimizer more inclined to choose TiFlash.

Try this, you can test it at the session level first.

tidb_opt_seek_factor

  • Scope: SESSION | GLOBAL
  • Persisted to cluster: Yes
  • Controlled by Hint SET_VAR: Yes
  • Type: Float
  • Range: [0, 2147483647]
  • Default value: 20
  • Represents the initial cost for TiDB to request data from TiKV. This variable is used internally by the cost model and it is not recommended to modify its value.