TiFlash SQL Optimization Issues

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIFLASH sql优化问题

| username: jaybing926

[TiDB Usage Environment] Production Environment / Testing / Poc
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Problem Phenomenon and Impact]
We newly deployed 2 TiFlash nodes and added TiFlash replicas for some tables for testing. We found that the SQL query for a large table has a similar speed to when it was using TiKV (or even slower than TiKV). We are unsure if it is an issue with the table data or SQL optimization (because the data for this table was backed up from the old TiDB using BR, with a data volume of about 6 billion).

Most of our SQLs are similar to this:
select * from t1 where t1.a in (v1,v2,v3,v4,v5.......)

Example:
SELECT COUNT('*') AS `__count` FROM `xpost`  WHERE (`xpost`.`entryid` IN (4852480, 4852097, 4849282, 4855686, 4849799, 4849160, 4853516, 4852797, 4851728, 4850328, 4851609, 4849052, 4848542, 4854815, 4848160, 4855201, 4855076, 4853029, 4855089, 4851496, 4848681, 4852401, 4849330, 4856371, 4852190, 4851514, 4854986, 4851392, 4852384, 4851522, 4855748, 4851526, 4852306, 4855624, 4855626, 4848589, 4852045, 4848849, 4848594, 4855635, 4852308, 4855766, 4849495, 4855408, 4855134, 4848864, 4850320, 4852305, 4852841, 4857706, 4849261, 4857454, 4852079, 4851696, 4852392, 4846035, 4851956, 4852350, 4848247, 4849274, 4849403, 4848253, 4856062) AND (1) AND `xpost`.`is_comment` = 1 AND `xpost`.`hidden` IN (-2, -1, 0, 2, 3, 4) AND `xpost`.`sourcetype` IN (18) AND ((lower(domain) like '%抖音%')));

I saw in the documentation that such SELECT IN (SELECT) can be changed to select join. How can I optimize this kind of fixed IN clause?
Also, our SQLs do not use TiFlash by default. By using manual HINT, /*+ read_from_storage(tiflash[table_name]) */, we can make it use TiFlash.

| username: zhanggame1 | Original post link

TiFlash is a columnar storage, so it would be significantly faster if the condition you are querying is not on the index, right?

| username: tidb狂热爱好者 | Original post link

I see a big problem with your SQL. Why are you using a function on the left side?

| username: tidb菜鸟一只 | Original post link

If your SQL query conditions have an index on TiKV and the number of filtered rows is not particularly large, it is definitely better to use the TiKV index. Using TiFlash for performance optimization is generally not as effective.

| username: jaybing926 | Original post link

Can you directly help me optimize the SQL example I posted? It would be best if you could send me the optimized SQL directly, as I don’t quite understand it. :rofl: :rofl:

| username: tidb菜鸟一只 | Original post link

There’s no need for optimization. If your SQL queries are like this and the condition fields all have indexes, then there’s no need to use TiFlash. It’s already optimized as it is now.

| username: jaybing926 | Original post link

Okay.

| username: jaybing926 | Original post link

My table creation statement:

CREATE TABLE `xpost` (
  `postid` bigint(20) NOT NULL AUTO_INCREMENT,
  `facetid` int(10) NOT NULL,
  `entryid` int(10) NOT NULL,
  `title` varchar(255) NOT NULL DEFAULT '',
  `url` varchar(512) NOT NULL,
  `abstract` text DEFAULT NULL COMMENT 'Summary',
  `click` int(11) DEFAULT '0' COMMENT 'Clicks/Reads',
  `reply` int(11) DEFAULT '0' COMMENT 'Replies/Comments',
  `repost` int(11) DEFAULT '0' COMMENT 'Reposts',
  `praise` int(11) DEFAULT '0' COMMENT 'Likes',
  `collect` int(11) DEFAULT '0' COMMENT 'Favorites',
  `watch` int(11) DEFAULT NULL,
  `wordscount` int(11) DEFAULT '0' COMMENT 'Word Count',
  `keywordcount` int(11) DEFAULT NULL,
  `siteid` int(11) DEFAULT '0',
  `domain` varchar(60) DEFAULT '',
  `author` varchar(60) DEFAULT '',
  `author_id` varchar(60) DEFAULT '' COMMENT 'Author ID',
  `posttime` datetime NOT NULL,
  `include_t` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Storage Time',
  `type` tinyint(4) DEFAULT '0' COMMENT 'positive/negative/neutral',
  `source` int(11) DEFAULT '0',
  `hidden` tinyint(4) NOT NULL DEFAULT '0' COMMENT '-3: Identified Noise, -2: Restored Deleted Data, -1: Processed, 0: Unprocessed, 1: Deleted, 2: Main Content Noise, 3: Ambiguous Noise, 4: List Page Noise',
  `sourcetype` tinyint(4) NOT NULL,
  `crisis_post` tinyint(4) NOT NULL DEFAULT '0',
  `ontop` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Pinned, 0: No, 1: Yes',
  `type_rank` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Positive/Negative Credibility',
  `noise_rank` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Noise Credibility',
  `device` varchar(60) DEFAULT '' COMMENT 'Publishing Device',
  `is_origin` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Original Content, 0: Unknown, 1: Original, 2: Repost',
  `is_top` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Headline, 0: Unknown, 1: Yes, 2: No',
  `media_type` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Media Type, 0: Unknown, 1: Traditional Media, 2: Self-Media',
  `author_type` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Author Type, 0: Unknown, 1: KOL, 2: Official Account, 3: Water Army',
  `content_type` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Content Type, 0: Unknown, 1: PGC (Marketing Volume), 2: UGC (Natural Volume)',
  `client_type` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Website Source, 0: Unknown, 1: Web, 2: Mobile',
  `industry` varchar(60) DEFAULT '' COMMENT 'Industry',
  `tags` text DEFAULT NULL,
  `post_type` tinyint(4) DEFAULT '0' COMMENT 'Content Type, 0: Unknown, 1: PGC, 2: UGC',
  `type_reason` varchar(256) DEFAULT '' COMMENT 'Reason for Positive/Negative Rule Hit',
  `update_time` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Update Time',
  `origin_source` varchar(64) DEFAULT '' COMMENT 'Repost Source',
  `media_id` int(11) DEFAULT '0' COMMENT 'Media Library ID',
  `w_level` tinyint(1) NOT NULL DEFAULT '0' COMMENT 'Warning Level',
  `sid` bigint(20) DEFAULT NULL COMMENT 'Crawler Unique Identifier',
  `location` varchar(255) DEFAULT '' COMMENT 'Region Mentioned in Article',
  `is_comment` tinyint(4) DEFAULT '0',
  `pos_type_rank` int(11) NOT NULL DEFAULT '0' COMMENT 'Positive Credibility',
  `text` text DEFAULT NULL,
  `spider_time` datetime DEFAULT NULL COMMENT 'Crawl Time',
  `process_time` datetime DEFAULT NULL COMMENT 'Processing Start Time for upload_to_community',
  `tidb_in_time` datetime DEFAULT CURRENT_TIMESTAMP COMMENT 'Actual Time Uploaded to TiDB, Different from include_t, which is the final time in the code',
  PRIMARY KEY (`postid`) /*T![clustered_index] CLUSTERED */,
  UNIQUE KEY `idx_fid_url` (`facetid`,`url`),
  UNIQUE KEY `postid` (`postid`),
  KEY `idx_xpost_url` (`url`(255)),
  KEY `idx_xpost_entryid` (`entryid`),
  KEY `idx_xpost_type` (`type`),
  KEY `idx_xpost_sourcetype` (`sourcetype`),
  KEY `idx_xpost_pt` (`posttime`),
  KEY `idx_xpost_source` (`source`),
  KEY `idx_fid` (`facetid`),
  KEY `idx_xpost_hidden` (`hidden`)
)
| username: Hacker007 | Original post link

TiFlash will automatically optimize; it will execute statements that run faster on TiFlash itself, otherwise, it will hand them over to TiKV.

| username: jaybing926 | Original post link

The expert means: Whether using TiKV or TiFlash, there’s no need to optimize SQL, TiDB will automatically optimize it, right?

| username: Hacker007 | Original post link

TiDB does not optimize SQL; it decides whether to execute in TiFlash or TiKV based on different SQL statements.

| username: redgame | Original post link

Not optimized well.

| username: jaybing926 | Original post link

So the conclusion is: my SQL became slower when using TiFlash. Does that mean my SQL is not suitable for TiFlash? Is there any way to make it faster with TiFlash?

| username: kkpeter | Original post link

Post the execution plan for us to take a look.