Optimization of TiDB Query Speed for Count(*) Statements with Over 100 Million Records

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb数据量count(*)查询语句过亿数据查询速度优化

| username: 艾维iii

Optimization of count(*) query speed for TiDB with over a billion data records

With more than 400 tables and a total data volume exceeding tens of billions, and the need to update newly added data daily, the count(*) query speed is too slow.

| username: GreenGuan | Original post link

If resources allow, you can deploy TiFlash.

| username: tidb菜鸟一只 | Original post link

This is obviously within the scope of TiFlash… Deploy TiFlash on a separate machine, and it will take off immediately.

| username: Raymond | Original post link

Try TiFlash, it’s lightning fast.

| username: forever | Original post link

Let’s get TiFlash started.

| username: DBAER | Original post link

TiFlash is awesome.

| username: onlyacat | Original post link

Take a look at the explain execution plan. I remember there was an issue with count in the lower versions.

| username: zhaokede | Original post link

Is there an execution plan?

| username: miya | Original post link

This issue can be addressed by first describing the detailed usage scenario. For example: whether it is updating from another database or updating within the same database. Specific problems require specific solutions.
For instance: if updating from another database, you can establish a database synchronization link, and TiDB has many tools to support this. If it is a batch update, you can also check the TiDB documentation, as TiDB supports batch updates for large volumes of data.

| username: zhanggame1 | Original post link

Is there a unique index? The data row is long, you can try it.

| username: 像风一样的男子 | Original post link

You can get the total length from the INFORMATION_SCHEMA.TABLES table, but this is not real-time. This value is accurate only after a full table analysis.

select TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS from INFORMATION_SCHEMA.`TABLES` where table_name='xxxxx'
| username: oceanzhang | Original post link

TiFlash is just a bit more disk-consuming.

| username: 艾维iii | Original post link

Yes, that’s not accurate, it can’t be used.

| username: 艾维iii | Original post link

How can I check if there is an execution plan?

| username: 艾维iii | Original post link

There are always different users pushing data into the cluster, and the total amount of data and data growth need to be counted every day.

| username: 像风一样的男子 | Original post link

Another cost-effective method is to record the largest ID each day. When calculating incremental data, use select count(*) from xxx where id > xxxxx.

| username: 艾维iii | Original post link

Incremental fields are varied and this is a bit troublesome.

| username: 艾维iii | Original post link

Does deploying TiFlash require restarting the cluster?

| username: 像风一样的男子 | Original post link

Does your table have a primary key id? This is auto-incremented and will not be duplicated.

| username: TiDBer_JUi6UvZm | Original post link

Try running it with EXPLAIN ANALYZE, and then post the results so we can take a look.