Optimizing User Profiling Queries in TiDB: Handling Tens of Millions to Billions of Tag and User Data with Join Queries in 20-30 Seconds. Is There a Better Solution Than Bitmap Storage?

translator_bot · June 22, 2024, 9:43am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 当前业务场景是：用户画像走的是关联表，标签数据和用户数据关联查询。标签和用户数据的规模为：千万级、亿级，目前关联查询时间约 20-30秒，之前调研了bitmap 存储方案。这种场景基于tidb有没有更好方案？

| username: xiaoxiaozuofang

【TiDB Usage Environment】Production Environment
【TiDB Version】tidb v6.1.0
【Reproduction Path】Operations performed that led to the issue
【Encountered Issue: Slow Join Query】
【Resource Configuration】5 KV nodes (8 cores, 32GB) 3 TiFlash nodes (16 cores, 64GB)
【Attachments: Screenshots/Logs/Monitoring】

translator_bot · June 22, 2024, 9:43am

| username: realcp1018 | Original post link

For tens of millions of users and billions of profile tags, with each user having an average of dozens of tags, even if a single user has hundreds of tags, querying for a single user should not take 20-30 seconds. So it’s very strange; this issue is not about which database is used, but rather how the previous setup could be so slow.

translator_bot · June 22, 2024, 9:43am

| username: Kongdom | Original post link

We are currently facing similar confusion. However, it is mainly concentrated on tag computation. User profiling is generally fine with point queries, but it is relatively slow when doing mining and recommendations.

translator_bot · June 22, 2024, 9:43am

| username: TiDBer_pkQ5q1l0 | Original post link

We used to use Redis, but it really consumes a lot of machine resources.

translator_bot · June 22, 2024, 9:43am

| username: cy6301567 | Original post link

We are currently using Alibaba Cloud ADB, which is fine but too expensive. We are now also researching whether TiDB’s TiFlash supports OLAP capabilities for large amounts of data.

translator_bot · June 22, 2024, 9:43am

| username: xfworld | Original post link

TiFlash supports MPP, you can try a POC.

translator_bot · June 22, 2024, 9:43am

| username: ljluestc | Original post link

Cache: Use caching mechanisms, such as Redis, to cache the results of popular queries, thereby avoiding frequent database queries.

Use TiFlash: TiFlash is a columnar storage database in the TiDB ecosystem, specifically designed for OLAP queries.

translator_bot · June 22, 2024, 9:43am

| username: redgame | Original post link

Continue using Redis.

translator_bot · June 22, 2024, 9:43am

| username: Kongdom | Original post link

Redis can’t store such a large amount of data, right?

translator_bot · June 22, 2024, 9:43am

| username: cassblanca | Original post link

Check out TiDB Serverless architecture in the cloud, with Pay-As-You-Go pricing.

translator_bot · June 22, 2024, 9:43am

| username: Billmay表妹 | Original post link

You can search for Ctrip’s practice: