Is TiDB Suitable for Data Warehousing?

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB是否适合做数据仓库

| username: TiDBer_oeMANMf2

Is TiDB suitable for use as a data warehouse? If used as a data warehouse, what ETL tools are generally used for data layering, and what scheduling tools are used?

| username: zhouzeru

TiDB can serve as the storage engine for a data warehouse. It supports online distributed transactions and SQL queries, and offers good scalability and high availability. It can also integrate with common data warehouse tools and ecosystems such as Apache Spark, Apache Flink, and Presto.

For data layering in a data warehouse, time-based partitioning is commonly used: data is divided into partitions by generation time (for example by day, week, or month). This makes time-range queries and bulk deletion of old data easier.
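
As a rough illustration of the time-based partitioning described above, the sketch below builds MySQL-style `RANGE COLUMNS` DDL with one partition per month. The table name, column names, and helper functions are hypothetical, and it assumes your TiDB version accepts this partitioning syntax; check the TiDB partitioning documentation before relying on it.

```python
from datetime import date


def month_starts(first: date, count: int) -> list[date]:
    """Return the first day of `count` consecutive months, starting at `first`'s month."""
    starts = []
    year, month = first.year, first.month
    for _ in range(count):
        starts.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return starts


def monthly_partition_ddl(table: str, column: str, first: date, months: int) -> str:
    """Build a CREATE TABLE statement with one range partition per month.

    The two-column schema is a placeholder (assumption), not a recommended layout.
    """
    # One extra boundary is needed to close the last partition.
    bounds = month_starts(first, months + 1)
    parts = [
        f"  PARTITION p{lo:%Y%m} VALUES LESS THAN ('{hi.isoformat()}')"
        for lo, hi in zip(bounds, bounds[1:])
    ]
    return (
        f"CREATE TABLE {table} (\n"
        f"  event_id BIGINT NOT NULL,\n"
        f"  {column} DATE NOT NULL\n"
        f")\n"
        f"PARTITION BY RANGE COLUMNS ({column}) (\n"
        + ",\n".join(parts)
        + "\n);"
    )


def drop_partition_ddl(table: str, month_start: date) -> str:
    """Build the statement that removes one month of data by dropping its partition."""
    return f"ALTER TABLE {table} DROP PARTITION p{month_start:%Y%m};"


if __name__ == "__main__":
    print(monthly_partition_ddl("events", "event_date", date(2023, 11, 1), 3))
    print(drop_partition_ddl("events", date(2023, 11, 1)))
```

Dropping a whole partition is what makes deleting old data cheap in this layout, compared with deleting the same rows one by one.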

The choice of ETL and scheduling tools depends on your requirements and scenario. TiDB works with various ETL tools for data import and export, such as Apache Spark, Apache Flink, and Kettle. Common scheduling tools include Apache Airflow, Azkaban, and Oozie. Some custom development may be needed to fit your specific needs.
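
To make the role of a scheduling tool more concrete, here is a minimal sketch of the core job such tools perform: running ETL steps in dependency order. It uses only Python's standard `graphlib` module rather than Airflow, Azkaban, or Oozie, and the task names and layer prefixes (ODS, DWD, DWS, ADS) are illustrative assumptions, not something prescribed in this thread.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# Task names are hypothetical examples of a layered warehouse pipeline.
dependencies = {
    "load_ods_orders": set(),
    "build_dwd_orders": {"load_ods_orders"},
    "build_dws_daily_sales": {"build_dwd_orders"},
    "build_ads_sales_report": {"build_dws_daily_sales"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order where every dependency runs first."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for task in run_order(dependencies):
        print("run", task)
```

A real scheduler adds retries, time-based triggers, and monitoring on top of this ordering, which is why a dedicated tool is usually preferred over a hand-written script.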

In summary, TiDB can serve as the storage engine for a data warehouse, with excellent performance and scalability. It integrates with a range of data warehouse tools and ecosystems, and the data layering approach, ETL tools, and scheduling tools can all be chosen according to your requirements.

| username: cassblanca

TiDB is an HTAP database built for hybrid workloads, but it currently does not support stored procedures and similar features, which makes it not very friendly for ETL. Whether it fits depends on your needs.

| username: redgame

It can be done using Apache Spark.

| username: 像风一样的男子

TiDB has relatively high disk requirements, and its storage cost is much higher than that of a traditional data warehouse.

| username: zhanggame1

TiDB stores three copies of the data as a baseline, and TiFlash adds further copies on top of that, so it may be rather wasteful of storage space.
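
The arithmetic behind this concern can be sketched as follows. This is a back-of-the-envelope estimate only: it assumes three row-store copies and one TiFlash copy, ignores compression and other overhead entirely, and the function name and figures are made up for illustration.

```python
def replicated_footprint_tb(
    logical_tb: float,
    row_store_copies: int = 3,  # the three copies mentioned above
    tiflash_copies: int = 1,    # assumed number of TiFlash copies
) -> float:
    """Estimate total stored data as logical size times the number of copies.

    Compression is ignored, so real disk usage may differ considerably.
    """
    return logical_tb * (row_store_copies + tiflash_copies)


if __name__ == "__main__":
    # 10 TB of logical data, three copies plus one TiFlash copy.
    print(replicated_footprint_tb(10))
    # The same data without TiFlash.
    print(replicated_footprint_tb(10, tiflash_copies=0))
```

Under these assumptions, 10 TB of logical data occupies roughly 40 TB with one TiFlash copy and 30 TB without it.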

| username: ShawnYan

Have you heard about Hologres? How does it compare?