Application of TiDB in Qichacha's Data Middle Platform and the v7.1 Upgrade Experience

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB 在企查查数据中台的应用及 v7.1 版本升级体验

| username: 苏州刘三枪

This article introduces Qichacha's experience applying TiDB to build its data middle platform. By migrating from MySQL to TiDB, Qichacha constructed a real-time data warehouse framework based on TiDB + Flink, fully leveraging TiDB's distributed architecture, MySQL compatibility, and comprehensive ecosystem tooling to bring its data processing online. As of September 2023, Qichacha's TiDB clusters have been upgraded to v7.1.1. The article also shares several useful features and version upgrade experiences, including the role of the active TiDB open-source community and the importance of TiDB's stability in the team's decision-making.

The authors of this article are Zhao He and Wang Yunhe, DBA team of Qichacha Big Data Architecture Department.

Qichacha is a technology company focused on enterprise credit information services. Relying on big data, artificial intelligence, and other technologies, it provides comprehensive, accurate, and timely enterprise credit information to help companies reduce costs, increase efficiency, and manage risks. In May 2023, Qichacha officially released the world’s first commercial investigation large model—“Zhibi Alpha.” This model is trained based on Qichacha’s global enterprise credit data and can provide multi-dimensional data services for judicial, financial, risk control, and government personnel.

Data is the core of Qichacha’s business, requiring massive data cleaning, analysis, and mining to fully unlock its value. Before introducing TiDB, Qichacha used MySQL databases. MySQL is a popular open-source relational database but has single-machine performance bottlenecks. When the data volume reaches a certain scale, vertical scaling can only improve performance to a limited extent, and performance is constrained by single-machine capabilities in high-concurrency write and complex SQL query scenarios.

Because MySQL is a single-machine database, the only way to back it up without interrupting the business is hot backup. As data volume grows, however, hot backups become increasingly slow and significantly impact database performance, and restoring from them is equally slow. In Qichacha's data flow, data collected by crawlers is first written to the database and then cleaned by Flink. Since MySQL cannot deliver change data to Flink directly, Flink has to read from and write back to the database, putting significant pressure on MySQL.

At the end of 2019, we came into contact with TiDB through the TiDB community and became very interested in it. After comparative selection tests, we chose TiDB, built a real-time data warehouse framework with Flink + TiDB, and applied it to Qichacha's data middle platform. The main reasons for choosing TiDB are:

◉ Almost no learning cost to switch to TiDB

Due to the many issues with MySQL, we urgently needed a database compatible with the MySQL protocol that could solve the above problems. TiDB performs excellently in MySQL compatibility, supporting most MySQL syntax and functions, and related tools in the MySQL ecosystem also support it by default. Additionally, the user experience of TiDB is almost identical to MySQL, making the switch to TiDB very friendly for our MySQL-based DBAs with almost no learning cost.

◉ Native distributed architecture brings significant advantages

Under the premise of MySQL protocol compatibility, we also needed a distributed database that could scale horizontally and flexibly to keep up with business growth. We compared and tested sharding-based distributed databases and found that they were intrusive to application development and offered limited scalability. TiDB adopts a native distributed architecture inspired by Google's Spanner and F1 papers: storage and computing are separated, there are no centralized nodes, the cluster can be scaled at will, and distributed transactions are supported. In addition, TiDB's storage layer is built on the Raft consensus algorithm; data sharding requires no pre-planned shard keys, and the default of 3 replicas ensures high data availability. Every component in a TiDB cluster is designed for high availability, which in turn ensures high service availability.

◉ Comprehensive surrounding tools

TiDB's surrounding tools are excellent, especially the monitoring system. It is built on common components such as Prometheus, Grafana, and Alertmanager, so it integrates seamlessly into our enterprise monitoring and alerting system. The monitoring is also very comprehensive, covering every aspect of system operation and making it easy to troubleshoot issues. TiDB's upstream and downstream data migration and synchronization tools are mature as well, especially TiCDC, which supports synchronizing data from TiDB to Kafka and provides the commitTS feature to ensure data consistency. TiDB's backup and recovery tools are also complete, supporting logical backups (Dumpling) and physical backups (BR) without interrupting the business. During backups, TiDB executes backup tasks in parallel across its distributed nodes, which is far more efficient than MySQL's single-machine backups.
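For reference, BR's backup capability is also exposed through SQL statements; a minimal sketch, assuming a placeholder database name and a backup path reachable from every TiKV node, might look like this:

```sql
-- Physical backup of one database to storage mounted on every TiKV node;
-- the BACKUP statement offers roughly the same functionality as the BR tool.
BACKUP DATABASE qcc_data TO 'local:///mnt/backup/qcc_data/';

-- Restore the database later from the same location.
RESTORE DATABASE qcc_data FROM 'local:///mnt/backup/qcc_data/';
```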

◉ Active open-source community

TiDB's community forum is very active, and questions we raise are quickly answered by other members; new questions or answers appear every few minutes. In addition, many tech enthusiasts write blogs and technical articles, which help us a great deal when solving day-to-day TiDB issues. We also take part in TiDB community offline events, where everyone speaks up and shares the experiences and problems they have encountered with TiDB. The community organizers record issues carefully and adopt developers' suggestions. This open and transparent community interaction gives us great confidence in using TiDB.

◉ Friendly big data ecosystem

Data written into the database needs to be cleaned by Flink. TiDB integrates well with the open-source big data ecosystem, which also made it convenient for us to adopt TiCDC. By synchronizing TiDB's data to Kafka through TiCDC, Flink can consume the changelog for cleaning, and other downstream data platforms can consume the same Kafka data as well, which is convenient and flexible.

TiDB is applied to Qichacha's data middle platform, covering the entire process from data collection to data cleaning and providing data storage and query. We replaced the original 20+ MySQL databases with 2 TiDB clusters. In the data cleaning process, we use TiDB's companion data synchronization tool TiCDC to synchronize data to downstream databases and Kafka; currently, nearly a thousand tables are synchronized. The flow from data collection to data cleaning is driven by TiCDC capturing change data and synchronizing it to Kafka. In addition, we use TiCDC's CommitTs feature to ensure data consistency through optimistic locking before data is updated downstream, as sketched below.
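A hypothetical sketch of that optimistic-locking pattern (the table, columns, and values below are made up for illustration and are not our actual schema): each downstream row stores the commitTs of the last applied change, and an update is applied only when it carries a newer commitTs.

```sql
-- Downstream mirror table that records the last applied TiCDC commit timestamp per row.
CREATE TABLE IF NOT EXISTS company_mirror (
    id        BIGINT PRIMARY KEY,
    name      VARCHAR(255),
    commit_ts BIGINT UNSIGNED NOT NULL DEFAULT 0
);

-- Optimistic, idempotent apply: a replayed or out-of-order change carrying an older
-- commitTs matches zero rows and is skipped, keeping the mirror consistent.
UPDATE company_mirror
SET    name = 'new name', commit_ts = 444444444444444444
WHERE  id = 1001 AND commit_ts < 444444444444444444;
```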

Qichacha Data Middle Platform System Logical Diagram

TiDB data lake ingestion uses a self-developed Flink Hybrid Source. Full data is obtained through sharded queries against TiDB, while incremental data is obtained by consuming the changelog that TiCDC pushes to Kafka; both are written into the Iceberg data lake in near real time (minute-level). The Flink Hybrid Source supports three synchronization modes: full, incremental, and integrated full-plus-incremental.
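The Hybrid Source itself is in-house code, but as a rough, hypothetical illustration of its incremental half, a plain Flink SQL source reading the TiCDC changelog from Kafka (assuming TiCDC is configured with the canal-json protocol; the topic, broker, and columns are placeholders) could be declared like this:

```sql
-- Flink SQL: consume one table's TiCDC changelog from Kafka as a changelog stream.
CREATE TABLE company_changelog (
    id         BIGINT,
    name       STRING,
    updated_at TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'ticdc-company',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'lake-ingest',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'canal-json'
);
```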

We also synchronize part of TiDB's data to ES to serve certain search scenarios. For offline data, we use the Chunjun/Seatunnel synchronization tools to move it into the Hive offline data platform for downstream batch processing. We are currently evaluating TiFlash and plan to migrate some complex offline queries from Hive to TiDB this year, querying TiDB directly to reduce data movement across multiple stacks and further improve data freshness.
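For a sense of what that migration would involve (the table name is a placeholder), adding TiFlash columnar replicas to a table is a single DDL statement, after which the optimizer can route analytical queries to them:

```sql
-- Add two columnar replicas of the table on TiFlash nodes.
ALTER TABLE company_report SET TIFLASH REPLICA 2;

-- Check replication progress; AVAILABLE = 1 means the replicas are ready to serve queries.
SELECT table_schema, table_name, replica_count, available, progress
FROM   information_schema.tiflash_replica
WHERE  table_name = 'company_report';
```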

1. Online Data Value

The TiDB cluster's distributed read and write capability far exceeds MySQL's. Whether it is the source crawlers writing raw data into TiDB or Flink writing back cleaned data, TiDB meets the business requirements. Combined with Flink's real-time computing, TiDB keeps the data fresh. In addition, TiDB's ability to read data in parallel across nodes significantly strengthens data distribution and query capability, bringing data value online.

2. Improved Data Flow Efficiency

TiDB is highly compatible with the upstream and downstream data ecosystem. At the ingestion end it supports standard JDBC writes, so source data can be written into TiDB as simply as into MySQL. At the output end, TiDB can distribute data to downstream Kafka through TiCDC, using the CommitTS feature to keep business data consistent, and it can also synchronize data to downstream big data platforms through standard interfaces, improving data flow efficiency across the enterprise and revitalizing data assets.

1. Sharing Some Useful Features

◉ Resource Control Meets Multi-Tenant Needs of Different Businesses

TiDB 7.1 introduced the resource control feature, and we upgraded to this version quickly. After the upgrade, we left the normal program accounts on the query platform without resource control so that their resources are always guaranteed, while non-program accounts were placed under partial resource control to prevent excessive consumption from affecting the query efficiency of the program accounts. This allowed us to consolidate different types of business into one TiDB cluster, improving resource utilization and reducing costs by 30%. In addition, TiDB's resource control provides monitoring from multiple perspectives, giving us a clear view of each business module's resource usage.
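A minimal sketch of such a setup using TiDB 7.1's resource control SQL (the group name, RU quota, and account are placeholders, not our production configuration):

```sql
-- Create a resource group with a request-unit (RU) quota for lower-priority accounts;
-- BURSTABLE lets the group borrow spare capacity when the cluster is idle.
CREATE RESOURCE GROUP IF NOT EXISTS rg_adhoc RU_PER_SEC = 2000 BURSTABLE;

-- Bind a non-program (ad-hoc) account to the group; program accounts stay on the
-- default group and are therefore not throttled.
ALTER USER 'adhoc_user' RESOURCE GROUP rg_adhoc;

-- Inspect the defined groups and their quotas.
SELECT * FROM information_schema.resource_groups;
```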

◉ Point-in-Time Recovery Within the GC Window

We set TiDB's GC lifetime to 28 hours, which lets us read historical data from the past 28 hours. If data is deleted by mistake, we can also use flashback to recover it within that 28-hour window. Compared with MySQL binlog-based recovery, this is far more efficient.
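The relevant knobs, as a minimal sketch (the table name and time offset are placeholders):

```sql
-- Keep MVCC versions for 28 hours before GC cleans them up.
SET GLOBAL tidb_gc_life_time = '28h';

-- Stale read: query the table as it looked two hours ago (within the GC window).
SELECT * FROM company_info AS OF TIMESTAMP NOW() - INTERVAL 2 HOUR WHERE id = 1;

-- Recover a table dropped by mistake, as long as GC has not yet cleaned it up.
FLASHBACK TABLE company_info;
```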

◉ Automatic Hotspot Scheduling

In TiDB v3.0 and v4.0, TiDB's ability to handle hotspot issues was limited and manual intervention was often required. After upgrading to v7.1, the hotspot scheduling capability has improved significantly: hotspot data is scheduled automatically, effectively solving hotspot issues.
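As a side note (generic inspection statements, not tied to our cluster; the table name is a placeholder), hotspot behavior can be observed from the SQL layer, which is how one can confirm that the scheduler has spread hot Regions out:

```sql
-- List the Regions that TiDB currently considers read/write hotspots.
SELECT * FROM information_schema.tidb_hot_regions;

-- Show how a table's Regions are distributed across TiKV stores.
SHOW TABLE company_info REGIONS;
```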

2. Version Upgrade Experience

In September 2020, we upgraded TiDB to v4.0.6 and subsequently to v4.0.15. On v4.0 we ran into some issues, including TiDB restarts triggered by deleting large amounts of data, DDL blocking, and an immature TiCDC. During this stage, whenever we hit a problem we first looked for answers in the TiDB community, where many experienced users and developers helped us; we also actively participated in community discussions and shared our own experience. In August 2023, we upgraded to v6.5.3, where the issues above were resolved. The most notable improvements were TiCDC's stability and the elimination of the TiDB restart issue, along with significant performance gains.

In September 2023, we upgraded to TiDB v7.1.1. After the upgrade, system performance improved significantly, with QPS peaks reaching 50-60K and the 95th percentile response time dropping from over 60 ms to 10-30 ms. We also put the v7.1 resource control feature to use, and it meets our business needs well. On v7.1 we encountered two issues:

TiDB's memory control parameters changed from session-level to per-SQL granularity, and exceeding the memory threshold now causes access-blocking issues. We are actively looking for a solution.

The TiCDC partition-num parameter was ineffective (reference: the TiDB Q&A community post "TiDB 7.1.1's TiCDC parameter partition-num is ineffective"). We reported this to the TiDB community and quickly received feedback; the fix landed in Pull Request #9955 on pingcap/tiflow ("kafka(ticdc): topic manager return the partition number specified in the sink-uri" by 3AceShowHand, on GitHub).

The resolution of these issues fully demonstrates the advantages of TiDB’s open-source model, namely the power of the community.

| username: tidb菜鸟一只 | Original post link

Nice share.

| username: Billmay表妹 | Original post link

Thank you for sharing.

| username: cassblanca | Original post link

I would like to ask: with gc_life_time set to 28 hours, how much data redundancy does the MVCC mechanism cause? My understanding is that Qichacha's main workload should be dominated by writes and updates, with few deletions, so it should not lead to huge multi-version data redundancy.

| username: wfxxh | Original post link

Thank you for sharing.

| username: onlyacat | Original post link

Thank you for sharing.

| username: zhang_2023 | Original post link

Thank you for sharing.

| username: Jellybean | Original post link

Thank you for sharing.

This is really great. Combined with hot/cold data archiving, HTAP usage can be implemented very well; highly recommended.
We previously repurposed our Hive machines for this and eventually took the Hive cluster offline and stopped using it.

| username: redgame | Original post link

Thank you for sharing.

| username: WinterLiu | Original post link

Thank you for the wonderful sharing.

| username: TiDBer_uI8QIp1t | Original post link

Is this the phenomenon where, under high concurrency on a page, a few requests get processed first and the rest only start processing after those finish? Is there a solution for this now?

| username: 纯白镇的小智 | Original post link

Thank you to the senior poster for sharing; I benefited a lot.

| username: zhanggame1 | Original post link

Thank you for sharing.

| username: gcworkerishungry | Original post link

:heart: Thank you for sharing.

| username: 宸凡_22 | Original post link

Thanks for sharing, I learned a lot.