Note:
This topic has been translated from a Chinese forum by GPT and might contain errors. Original topic: With over 110 million monthly active users and more than 400 million registered users, how does Zhihu, which you are probably using too, run multi-cloud active-active on ultra-large-scale TiDB clusters?
Introduction
In the “2024 New Year Tea Gathering,” I shared the topic “TiDB Practices at Zhihu,” reviewing the latest progress in Zhihu’s TiDB practices over the past two years and my personal views on the future direction of databases.
Video link:
Zhihu has a long history of using TiDB, with over 400 TiDB Server nodes and 600 TiKV nodes deployed and a data scale at the PB level. Zhihu's largest TiDB cluster exceeds 500 TB, taking in up to 30 TB of new data per day on a single replica, or nearly 100 TB across three replicas. At this scale, about 30% of Zhihu's core scenarios run on TiDB.
From the perspective of database versions, currently, about 80% of Zhihu’s TiDB clusters are on TiDB 4.x versions, around 10% are on TiDB 5.4.3, and a small number are on 6.5 versions. The adoption of 6.x versions is mainly due to some new features that attract us, such as improvements in hot table handling, CDC capabilities, and multi-cloud active-active capabilities.
1 Multi-Cloud Active-Active Solution Practice
The first practice shared is “multi-cloud active-active.”
Currently, Zhihu is the only company in the TiDB community that has implemented multi-cloud active-active in an internet scenario. At the beginning of 2023, we had some exchanges with PingCAP’s internal product and research team regarding the multi-cloud active-active architecture.
TiDB’s official documentation actually provides many multi-cloud active-active solutions:
The first is the primary-backup cluster solution based on TiCDC. TiCDC can meet some multi-cloud active-active needs, similar to running TiDB clusters in a MySQL-style Master and Slave setup. The key to this architecture is CDC's synchronization capability, and in TiDB 4.x that capability was indeed weak. At my former company, I collaborated with PingCAP on the TiCDC project, proposing and verifying scenarios, applications, and requirements for TiCDC. For example, early TiCDC often ran into OOM when pulling data from clusters with heavy writes, and its pulling added roughly 20% extra load on TiKV; through iterations and optimizations with the CDC team, that impact was brought down to within 5%. When syncing downstream, TiCDC also had performance bottlenecks: it pulled quickly but wrote downstream slowly, which drove up memory consumption and caused OOM. Later, CDC was modified to spill sorted data to files, avoiding some of the OOM issues. Through these optimizations, TiCDC gradually became stable and usable.
Another is the adaptive synchronization cluster solution: primary and backup data centers in the same region, with Raft-level synchronization handled adaptively, or deployments such as three centers in the same city or five centers across different locations, spreading the TiDB cluster over different IDCs or clouds.
The solution we ultimately chose is based on Placement Rules. We have three IDCs, all on public cloud. Two are online data centers that take online read-write traffic, each with its own TiDB Server, PD, and TiKV nodes, and one is a disaster recovery data center. With five replicas in a 2-2-1 configuration, we pin replicas to the different IDCs through Placement Rules and combine this with follower read or stale read to serve online traffic. This keeps the whole cluster available even if one of the online data centers goes down. Recently, some major cloud providers and well-known apps have suffered outages; when we talk about the future of databases, multi-cloud active-active is a direction that has to be addressed and developed.
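To make the 2-2-1 layout more concrete, here is a minimal sketch of how such a replica placement could be expressed as PD Placement Rules. The label key (`zone`), the IDC names (`idc1`/`idc2`/`idc3`), and the rule group name are assumptions for illustration, not our production configuration; the exact pd-ctl subcommand should be checked against the docs of the version you run.

```python
import json

# Minimal sketch of a 2-2-1 five-replica layout expressed as PD Placement Rules.
# Assumption: TiKV nodes carry a "zone" label with values idc1/idc2 (online IDCs)
# and idc3 (disaster-recovery IDC). All names are illustrative.
rules = [
    {
        "group_id": "multi-idc",          # hypothetical rule group name
        "id": f"voters-{zone}",
        "role": "voter",
        "count": count,
        "label_constraints": [
            {"key": "zone", "op": "in", "values": [zone]}
        ],
        "location_labels": ["zone", "host"],
    }
    for zone, count in [("idc1", 2), ("idc2", 2), ("idc3", 1)]
]

# Write the rules to a file and load them with pd-ctl, e.g.:
#   pd-ctl config placement-rules save --in=rules.json
# (verify the exact subcommand for your PD version)
with open("rules.json", "w") as f:
    json.dump(rules, f, indent=2)

# On the TiDB side, reads can be spread to follower replicas, e.g.:
#   SET GLOBAL tidb_replica_read = 'follower';
# or served from a slightly older snapshot via Stale Read.
print(json.dumps(rules, indent=2))
```

With the five voters split 2-2-1, losing either online IDC still leaves three of the five replicas, so Raft majorities survive a full data-center outage, which is exactly the availability property described above.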
2 Database Stability Construction
Next, let’s talk about the stability construction of our database, specifically TiDB’s stability construction. TiDB stability construction is divided into several aspects:
2.1 Observability
Enhancing observability is a very effective way to discover and locate problems. What does observability include? Generally, it includes metrics, traces, and logs.
First, metrics. When I joined Zhihu, I found there were many monitoring items and many cluster versions. If you have read the documentation, you know a TiDB cluster comes with roughly 90 monitoring items. Every cluster has that many items, and every cluster's load is different, so the thresholds that should trigger alarms also differ. The company, however, rendered monitoring for every cluster from the same template, which means clusters with very different loads shared the same alarm thresholds, and my phone kept receiving warning messages through the night, making it impossible to sleep. If the metrics problem is not solved, life is very painful.
So, how did we solve this problem?
First, we set different alarm thresholds for different clusters, tailored to each one. Different cluster versions have different metric expressions, and thresholds vary across versions, so you need to adapt dynamically: build a configuration table and instantiate each cluster's ninety-odd monitoring items from it. Then, on the alarm side, you can aggregate alarms. Previously, the same alarm would be sent in bulk; for example, a schema error in a cluster shared by several people and teams would page everyone separately, whereas after aggregation only one alarm reaches us. How to aggregate alarms, preset alarms, and customize alarms for different people is all part of the metrics work. By optimizing these thresholds and alarms, we cut the number of critical alarms from a peak of 1,500 down to a few dozen. With only a few dozen alarms per week, spread over 7 days, that is only a handful per day, and fewer of them at night, so DBAs can finally sleep.
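As a rough illustration of the "configuration table plus per-cluster instantiation" idea, here is a minimal sketch that renders Prometheus-style alert rules with per-cluster thresholds. The cluster names, the `tidb_cluster` label, and the threshold values are made up for illustration; a real system would cover the full set of roughly ninety items per cluster.

```python
# Minimal sketch: instantiate the same alert with per-cluster thresholds instead of
# one company-wide template. Cluster names, label key, and thresholds are hypothetical.
THRESHOLDS = {
    "feed-tidb":   1.0,   # busy S-level cluster, looser p99 latency threshold (seconds)
    "search-tidb": 0.5,
    "misc-tidb":   2.0,   # low-priority cluster, alert later
}

RULE_TEMPLATE = """\
- alert: TiDB_query_duration_p99_{cluster}
  expr: >
    histogram_quantile(0.99, sum(rate(
      tidb_server_handle_query_duration_seconds_bucket{{tidb_cluster="{cluster}"}}[1m])) by (le))
    > {threshold}
  for: 5m
  labels: {{severity: critical, cluster: {cluster}}}
  annotations:
    summary: "{cluster}: p99 query duration above {threshold}s"
"""

def render_rules(thresholds):
    """Render one alert-rule fragment per cluster from the shared template."""
    return "\n".join(RULE_TEMPLATE.format(cluster=c, threshold=t)
                     for c, t in thresholds.items())

if __name__ == "__main__":
    print(render_rules(THRESHOLDS))
```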
The second is traces plus logs; together they are a very good means of locating problems in TiDB. Zhihu is all-in on K8s and manages everything with TiDB Operator. Under this architecture, persisting logs requires binding a PVC to a corresponding PV, just as storage does. Our logs, however, were not bound to a PV: the data disks are dedicated to TiKV, and dynamic cloud disks are the recommended choice for logs, but for cost reasons we did not configure them. So logs were previously only written to each Pod's standard output with a 100 MB rotation. When a cluster produces a particularly large volume of logs and they are not persisted to disk, they may only be retained for a few minutes, maybe 10 minutes. If a problem happened more than 10 minutes ago, there are simply no logs left to check.
During fault review meetings with business lines, the business would push back: "Why don't you have that log?" When a problem cannot be diagnosed because the logs are missing, the responsibility for the fault may land on us. So what did we do? We built an ELK pipeline, using filebeat to collect the various logs, including TiDB error logs, slow logs, PD and TiKV logs, and more, and ship them into Kafka. In ELK, logstash parses the logs with Lua. For TiDB's slow logs, however, you need to normalize SQL of the same pattern into a single statement template before aggregation, which is hard to do with Lua, so we wrote our own aggregation logic to store these slow logs.
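The step that generic log parsing struggles with is turning many concrete slow queries into one statement template before aggregation. Below is a minimal sketch of that fingerprinting idea in Python; the regexes are deliberately simplified and the demo data is invented, so treat it as an illustration of the technique rather than our production parser.

```python
import re
from collections import defaultdict

def fingerprint(sql: str) -> str:
    """Normalize a concrete SQL text into a template so that slow queries
    differing only in literal values aggregate into a single entry."""
    s = re.sub(r"\s+", " ", sql.strip().lower())     # collapse whitespace
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)         # string literals -> ?
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)           # numeric literals -> ?
    s = re.sub(r"in \([?, ]+\)", "in (?)", s)        # collapse IN lists
    return s

def aggregate(slow_entries):
    """slow_entries: iterable of (sql_text, query_time_seconds) tuples,
    e.g. parsed from TiDB slow logs shipped through Kafka."""
    stats = defaultdict(lambda: {"count": 0, "total_time": 0.0, "max_time": 0.0})
    for sql, t in slow_entries:
        st = stats[fingerprint(sql)]
        st["count"] += 1
        st["total_time"] += t
        st["max_time"] = max(st["max_time"], t)
    return stats

if __name__ == "__main__":
    demo = [
        ("SELECT * FROM answers WHERE question_id = 42", 1.3),
        ("SELECT * FROM answers WHERE question_id = 77", 2.1),
    ]
    for tpl, st in aggregate(demo).items():
        print(st["count"], round(st["total_time"], 1), tpl)
```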
2.2 Resource Isolation
Another aspect of stability construction is resource isolation. Previously, our resource isolation method was quite rough. Why rough? Because the previous method was to use a large TiDB cluster, with each business line creating databases within the cluster. If resource isolation is not done well, db1’s requests can affect db2, causing many problems.
Previously, everyone used the same batch of TiDB Servers. The problem was that if one business ran large SQLs that consumed a lot of resources, it caused stability fluctuations for the whole cluster. In fact, compute can be isolated by giving different businesses their own groups of TiDB Server nodes, achieving isolation on the computing side.
Additionally, TiDB has various SQL suppression strategies. For example, max execution time: once a DML or SELECT statement exceeds its time budget, it is killed without mercy. This is a very strict method. Another is SQL quotas: you can set various quotas to keep business SQL within a minimal resource footprint. You can also build automated SQL-killing programs, similar to pt-kill, that kill SQL from specific users or of specific types before it ever reaches the maximum execution time. These are the fine-grained details of stability work.
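As a rough sketch of what such an automated killer can look like (in the spirit of pt-kill), the snippet below polls TiDB's processlist and kills queries from specific accounts that have run past a budget. The connection parameters, account list, and threshold are placeholders; it assumes the `pymysql` driver, and a real tool would add whitelists, a dry-run mode, and audit logging.

```python
import time
import pymysql  # assumed MySQL-protocol driver; any equivalent works

# Placeholder connection info and policy; tune per cluster.
CONN = dict(host="tidb.example.internal", port=4000, user="dba_killer", password="***")
TARGET_USERS = {"bi_readonly", "batch_job"}   # only kill SQL from these accounts
MAX_SECONDS = 30                              # kill before max_execution_time would fire
# Alternative hard cap on the server side (milliseconds):
#   SET GLOBAL max_execution_time = 30000;

def kill_long_queries():
    conn = pymysql.connect(**CONN)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT ID, USER, TIME, INFO FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST "
                "WHERE COMMAND = 'Query' AND TIME > %s", (MAX_SECONDS,))
            for qid, user, t, info in cur.fetchall():
                if user in TARGET_USERS and info:
                    # Note: with multiple tidb-servers, KILL must reach the instance
                    # that owns the connection (or rely on global kill in newer versions).
                    cur.execute(f"KILL TIDB {int(qid)}")
                    print(f"killed {user} query ({t}s): {info[:80]}")
    finally:
        conn.close()

if __name__ == "__main__":
    while True:
        kill_long_queries()
        time.sleep(5)
```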
In a mixed-use cluster, how do you achieve resource isolation on the storage side (TiKV)? The first method is to split the cluster. If one tenant in a mixed-use cluster causes fluctuations that affect the others, split it out into its own cluster so it cannot drag down the core cluster. Previously, S-, A-, B-, and C-level businesses were all mixed together, so deployment planning was needed: every S-level business gets its own cluster, a large A-level business line might get three clusters, and all B- and C-level businesses are thrown into an older cluster to share. This involves a lot of database migration, but it avoids piling databases of every business level into one giant cluster, and once the split is done, SLA commitments can be made.
Each business line has its own SLO indicators, and DBAs are asked to meet SLA targets, such as committing to 99.95% availability. That 99.95% is only promised to S-level businesses; for non-S-level businesses, we only guarantee 99.9%. After splitting the clusters, we can confidently make these commitments to the business and actually hold our SLA. This is the most core indicator in the DBA trade: stability is paramount. No matter how well everything else is done, if three P0 database incidents happen in a year, your performance review will not look good.
3 Tianqiong Platform
After discussing multi-cloud active-active and stability construction, let’s talk about a platform Zhihu has been working on for the past two years—the Tianqiong Platform. What is the Tianqiong Platform for? It’s similar to TiEM plus TiDB dashboard, integrated into one. The benefit of this platform is that we want to manage the entire lifecycle of TiDB, from creation to migration, daily operations, and decommissioning, all through a single platform, using a point-and-click interface for operational control.
We created this platform, collected its metadata, and then performed migrations, management, leader switching, log viewing, slow SQL reports, SQL killing, monitoring, and alarms, all integrated into one platform. Both DBAs and business users can use this platform. Business users can create clusters, view cluster QPS, and check if TiDB is causing business fluctuations by logging into the cluster and viewing cluster monitoring. For example, combined with FinOps, businesses can check TiDB costs. Now everyone is focused on cost reduction and efficiency, so they can see if there are any large tables that are no longer in use or if resource usage exceeds their estimates. Since we allocate resource costs by GB to the business, they can adjust their resource quotas and write strategies through the platform.
They can also handle daily on-call tasks through the platform, such as submitting an on-call request in Jenkins for DBAs to expand capacity or perform other operations. Everything is handled through the Tianqiong Platform, with self-service processing, layer-by-layer approval, and complete lifecycle management of TiDB.
The next part shares some thoughts drawn from my 13 years in the database industry. In my view, the development of databases can be discussed from three angles:
1 Database Product Strength
Database product strength lies in what functions the database achieves. For example, ACID consistency, support for distributed transactions, multi-tenancy, resource isolation, etc., all belong to product strength.
For example, TiDB has always emphasized high performance and smooth scaling, which is its product strength. Another important aspect of product strength is stability. Recently, TiDB has focused more on stability in feature iterations, ensuring sufficient stability under the same workload.
So, product strength is about what functions your product delivers, plus its stability, performance, and features. These are the basics of a product; with this foundation, users can recognize and adopt it. Beyond that, the product needs an open and complete ecosystem.
2 Database Ecosystem
The ecosystem involves how data is migrated from another database to TiDB, including DM components, early tools based on mydumper and myloader, and Lightning. I’ve used all these tools and can flexibly apply them to different business scenarios. For example, for offline data migration, I can directly dump and load into TiDB. For a new cluster, I can use Lightning to generate KV pairs at a low level and quickly import them into TiDB, as it’s a new cluster with no load.
As mentioned above, data migration is crucial. As an open-source distributed database, data must be able to come in and go out. Locking data in after migration is unacceptable to internet companies that favor open-source DBs.
How does data go out? There are two ways. The first is the early TiDB Binlog: Pump + Drainer. Pump collects the binlog written by the TiDB Servers, and Drainer merges it and replicates it downstream. Later, we replaced TiDB Binlog with TiCDC, which exports incremental data changes in near real time.
Another ecosystem tool is BR for backups. Database backups are crucial: even if a database is dropped, you can still recover the data from S3 or elsewhere. And within TiDB itself, even if you hit an accidental delete, as long as the GC lifetime is set long enough, the various flashback and historical-read capabilities can help recover the data.
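As an illustration of the "as long as GC keeps enough history" point, here is a minimal sketch that reads a table as it looked before an accidental DELETE by using the `tidb_snapshot` session variable, then writes the rows back. The table, timestamp, and connection details are hypothetical, historical reads only work within the configured `tidb_gc_life_time` window, and newer versions additionally offer FLASHBACK statements for similar recovery.

```python
import pymysql  # assumed MySQL-protocol driver

conn = pymysql.connect(host="tidb.example.internal", port=4000,
                       user="dba", password="***", database="test")
with conn.cursor() as cur:
    # Read the table as of a point in time before the accidental DELETE.
    # Only possible while that version is still within tidb_gc_life_time.
    cur.execute("SET @@tidb_snapshot = '2024-01-15 10:00:00'")
    cur.execute("SELECT * FROM orders WHERE user_id = 42")   # hypothetical table
    lost_rows = cur.fetchall()

    # Return to the current snapshot (writes are not allowed while it is set).
    cur.execute("SET @@tidb_snapshot = ''")
    for row in lost_rows:
        placeholders = ", ".join(["%s"] * len(row))
        cur.execute(f"INSERT IGNORE INTO orders VALUES ({placeholders})", row)
conn.commit()
conn.close()
```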
There are also monitoring tools. TiDB ships monitoring based on Prometheus (a CNCF project) and Grafana, plus the TiDB Dashboard, which provides insights into cluster regions, heatmaps, topology, and diagnostics.
Additionally, there are deployment-related tools. I have lived through the TiDB Ansible era and the tiup era, and we currently use TiDB Operator. These tools are part of the historical progression and keep evolving.
Finally, there are platform-side tools, such as the Tianqiong Platform mentioned earlier, or your own developed TiDB management platforms. These are all part of the database ecosystem.
Moreover, there is the developer ecosystem. For example, developing a JuiceFS based on TiKV or creating your own KV system based on TiKV. These are all part of the database ecosystem, enabling data to come in, be managed, and go out. A database with a robust ecosystem can be fully utilized.
3 Database Scenarios
Scenarios refer to the problems the database solves for users. TiDB has always emphasized solving the pain point of MySQL sharding, which is a significant scenario. Many large and medium-sized companies have this pain point. If you can solve it stably, combined with strong product strength and ecosystem, you can effectively address database adoption.
Later, TiDB introduced HTAP scenarios. In my view, OLTP and OLAP are inseparable and only differ in proportion. For example, when I order dinner, the merchant's transaction system is an OLTP workload, but it also carries some OLAP needs: the merchant wants to know which dishes have sold best in the last half hour, and with 10 chefs they might use real-time OLAP to decide how to allocate them. This can happen inside a single database or through near-real-time ETL, but either way HTAP has genuine application scenarios.
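To make the restaurant example concrete, here is a hedged sketch of the kind of real-time analytical query that can run next to the transactional writes, optionally steered to TiFlash columnar replicas with an optimizer hint. The database, table, and column names are invented for illustration, and the hint only has an effect where TiFlash replicas actually exist.

```python
import pymysql  # assumed MySQL-protocol driver

conn = pymysql.connect(host="tidb.example.internal", port=4000,
                       user="app", password="***", database="restaurant")
with conn.cursor() as cur:
    # OLTP side: the ordering flow keeps inserting rows into order_items.
    # OLAP side: "which dishes sold best in the last 30 minutes?"
    # The read_from_storage hint pushes the scan to TiFlash replicas so the
    # analysis does not compete with the OLTP traffic on TiKV.
    cur.execute("""
        SELECT /*+ read_from_storage(tiflash[order_items]) */
               dish_id, COUNT(*) AS sold
        FROM order_items                         -- hypothetical table
        WHERE created_at >= NOW() - INTERVAL 30 MINUTE
        GROUP BY dish_id
        ORDER BY sold DESC
        LIMIT 10
    """)
    for dish_id, sold in cur.fetchall():
        print(dish_id, sold)
conn.close()
```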
Additionally, there are multi-tenancy scenarios. Multi-tenancy is the best tool for solving mixed-use clusters, but this feature is only available in 7.x versions, while we are still on 4.x.
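For reference, a minimal sketch of what the newer multi-tenancy (resource control) capability looks like in 7.x; since we are still on 4.x this is not something we run, and the group names, RU budgets, and bound account below are illustrative assumptions to be checked against the docs of the version you deploy.

```python
import pymysql  # assumed MySQL-protocol driver

# Sketch of 7.x-style resource groups, issued over the MySQL protocol.
# Group names, RU budgets, and the bound user are hypothetical.
stmts = [
    "CREATE RESOURCE GROUP IF NOT EXISTS rg_feed RU_PER_SEC = 2000",
    "CREATE RESOURCE GROUP IF NOT EXISTS rg_batch RU_PER_SEC = 300",
    # Bind an account to a group so its traffic in a mixed-use cluster is capped.
    "ALTER USER 'batch_job'@'%' RESOURCE GROUP rg_batch",
]

conn = pymysql.connect(host="tidb.example.internal", port=4000,
                       user="root", password="***")
with conn.cursor() as cur:
    for s in stmts:
        cur.execute(s)
conn.commit()
conn.close()
```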
Other scenarios include resource isolation, data distribution, and cheaper storage engines. For example, data distribution strategies can place frequently accessed hot data on NVMe SSDs and cold data on HDDs or S3. Or, with low QPS, use S3 as a storage engine to solve some usage issues.
Looking further ahead, I see three directions for databases. The first is AI x DB. One scenario I have seen a lot is using NLP to convert natural language into SQL, which helps product and operations colleagues who are not fluent in SQL query data more easily.
The second direction is AIOps. For example, using AI to analyze several GBs or tens of GBs of alarm logs to identify cluster issues, including access logs, error logs, and slow logs. For slow logs, AI can provide SQL suggestions based on table and index usage. When a database encounters a fault, AIOps can automatically repair it, providing self-healing capabilities. These are some directions for AI x DB.
The third direction is multi-modal and multi-workload databases. What is multi-modal? At the core, databases can store data as KV. We have relational databases, graph databases, document databases, and various KV databases. In the future, can we develop databases like TiRedis, TiMongo, TiHbase, etc.? These are conceivable possibilities. As long as you get the underlying storage right and the upper computation layer right, it can be achieved. Another direction is multi-workload, which is what TiDB is currently promoting with TiDB Serverless, where you pay as you use.
These are my thoughts on the future directions of database development, hoping to inspire everyone.
Finally, let’s review the Annual Beijing Offline New Year Tea Gathering.