How to Ensure Stable Operation of TiDB in Production

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 请问一下如何才能保证生产的tidb稳定运行

| username: tidb狂热爱好者

How to Avoid TiDB Business Anomalies in Production
Here is my current design:
Implement processes to prevent non-standard designs by developers.
Many developers are beginners; they only focus on implementation and not on database issues that can cause bottlenecks.
If you don’t do a good job at the beginning, the result will be your resignation.
I am concerned about two aspects:

  1. Data volume
  2. Read/write volume
  3. Enforce a TTL (Time to Live) default limit of one day, which can be extended to 30 days.
    TTL doesn’t mean deleting data but separating hot data from cold data. According to research, after the big data boom in 2008, big data gradually cooled down, and various NoSQL and SQL solutions returned to NewSQL. It’s not that data isn’t important; it’s that 99% of the data in business databases is never accessed.
    Transaction data is retained for three months, after which it is no longer available for queries. Even data from three months ago is likely only used to review orders from a few days ago. As a tech professional, the most important skill is communication. Reaching a compromise with the business is crucial, rather than trying to support hundreds of petabytes of TiDB to showcase technical prowess.

To ensure the stable operation of TiDB in production, you can take the following measures:

  1. Set parameters and prepare an appropriate deployment plan (machines, memory, SSDs, etc.) in advance based on business access volume and data volume.

  2. Design reasonable tables and indexes before business goes live, and distinguish between AP and TP traffic.

  3. Regularly train developers and require strict adherence to the developer manual ([Column - Loading | TiDB Community](专栏 - 加载中 | TiDB 社区 business SQL going live must be reviewed and approved by an administrator).

  4. If conditions permit, add a pre-SQL review platform to standardize such issues; post-implementation, regularly analyze cluster slow queries and performance inspections, locate corresponding business SQL, and optimize it or the business.

  5. Improve monitoring.

Ensuring the stable operation of the database is a comprehensive and systematic task, with many aspects that can be continuously summarized in practice. Essentially, a database is about read/write transactions. If the data volume read is large and the disk I/O is maxed out, the database will become stuck.

I will establish SQL usage guidelines, specifying what can and cannot be done. For example, varchar(255) should not exceed 255 characters. For large texts, use S3. Once the table structure restricts development, erroneous SQL written by developers will only cause high CPU usage, not I/O issues. High CPU usage is easier to solve than high I/O issues. TiDB is inherently highly available and scalable, so expanding it is straightforward. However, expanding I/O on the cloud is much more costly, which is my personal understanding.

Once you solve the I/O issue, everything else becomes manageable. Let me share a small story. I once had a DBA colleague who was very capable and technically proficient, and he was also friendly and good at dealing with people. However, he was fired for taking the blame because he didn’t know how to refuse developers’ requests. The database he maintained grew to 100TB, and it was an Oracle database with mechanical equipment back in 2008. The read/write speed was only 80MB, and even with RAID5, it was only around 500MB. As the database grew, the business became unusable. Although he was an ACE and proficient in various optimization methods, it was too late.

| username: Kongdom | Original post link

:rofl: I’m virtualizing physical machines here, and any batch write operation causes the IO to spike to 100%. How can I solve this situation?

| username: tidb狂热爱好者 | Original post link

In fact, the TTL expiration design of the database is also very good. I strictly require developers to use TTL expiration, with the default being 1 day, and some tables are strictly required to be 1 hour. The amount of data in the database is reduced.

| username: tidb狂热爱好者 | Original post link

I have already answered your question: limit IO.

| username: 啦啦啦啦啦 | Original post link

Limit the IOPS of the virtual machine?

| username: Kongdom | Original post link

:joy: It took an hour to insert 10 million records. If IO is further restricted, it won’t support my business.

| username: Kongdom | Original post link

:rofl: Did I not describe it clearly? Why is it about limiting IO instead of increasing IO settings?

| username: 啦啦啦啦啦 | Original post link

If not limited, this virtual machine might occupy the entire IO of the host machine, affecting other services.

| username: magic | Original post link

Isn’t it a bit excessive to mandate TTL?

| username: ealam_小羽 | Original post link

We also feel that there are similar issues. The IO spikes easily when writing to the cloud, often reaching peaks of 80-90. The operations team told me that it’s not a big problem if it doesn’t exceed 100, but I feel that some slow queries with indexes might also be affected by this.

Yesterday, I was reading “Designing Data-Intensive Applications,” which mentioned that the write bandwidth of LSM-Tree and the background data compression bandwidth are shared, so they might affect each other. I’m not sure if this issue is more severe in the cloud.

| username: jaybing926 | Original post link

Keep it up, you are the best!

| username: xingzhenxiang | Original post link

Keep it up, you are the best.
Since I don’t have a virtual machine here, just make sure the hardware doesn’t fail.

| username: Jellybean | Original post link

To ensure the stable operation of TiDB in production, the following measures can be taken:

  1. First, set parameters and prepare an appropriate deployment plan (machines, memory, SSD, etc.) in advance based on business traffic and data volume.

  2. Design reasonable tables and indexes before business goes online, and pay attention to distinguishing and isolating AP and TP traffic.

  3. During business development, conduct regular training and strictly follow the developer manual (专栏 - TiDB 数据库开发规范 | TiDB 社区). Any new business SQL must be reviewed and approved by an administrator.

  4. If conditions permit, establish a SQL pre-review platform to standardize such issues beforehand; afterward, regularly analyze cluster slow queries and performance inspections, locate corresponding business SQL, and optimize them or the business.

  5. Improve monitoring.

Ensuring the stable operation of the database is a comprehensive and systematic task, and there are many aspects that can be continuously summarized in practice.

| username: xingzhenxiang | Original post link

To ensure the stable operation of the TiDB system, the following measures can be taken:

Hardware and Network Requirements: Ensure that the hardware meets the minimum configuration requirements of TiDB, and that the network bandwidth and latency can meet business needs.

Database Design and Optimization: Design the database structure and indexes reasonably to avoid performance issues caused by full table scans and large data writes. Additionally, regularly optimize and adjust the database to ensure its performance is always at its best.

High Availability Architecture Design: Use TiDB’s high availability architecture (such as TiDB + PD + TiKV) to ensure system high availability and fault tolerance, avoiding single points of failure.

Monitoring and Alert Mechanisms: Establish comprehensive monitoring and alert mechanisms to monitor the status and performance indicators of the TiDB system in real-time, and promptly detect and handle anomalies to ensure the stable operation of the system.

Security Management: Strengthen the security management of the TiDB system, including access control, risk assessment, and vulnerability patching, to prevent security vulnerabilities from being exploited and to ensure the stability and security of the system.

| username: liuis | Original post link

  1. Strictly control the development and design of the database, including table structure design, field length, index creation, etc.
  2. Add cache to the program, especially for read-heavy and write-light scenarios.
  3. Regularly check for slow SQL in the service and optimize the program wherever possible.
  4. Implement robust database monitoring and alerting.
  5. Ensure comprehensive service governance, including rate limiting and circuit breaking, to prevent a large number of requests from penetrating to the database layer.
  6. Design TiDB services for high availability to minimize service downtime during database anomalies.
| username: xfworld | Original post link

A power outage can immediately render your ideas ineffective. :upside_down_face:

| username: Hacker_小峰 | Original post link

One more thing: don’t forget to back up your data in a timely manner and verify the usability of the data after restoration.

| username: tidb狂热爱好者 | Original post link

To ensure the stable operation of TiDB in production, the following measures can be taken:

  1. First, set parameters and prepare an appropriate deployment plan (machines, memory, SSDs, etc.) in advance based on business traffic and data volume.

  2. Design reasonable tables and indexes before business goes online, and pay attention to distinguishing and isolating AP and TP traffic.

  3. During business development, conduct regular training and strictly follow the developer manual content (专栏 - TiDB 数据库开发规范 | TiDB 社区). Any new business SQL going online must be reviewed and approved by administrators.

  4. If conditions permit, establish a pre-SQL review platform to standardize such issues beforehand; afterward, regularly analyze cluster slow queries and performance inspections, locate corresponding business SQL, and optimize them or the business.

  5. Improve monitoring.

Ensuring the stable operation of the database is a comprehensive and systematic task, and there are many aspects that can be continuously summarized in practice.

Good method.

| username: tidb狂热爱好者 | Original post link

Actually, TiDB has a write amplification issue.

| username: tidb狂热爱好者 | Original post link

This has always been a point of attack by peers. If the above work is done well, it will not cause problems.