Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 做dba的故障大家来说说怎么防范 (DBA failures: let's talk about how to prevent them).
【TiDB Usage Environment】Production Environment
【TiDB Version】
【Encountered Issues】
Issue 1
A developer deployed a SQL statement that suddenly brought down the production database.
Issue 2
While relocating the servers, the operations team deleted all of the TiKV nodes.
Issue 3
The data center lost power, and PD couldn’t start up.
Isn’t it bizarre? All these strange things happened to me.
【Reproduction Path】What operations were done that caused the issue
【Problem Phenomenon and Impact】
【Attachments】
Please provide the version information of each component, such as cdc/tikv, which can be obtained by running `cdc version` / `tikv-server --version`.
I originally wanted to slack off and write some code, but ended up being extremely busy, spending all the time I could have used for dating on TiDB instead.
In banks, management is generally done as follows:
- Implement disaster recovery and high availability at the architecture level.
- Include an SQL review step in the business deployment process.
- Monitor the production database closely to promptly detect slow queries or faults.
- Manage the permissions of operations personnel properly and authorize work through an operations work order approval process (a minimal sketch follows after this list).
Generally, it is important to implement proper separation of duties and have dual review for critical operations.
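For the permission and monitoring bullets above, here is a minimal sketch of what this can look like in TiDB; the account name, password placeholder, and time window are illustrative, not from the original post:

```sql
-- Limit the operations account to what routine work actually needs,
-- instead of a blanket ALL/SUPER grant (account name is illustrative).
CREATE USER IF NOT EXISTS 'ops_ro'@'%' IDENTIFIED BY '********';
GRANT SELECT, PROCESS ON *.* TO 'ops_ro'@'%';

-- Spot-check recent slow queries from TiDB's built-in slow query table
-- so problems surface before users report them.
SELECT time, query_time, user, query
FROM INFORMATION_SCHEMA.SLOW_QUERY
WHERE time > NOW() - INTERVAL 1 DAY
ORDER BY query_time DESC
LIMIT 20;
```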
Question 1:
This is about establishing audit rules: everything must be reviewed before going live. Use tools if they are available; if not, find another way, even if that means manual review.
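If no dedicated audit tool is available, even a manual check of the execution plan before approval catches the most obvious problems. A minimal sketch, assuming MySQL-compatible SQL and made-up table and column names:

```sql
-- Before approving a statement for production, look at its plan:
-- a full table scan on a large table is a red flag for the reviewer.
EXPLAIN
SELECT order_id, status
FROM orders                 -- hypothetical table
WHERE customer_id = 42;     -- should hit an index, not scan the table
```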
Question 2:
Changes must go through the relevant review process. Someone should be able to recognize whether these commands are high-risk; if nobody can, it means everyone is being careless, and problems are bound to happen sooner or later, even if most of the time they don't.
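One common safeguard for statements a reviewer flags as high-risk (anything that rewrites or removes data) is to run them inside an explicit transaction and sanity-check the result before committing. A sketch with hypothetical names:

```sql
BEGIN;
-- Hypothetical cleanup that a reviewer has flagged as high-risk.
DELETE FROM orders WHERE created_at < '2020-01-01';
-- Sanity check: inside this transaction the deleted rows should be gone.
SELECT COUNT(*) FROM orders WHERE created_at < '2020-01-01';
-- COMMIT;   -- only after the second reviewer confirms the result
ROLLBACK;    -- default to rolling back if anything looks wrong
```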
Question 3:
The infrastructure should have UPS, generators, and so on. If those are in place and still fail, database disaster recovery has to be considered. Compared with the previous issues, this one is less likely to happen.
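On the disaster recovery side, a regularly scheduled full backup kept off the cluster is the last line of defense. A minimal sketch, assuming a TiDB version that supports the BACKUP/RESTORE SQL statements and an illustrative storage path:

```sql
-- Full backup to shared storage mounted on every TiKV node (path is illustrative).
BACKUP DATABASE * TO 'local:///mnt/backup/tidb-full';

-- After a disaster, restore from the same location:
-- RESTORE DATABASE * FROM 'local:///mnt/backup/tidb-full';
```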
We still need everyone to brainstorm.
The first two points are beyond doubt. Many companies do this, and so do we.
First, when the incident occurred, the auditing tool was already in place, but the SQL was embedded inside the Spring Boot application.
Second, our process is definitely more standardized than at most companies, since payment companies are subject to more attacks.
Failures of the system software's internal mechanisms: unavoidable; do a solid job at every stage, including selection, testing, delivery, and maintenance.
Failures caused by human operation: manage them through access control, fine-grained permission management, operation records, standardized operating procedures, and dual-person (A/B) review.
This kind of crash actually has little to do with the machines; it is more of a black swan event. You need a process in place to avoid such failures.
In summary, just two words: standardization + high availability.
We now anonymize the data and import it into the pre-production environment; anything that needs to go live must first be verified there.
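A minimal sketch of what that anonymization pass can look like before the import; the table and column names are hypothetical, and real masking is usually driven by a data-classification list rather than hand-written statements:

```sql
-- Mask personally identifiable fields on the pre-production copy
-- (intentionally no WHERE: every row in the copy gets masked).
UPDATE customers
SET name  = CONCAT('customer_', id),
    email = CONCAT('user', id, '@example.com'),
    phone = CONCAT('138', LPAD(id, 8, '0'));
```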
The simplest part, as I see it, is that the code in the Yunxiao DevOps system synchronizes automatically.
I've encountered a case where the operations team saw that server disk space was running low, decided that the Oracle data files were taking up the most space, and deleted them all.
I have encountered this as well.
This topic was automatically closed 1 minute after the last reply. No new replies are allowed.