Let's discuss how to prevent failures as a DBA

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 做dba的故障大家来说说怎么防范.

| username: tidb狂热爱好者

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Encountered Issues】
Issue 1
A developer deployed a SQL statement that suddenly brought down the production database.
Issue 2
While relocating servers, the operations team deleted all the TiKV nodes.
Issue 3
The data center lost power, and PD could not start up afterwards.

Isn’t it bizarre? All these strange things happened to me.

| username: tidb狂热爱好者 | Original post link

I had hoped that writing code would let me take it easy, but I ended up extremely busy.

| username: tidb狂热爱好者 | Original post link

All the time I could have spent dating is being spent on TiDB instead.

| username: ddhe9527 | Original post link

In banks, management is generally done as follows:

  1. Implement disaster recovery and high availability at the architecture level.
  2. Include an SQL review step in the business deployment process.
  3. Monitor the production database closely to promptly detect slow queries or faults.
  4. Manage the permissions of operations personnel properly and authorize through an operations work order approval process.

| username: cs58_dba | Original post link

Generally, it is important to implement proper separation of duties and have dual review for critical operations.

| username: xuexiaogang | Original post link

Question 1:
This is about setting audit rules. Before going live, everything must be audited. Use tools if available; if not, find a way, even if it means manual review.
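
As a rough illustration of what an audit rule could look like when no dedicated tool is available, here is a minimal Python sketch. The rule names and the `review_sql` helper are hypothetical, the statement splitting is deliberately naive, and a check like this only sees SQL that is handed to it as text, so it would not catch statements generated inside application code.

```python
import re

def _starts_with(stmt: str, keywords: str) -> bool:
    """True if the statement begins with one of the '|'-separated keywords."""
    return re.match(rf"(?is)^\s*({keywords})\b", stmt) is not None

def _write_without_where(stmt: str) -> bool:
    """DELETE or UPDATE that contains no WHERE keyword at all."""
    return _starts_with(stmt, "delete|update") and \
        re.search(r"(?i)\bwhere\b", stmt) is None

# Hypothetical rule set: (rule name, predicate over a single statement).
RULES = [
    ("DELETE/UPDATE without WHERE", _write_without_where),
    ("DROP or TRUNCATE", lambda s: _starts_with(s, "drop|truncate")),
]

def review_sql(script: str) -> list[tuple[str, str]]:
    """Return (rule name, statement) for every statement that trips a rule."""
    findings = []
    # Naive split on ';' (assumption: no semicolons inside string literals).
    for stmt in (part.strip() for part in script.split(";")):
        if not stmt:
            continue
        for name, predicate in RULES:
            if predicate(stmt):
                findings.append((name, stmt))
    return findings

if __name__ == "__main__":
    script = "DELETE FROM orders; UPDATE t SET a = 1 WHERE id = 2; drop table x;"
    for name, stmt in review_sql(script):
        print(f"[{name}] {stmt}")
```

A real review would also need to look at execution plans and data volumes, which a text-only check cannot do.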

Question 2:
Changes must go through the relevant review process, and someone in that process should be able to tell whether a command is high-risk. If nobody can, everyone is being careless and problems are bound to happen; with a proper review, they generally won’t.

Question 3:
The infrastructure should have UPS, generators, and so on. If those are in place and power is still lost, database disaster recovery needs to be considered. Compared to the previous two issues, this one is less likely to happen.

| username: tidb狂热爱好者 | Original post link

We still need everyone to brainstorm.

| username: xuexiaogang | Original post link

The first two points are beyond doubt. Many companies do this, and so do we.

| username: tidb狂热爱好者 | Original post link

First, the auditing tool was already in place when the incident occurred, but the SQL was embedded in the Spring Boot application code.
Second, our process is definitely more standardized than at most companies, since payment companies are subject to more attacks.

| username: ablewang_xiaobo | Original post link

  1. Before deploying SQL in the production environment, it is best to test it in a testing environment. If there is no testing environment, it is advisable to conduct a review.
  2. In a TiDB cluster, if you scale in TiKV nodes, those nodes go into an offline state, but the cluster remains usable. You just need to add new nodes, let the data migrate to them, and then remove the offline ones. If the nodes are physically deleted, recovery is only possible from backups, so backups are very important.
  3. After a power outage in the data center, if the power comes back and automatic startup is enabled, the cluster will automatically start. If it doesn’t start, manual intervention is required. These issues are unavoidable and highlight the importance of a DBA.
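
The scale-in point above suggests a simple sanity check before anyone removes TiKV nodes. The Python sketch below is only an illustration under stated assumptions: `can_remove_stores` is a hypothetical helper, and it assumes the cluster keeps 3 replicas of each piece of data on distinct TiKV stores (the usual default), so at least that many stores must remain after the removal.

```python
def can_remove_stores(total_stores: int, to_remove: int, max_replicas: int = 3) -> bool:
    """Return True if enough stores remain to hold every replica.

    Assumption: each replica of a piece of data must live on a different
    store, so the cluster needs at least `max_replicas` stores left.
    """
    if total_stores < 0 or to_remove < 0 or to_remove > total_stores:
        raise ValueError("to_remove must be between 0 and total_stores")
    return total_stores - to_remove >= max_replicas

if __name__ == "__main__":
    # Removing 2 of 5 stores leaves 3, which still fits 3 replicas.
    print(can_remove_stores(5, 2))
    # Removing every store, as in Issue 2, can never pass the check.
    print(can_remove_stores(3, 3))
```

A check like this could sit in the change-approval step, so that a request to remove all nodes is rejected before it reaches the servers.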

| username: 啦啦啦啦啦 | Original post link

  1. Shifting blame to developers
  2. Shifting blame to operations
  3. Shifting blame to data center operations
    :joy:

| username: Mark | Original post link

Failures in the internal mechanisms of system software: unavoidable, so be rigorous at every stage, including selection, testing, delivery, and maintenance.

Human operation failure: manage through access control, detailed permission management, operation records, standardized operation procedures, and AB review.
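
To make the review idea concrete, here is a minimal Python sketch of a two-person rule. It assumes that "AB review" means a second person must sign off before an operation runs; the `ChangeRequest` class and its fields are hypothetical and not taken from any particular work order system.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    """A requested operation that needs independent approval before it runs."""
    author: str
    command: str
    approvers: set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The person who requested the change may not approve it.
        if reviewer == self.author:
            raise PermissionError("author cannot approve their own change")
        self.approvers.add(reviewer)

    def can_execute(self, required: int = 1) -> bool:
        # `required` counts approvers other than the author.
        return len(self.approvers) >= required

if __name__ == "__main__":
    request = ChangeRequest(author="alice", command="remove TiKV node")
    print(request.can_execute())  # no reviewer has signed off yet
    request.approve("bob")
    print(request.can_execute())  # one independent reviewer has approved
```

The set of approvers also doubles as a simple operation record of who signed off on what.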

| username: tidb狂热爱好者 | Original post link

This kind of crash actually has little to do with the machines; it is more of a black swan event. A process is needed to prevent such failures.

| username: Jiawei | Original post link

In summary, just two words: standardization + high availability.

| username: cs58_dba | Original post link

We now anonymize the data and import it into the pre-production environment. Anything that is going live must first be verified in pre-production.
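
One possible way to anonymize rows before loading them into pre-production is keyed hashing, sketched below in Python. This is an assumption about the approach, not a description of the poster's actual pipeline; `SECRET_KEY`, `mask_value`, and `anonymize_row` are hypothetical names, and the key would need to be kept outside the pre-production environment.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets store.
SECRET_KEY = b"replace-with-a-real-secret"

def mask_value(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a value with a keyed hash, truncated to 16 hex characters.

    The same input always maps to the same output, so joins between
    tables still line up after masking.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def anonymize_row(row: dict, sensitive_columns: set[str]) -> dict:
    """Return a copy of the row with the sensitive columns masked."""
    return {
        column: mask_value(str(value))
        if column in sensitive_columns and value is not None
        else value
        for column, value in row.items()
    }

if __name__ == "__main__":
    row = {"id": 1, "phone": "13800000000", "amount": 25}
    print(anonymize_row(row, {"phone"}))
```

Keyed hashing keeps the data shape realistic for testing SQL, though columns whose format matters (for example, phone number length) may need format-preserving masking instead.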

| username: tidb狂热爱好者 | Original post link

The simplest setup I have seen on the business side is one where the code in the Yunxiao DevOps system is synchronized automatically.

| username: forever | Original post link

I’ve seen a case where the operations team noticed the server was short on disk space, decided that the Oracle data files were taking up the most room, and deleted them all. :sweat_smile:

| username: alfred | Original post link

Backup is indispensable.

| username: tidb狂热爱好者 | Original post link

I have encountered this as well.
