Let's discuss how to prevent failures as a DBA

translator_bot · June 23, 2024, 10:17am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 做dba的故障大家来说说怎么防范.

| username: tidb狂热爱好者

【TiDB Usage Environment】Production Environment
【TiDB Version】
【Encountered Issues】
Issue 1
A developer deployed a SQL statement that suddenly crashed the production database.
Issue 2
Operations moved the server and deleted all TiKV nodes.
Issue 3
The data center lost power, and PD couldn’t start up.

Isn’t it bizarre? All these strange things happened to me.


translator_bot · June 23, 2024, 10:17am

| username: tidb狂热爱好者 | Original post link

I originally wanted to slack off by writing code, but ended up being extremely busy.

translator_bot · June 23, 2024, 10:17am

| username: tidb狂热爱好者 | Original post link

All the time I could have spent dating is being spent on TiDB instead.

translator_bot · June 23, 2024, 10:17am

| username: ddhe9527 | Original post link

In banks, management is generally done as follows:

Implement disaster recovery and high availability at the architecture level.
Include an SQL review step in the business deployment process.
Monitor the production database closely to promptly detect slow queries or faults.
Manage the permissions of operations personnel properly and authorize through an operations work order approval process.
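
As a rough illustration of the SQL review step, a pre-deployment check could flag obviously risky statements before a human reviewer looks at them. This is only a sketch under assumptions, not a tool mentioned in this thread: the rule set and the function name `review_sql` are hypothetical, and a real audit tool would parse the SQL rather than rely on regular expressions.

```python
import re

# Hypothetical rule set for illustration only; real audit tools parse the SQL.
RISK_RULES = [
    (re.compile(r"^\s*(drop|truncate)\s", re.I), "destructive DDL"),
    (
        re.compile(r"^\s*(update|delete)\b(?!.*\bwhere\b)", re.I | re.S),
        "UPDATE/DELETE without WHERE",
    ),
]


def review_sql(statement: str) -> list[str]:
    """Return the descriptions of every rule the statement violates."""
    return [reason for pattern, reason in RISK_RULES if pattern.search(statement)]


if __name__ == "__main__":
    for sql in (
        "DELETE FROM orders",
        "DELETE FROM orders WHERE id = 1",
        "DROP TABLE orders",
    ):
        print(sql, "->", review_sql(sql) or "ok")
```

A check like this only catches the crudest mistakes; it would sit in front of, not replace, the manual review and approval steps listed above.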

translator_bot · June 23, 2024, 10:17am

| username: cs58_dba | Original post link

Generally, it is important to implement proper separation of duties and have dual review for critical operations.

translator_bot · June 23, 2024, 10:17am

| username: xuexiaogang | Original post link

Issue 1:
Set audit rules: everything must be audited before going live. Use tools if available; if not, find another way, even if that means manual review.

Issue 2:
Changes must go through a review process, and the reviewers should be able to tell whether a command is high-risk. If no one can, everyone is being careless and problems are bound to happen, though that is generally not the case.

Issue 3:
Infrastructure should have UPS, generators, and so on. If those are in place and still fail, database disaster recovery needs to be considered. Compared with the previous issues, this one is less likely to happen.

translator_bot · June 23, 2024, 10:17am

| username: tidb狂热爱好者 | Original post link

We still need everyone to brainstorm.

translator_bot · June 23, 2024, 10:17am

| username: xuexiaogang | Original post link

The first two points are beyond doubt. Many companies do this, and so do we.

translator_bot · June 23, 2024, 10:17am

| username: tidb狂热爱好者 | Original post link

First, when the situation occurred, the auditing tool was already in place, and the SQL was embedded within the Spring Boot framework.
Second, the process is definitely more standardized than most companies, as payment companies are subject to more attacks.

translator_bot · June 23, 2024, 10:17am

| username: ablewang_xiaobo | Original post link

Before deploying SQL in the production environment, it is best to test it in a testing environment. If there is no testing environment, it is advisable to conduct a review.
In a TiDB cluster, if you scale in TiKV nodes, they go into an offline state, but the cluster remains usable. You just need to add new nodes, let the data migrate to them, and then remove the offline ones. If the nodes are physically deleted, recovery is only possible from backups, so backups are very important.
After a power outage in the data center, if the power comes back and automatic startup is enabled, the cluster will automatically start. If it doesn’t start, manual intervention is required. These issues are unavoidable and highlight the importance of a DBA.
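
The scale-in case above only works if enough TiKV stores remain to hold every replica. Below is a minimal sketch of a pre-check, assuming the default of 3 replicas (PD's `max-replicas` setting). The helper name `can_scale_in` is hypothetical and is not a TiDB API.

```python
def can_scale_in(current_stores: int, stores_to_remove: int, max_replicas: int = 3) -> bool:
    """Return True if enough TiKV stores would remain to hold every replica."""
    if stores_to_remove < 0 or stores_to_remove > current_stores:
        raise ValueError("stores_to_remove must be between 0 and current_stores")
    # Each Region needs max_replicas distinct stores to place its replicas on.
    return current_stores - stores_to_remove >= max_replicas


if __name__ == "__main__":
    print(can_scale_in(5, 2))  # 3 stores remain for 3 replicas
    print(can_scale_in(3, 1))  # only 2 stores would remain
```

Under this assumption, removing a node from a three-node TiKV cluster without first adding a replacement would leave the replicas with nowhere to migrate.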

translator_bot · June 23, 2024, 10:17am

| username: 啦啦啦啦啦 | Original post link

Shifting blame to developers
Shifting blame to operations
Shifting blame to data center operations

translator_bot · June 23, 2024, 10:17am

| username: Mark | Original post link

System software internal mechanism failure: unavoidable, so strive for excellence at every stage, including selection, testing, delivery, and maintenance.

Human operation failure: manage through access control, fine-grained permission management, operation records, standardized operating procedures, and two-person (A/B) review.
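
The review-before-execution idea can be sketched as a small gate in front of any high-risk command. Everything here is an assumption for illustration: the `ChangeRequest` class, its fields, and the default of one independent reviewer are hypothetical and do not describe any particular work order system.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A hypothetical operations work order awaiting approval."""

    requester: str
    command: str
    approvers: set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The person who requested the change may not approve it.
        if reviewer == self.requester:
            raise PermissionError("requester cannot approve their own change")
        self.approvers.add(reviewer)

    def may_execute(self, required: int = 1) -> bool:
        # Executable only once enough distinct reviewers have signed off.
        return len(self.approvers) >= required


if __name__ == "__main__":
    cr = ChangeRequest(requester="alice", command="scale in one TiKV node")
    print(cr.may_execute())
    cr.approve("bob")
    print(cr.may_execute())
```

Raising `required` to 2 would model a stricter policy where two reviewers besides the operator must sign off.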

translator_bot · June 23, 2024, 10:17am

| username: tidb狂热爱好者 | Original post link

This kind of crash actually has little to do with the machines; it is more of a black swan event. A process is needed to avoid such failures.

translator_bot · June 23, 2024, 10:17am

| username: Jiawei | Original post link

In summary, just two words: standardization + high availability.

translator_bot · June 23, 2024, 10:18am

| username: cs58_dba | Original post link

We now anonymize the data and import it into the pre-production environment. Anything that needs to go live must first be verified there.
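
One way to anonymize rows before loading them into pre-production is keyed, deterministic masking, so the same input always maps to the same masked value and joins still line up. This is a sketch under assumptions: the column names, the key, and the function names are hypothetical, and the thread does not say which method is actually used.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets manager.
MASK_KEY = b"replace-me"
# Assumed names of the sensitive columns.
SENSITIVE_FIELDS = {"name", "phone", "card_no"}


def mask_value(value: str) -> str:
    """Replace a value with a keyed, deterministic digest."""
    digest = hmac.new(MASK_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_row(row: dict) -> dict:
    """Return a copy of the row with sensitive, non-null fields masked."""
    return {
        key: mask_value(str(val)) if key in SENSITIVE_FIELDS and val is not None else val
        for key, val in row.items()
    }


if __name__ == "__main__":
    print(anonymize_row({"id": 1, "name": "Zhang San", "phone": None}))
```

Non-sensitive columns and NULLs pass through unchanged, so table shapes and row counts in pre-production still match production.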

translator_bot · June 23, 2024, 10:18am

| username: tidb狂热爱好者 | Original post link

The simplest part of the business I see is that the code in the Yunxiao DevOps system will automatically synchronize.

translator_bot · June 23, 2024, 10:18am

| username: forever | Original post link

I’ve encountered a situation where the operations team saw that the server was short on disk space, decided that the Oracle data files were taking up the most space, and deleted them all.

translator_bot · June 23, 2024, 10:18am

| username: alfred | Original post link

Backup is indispensable.

translator_bot · June 23, 2024, 10:18am

| username: tidb狂热爱好者 | Original post link

I have encountered this as well.

translator_bot · June 23, 2024, 10:18am

| username: system | Original post link

This topic was automatically closed 1 minute after the last reply. No new replies are allowed.