Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: 【TiDBer 唠嗑茶话会 60 】研发写了个 SQL 查询 OLAP 库,放到首页上了,因为早高峰登录的并发把 OLAP 库压死了,怎么措辞给领导汇报这件事?
In this chat session, we want TiDBers to showcase their linguistic skills and discuss workplace phrasing. If a developer wrote an SQL query for the OLAP database and placed it on the homepage, causing the OLAP database to crash due to high concurrent logins during peak hours, how should this be reported to the leadership?
This Topic:
A developer wrote an SQL query for the OLAP database and placed it on the homepage. The OLAP database crashed due to high concurrent logins during peak hours. How should this be phrased when reporting to the leadership?
This Period’s Rewards:
Participation Award:
Participate in the topic discussion to earn 30 points~
Event Duration:
2023.2.24-2023.3.3
To verify the concurrent performance of the OLAP database, a simple test was conducted. The experiment demonstrated that the OLAP database has limited support for concurrency. For future requirements with higher concurrency, it should be used cautiously and concurrency should be strictly controlled.
Leader,
As our business continues to develop and our user base continues to grow, the software has recently been adjusted accordingly and put into trial use. However, during peak hours in the morning, the database cannot withstand the access pressure. Based on this incident, we can conclude that the current hardware can no longer provide users with a perfect experience, and there is an urgent need to upgrade the hardware.
After soliciting opinions from various parties, we have finally formed a recommendation to choose the distributed database TiDB. The reasons are as follows:
1. TiDB is an open-source distributed relational database independently designed and developed by PingCAP. It is a hybrid distributed database product that supports both online transaction processing (OLTP) and online analytical processing (OLAP). Additionally, it is a domestic database that has long been ranked first.
2. It supports horizontal scaling during peak periods to support business operations and downsizing during off-peak periods to save costs.
3. It can stably support both OLAP and OLTP operations simultaneously, preparing for future business upgrades.
4. It is compatible with MySQL 5.7 protocol and MySQL ecosystem and supports deployment both locally and in the cloud.
The boss has good news and bad news. The good news is that the website traffic has increased, but the bad news is that the database couldn’t handle the increased traffic! Now everyone is working on optimization!
Invite the developers to sincerely hold an apology meeting, and colleagues will have the opportunity to drink milk tea again.
Boss, our database cannot handle OLAP writes in high concurrency scenarios.
Dear Leader,
I would like to report an issue to you. Our R&D department recently released a feature for querying the SQL OLAP database on the homepage. During the morning peak hours, due to the high concurrent login volume, the OLAP database was overwhelmed, causing users to be unable to use the feature normally.
We are working hard to resolve this issue, including optimizing the performance of the OLAP database and improving concurrent login handling. We have taken measures to ensure that similar situations do not occur again and will continue to monitor and test to ensure the reliability and availability of this feature.
------ From ChatGPT
Incident Situation: Homepage query function experienced a large number of white screens during the morning peak.
Incident Cause: The impact of SQL query performance and concurrency on the OLAP database was not considered, leading to a high concurrency avalanche.
Incident Level: P0
Temporary Solution: Due to the use of TiDB, quickly scale out TiFlash nodes to handle the concurrency or temporarily take the function offline using a feature toggle.
Follow-up Rectification Plan: All SQL going live must be reviewed by the audit platform to assess SQL performance and concurrency after going live.
Accident reports should focus on being truthful and objective. Clearly explain the causes, measures already taken, and preventive measures.
Dear Leader,
During yesterday’s morning rush hour, one of our SQL query operations overwhelmed the OLAP database, causing it to be unresponsive to query requests. After analysis, we found that this was due to the high concurrency of the SQL query, which exceeded the maximum processing capacity of the OLAP database. We sincerely apologize for the inconvenience and impact this has caused the company.
To address this issue, we have taken the following measures:
-
We have optimized the SQL query in question to make it faster in the OLAP database, reducing query time.
-
We have upgraded and optimized the OLAP database to improve its concurrency capacity and stability.
-
We will further strengthen monitoring and alert mechanisms to promptly identify and resolve potential performance issues.
We take this matter very seriously and will take all necessary measures to ensure that such incidents do not occur again. At the same time, we welcome valuable opinions and suggestions from leaders and colleagues to help us better improve and optimize database applications.
Thank you.
Daily inspection event: Leader, our database SQL query OLAP database experienced a failure in the morning.
Cause: To fully assess the system’s SQL response to high-concurrency, high-traffic requests in a short period.
Risk assessment level: High risk
Solution: Contact TiDB engineers for assistance via phone. Extend warranty.
Hello, Leader. The traffic for our business is quite high today, and the excessive concurrency has overwhelmed the OLAP database. We can query the SQL and optimize it to make the query speed in the OLAP database faster. In terms of performance, we need to check the hardware equipment to see if we need to add some nodes and then perform load balancing.
Seek truth from facts, propose preventive measures, and provide rectification suggestions.
Leader, I have a recent requirement that has been implemented and scheduled for execution in the morning. Due to insufficient preliminary performance testing and rapid user growth, the resources consumed by the task execution have significantly increased, affecting today’s OLAP cluster. I will promptly coordinate with operations to restore the cluster and optimize the task’s SQL. Additionally, I will enhance performance testing and, based on the test results and requirements, determine if dynamic resource allocation is necessary.
Boss, how about we just run away! Just kidding, let’s identify the cause, propose optimization suggestions and solutions for the leadership to choose and decide!