Tang Liu: Reflections on Product Quality - How to Evaluate Quality

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 唐刘:关于产品质量的思考 - 如何评估质量

| username: TiDB社区小助手

Introduction

In the previous article, “Thoughts on Product Quality - My Basic Understanding,” the author shared personal experiences and insights on product quality: high-quality products are not only ensured through testing but also through continuous refinement and improvement in real-world scenarios. This article is the second in the “Thoughts on Product Quality” series and uses the TiDB product release as an example to explore how to evaluate product quality. The article points out the pitfalls of evaluating quality solely based on the number of bugs found and introduces some effective evaluation methods, emphasizing the importance of deeply understanding customer business scenarios.

Every time a TiDB version is released, frontline business teams or customers inevitably ask me, “Is the quality of this version good?” Each time I hear this question, I feel rather helpless, because it is hard to give a satisfying answer. But being asked it over and over has naturally pushed me to think about how to evaluate the quality of a version.

Before we begin, here are a few disclaimers:

  • What I say may not necessarily be correct. I also periodically refresh my own understanding.
  • These are merely my own thoughts on quality, drawn from my experience at PingCAP, and they may not apply to other companies.
  • What I discuss here are just some aspects of PingCAP’s quality evaluation; we have more evaluation dimensions and metrics internally.

1. Number of Leaked Bugs

I often discuss product quality issues with many friends, and sometimes I hear a very common response, “Isn’t it simple? Just look at how many bugs customers encounter.” This isn’t entirely unreasonable, but as I mentioned in “Thoughts on Product Quality - My Basic Understanding,” having many bugs doesn’t necessarily mean poor quality.

Here, I want to emphasize another point: the bugs people usually talk about are leaked bugs, that is, bugs that have already escaped to customers. Evaluating a product’s quality by the number of leaked bugs is already quite delayed, especially for a product like a database. From my years of observation, users are generally reluctant to upgrade their databases; some financial customers in China adopt new versions on a yearly cadence, which makes the feedback cycle very long. Offering new TiDB versions as a cloud service can speed up this feedback loop, but some delay remains.

However, this metric is still very significant and valuable. Leaked bugs are a lagging indicator for the current released version, but they are a real leading indicator for the next version to be released. In other words, in the next released version, we should try to fix the bugs leaked from the previous version. If we don’t do this, quality can easily get out of control.

2. Bug Convergence

We mentioned the number of leaked bugs above. In reality, while a version is still in development, we also find many bugs through our own testing. If these bugs are not fixed, they still hurt product quality. To assess quality risk, we usually look at whether bugs are converging. Let me briefly explain bug convergence. Generally, there are two curves for bugs: one for opened bugs and one for closed bugs. For a rapidly iterating system, the number of opened bugs is usually greater than the number of closed bugs. Over time, if this gap keeps growing without any sign of convergence and is not held within a small range, we consider the overall product quality to be at risk.
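
To make this check concrete, here is a minimal sketch in Python of how the two curves and their gap could be tracked. The issue records and the “opened”/“closed” fields are hypothetical stand-ins for an export from a bug tracker; this is only one way to interpret the convergence idea, not any actual PingCAP tooling.

```python
# Minimal sketch of a bug-convergence check over hypothetical issue records.
from datetime import date

def counts_as_of(issues, day):
    """Cumulative number of bugs opened and closed up to a given date."""
    opened = sum(1 for i in issues if i["opened"] <= day)
    closed = sum(1 for i in issues if i["closed"] and i["closed"] <= day)
    return opened, closed

def is_converging(issues, checkpoints):
    """Treat the gap (opened - closed) as converging if it is not still growing."""
    gaps = [opened - closed
            for opened, closed in (counts_as_of(issues, d) for d in checkpoints)]
    return gaps[-1] <= max(gaps[:-1]) if len(gaps) > 1 else True

issues = [
    {"opened": date(2024, 1, 5),  "closed": date(2024, 2, 1)},
    {"opened": date(2024, 1, 20), "closed": None},
    {"opened": date(2024, 3, 2),  "closed": date(2024, 3, 10)},
]
print(is_converging(issues, [date(2024, 2, 1), date(2024, 3, 1), date(2024, 4, 1)]))
```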

Kubernetes is an example of a product that does this well: after a few years of releases, the number of closed bugs exceeded the number of open bugs. As shown in the figure:

Reference: https://ossinsight.io/analyze/kubernetes/kubernetes#issues

Another excellent example is vscode. Almost from the beginning, the difference between the two curves has been controlled within a small range:

Reference: https://ossinsight.io/analyze/microsoft/vscode#issues

For TiDB, starting from version 7.5, we strictly control bug convergence when releasing LTS versions. For earlier 6.x versions, we needed several patch releases before the closed-bug curve crossed above the open-bug curve; by the time we released 7.5, we could ensure bug convergence at release time. Still, I want to emphasize that bug convergence does not mean there are no bugs; it only shows that quality has not deteriorated.

In the future, for the TiDB product, we will start controlling bug convergence in the mainline version. We have an ambitious goal: to truly achieve mainline releases for TiDB. We believe that bug convergence is a necessary condition for mainline releases.

3. Bug Clustering Effect

“Bug clustering effect” is a term I coined myself, so it may sound confusing. Here is an example that tends to leave a deep impression:

Usually, when we see a cockroach at home, we can already anticipate that there are many more cockroaches at home.

Note: Since directly posting a real image might cause discomfort for most people, a cute image is used instead.

In my understanding, bugs are similar: “when we find a bug in a module during testing, there is a high probability that more bugs are hiding in that module.”

I am not sure if there is any theoretical basis for this understanding; it is just a fun realization based on my years of experience. Of course, based on this understanding, we can sometimes derive an even more interesting realization:

“When a development team’s previous feature had many bugs, its next feature is also likely to have quite a few.” Of course, this improves over time: as the development team grows and matures, the quality of the features it builds becomes much better assured.

Therefore, when releasing a version with limited resources, if we want to keep its quality under control, we should focus our testing on the modules where bugs were found before and on the features built by teams that have produced many bugs recently.
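
As a toy illustration of this prioritization (not any actual PingCAP tooling), one could simply rank modules by how many bugs were recently found in them and spend test effort top-down; the module names and counts below are invented.

```python
# Toy prioritization based on the clustering heuristic: modules with the most
# recently found bugs get tested first. Module names and counts are invented.
from collections import Counter

recent_bug_modules = ["optimizer", "optimizer", "ddl", "optimizer", "txn", "ddl"]

def test_focus(bug_modules, top_n):
    """Return the top_n modules with the most recent bugs, most suspicious first."""
    return [module for module, _ in Counter(bug_modules).most_common(top_n)]

print(test_focus(recent_bug_modules, top_n=2))  # ['optimizer', 'ddl']
```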

4. Feature Bandwidth Allocation and T-shirt Size

After discussing bugs, let’s talk about features. The previous article, “Thoughts on Product Quality - My Basic Understanding,” mentioned that, to some extent, the more features we develop, the more bugs there will be, and this does not depend on the will of the development engineers. Of course, we cannot stop developing features; otherwise, we would lose long-term competitiveness.

So, the first thing we need to do is control the number of features, striving to balance competitiveness and quality. At PingCAP, our R&D leaders reach a consensus with PMs and then evaluate the team’s bandwidth investment for a period of time. For example, 40% of the team’s bandwidth is invested in developing new features, 40% in quality improvement and architectural refactoring, and the remaining bandwidth is used for on-call, customer support, or personal growth-related matters. At PingCAP, different R&D teams have different bandwidth allocation ratios at different stages.

After planning the bandwidth allocation, the R&D leader uses traditional T-shirt sizing to estimate how many person-days a feature needs. For example, a feature sized XL means roughly one person-month.

Once this planning is done, we can look for quality risks from a macro perspective. The following situations, among others, are warning signs (a toy checklist sketch follows the list):

  • A large number of features, especially features above XL size
  • A development engineer involved in multiple features, especially features above XL size
  • The person in charge of a feature above XL size is a relatively junior development engineer
  • A team, especially a team working on the TiDB kernel, has a high bandwidth ratio for feature development, such as over 60%

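Here is the toy checklist sketch mentioned above. It only illustrates how these heuristics could be written down; the size-to-person-day table, the feature records, and the thresholds are assumptions made for the example, not PingCAP’s actual process or tooling.

```python
# Hypothetical checklist for the macro-level risk signals listed above.
from collections import Counter

SIZE_DAYS = {"S": 3, "M": 8, "L": 15, "XL": 22, "XXL": 44}  # XL ~ one person-month

features = [
    {"name": "feature-a", "size": "XL", "owner_level": "junior", "engineers": ["e1", "e2"]},
    {"name": "feature-b", "size": "M",  "owner_level": "senior", "engineers": ["e1"]},
]

def risk_flags(feature, team_feature_bandwidth):
    flags = []
    if SIZE_DAYS[feature["size"]] >= SIZE_DAYS["XL"]:
        flags.append("XL-or-larger feature")
        if feature["owner_level"] == "junior":
            flags.append("large feature owned by a junior engineer")
    if team_feature_bandwidth > 0.6:
        flags.append("team spends more than 60% of bandwidth on features")
    return flags

# Engineers spread across several large features are another warning sign.
load = Counter(e for f in features
               if SIZE_DAYS[f["size"]] >= SIZE_DAYS["XL"] for e in f["engineers"])
overloaded = [e for e, n in load.items() if n > 1]

for f in features:
    print(f["name"], risk_flags(f, team_feature_bandwidth=0.4))
print("engineers on multiple large features:", overloaded)
```
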
Of course, if we zoom in on a specific feature, an unclear feature specification, a missing test plan, or a large amount of PR code change are all quality risks for that feature itself, and we need to pay attention to them as well. We can discuss this further later.

5. Test Coverage of Customer Scenarios

I have a dream: “I have unlimited resources to write unlimited test cases, covering all customer scenarios, so TiDB will have almost no bugs.” This dream is so grand that it makes me fully aware that I am daydreaming.

So, how can we more efficiently add test cases to cover more customer scenarios, ensuring that the versions we release can work normally most of the time and not scare customers? This is indeed a very challenging task.

Fortunately, the 80/20 rule still applies here: at PingCAP, 20% of customers contribute almost 80% of the issues (including not only bugs but also on-call incidents and so on). We have also found that the scenarios of these 20% of customers also apply to other customers in other industries.

This gives us a good guideline: with limited resources, as long as we deeply understand the current business scenarios of those top 20% of customers and build test cases around them, we can at least ensure that most scenarios at the current stage do not break. The higher our test coverage of these 20%-customer scenarios, the more confident we are in quality. How to simulate business scenarios is something we can discuss separately. Once we feel the coverage of these scenarios is good, we will gradually accumulate more scenarios.
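
As a rough illustration of this coverage metric (the customer names, issue counts, and scenarios below are invented), one could weight each key customer’s scenarios by how many issues that customer has historically reported, then measure what fraction of that weight our existing test scenarios cover:

```python
# Hypothetical weighted coverage of key-customer scenarios.
key_customers = {
    # customer: (issues reported, business scenarios)
    "bank-a": (120, {"high-concurrency oltp", "nightly batch import"}),
    "saas-b": (45,  {"multi-tenant ddl", "high-concurrency oltp"}),
}
covered = {"high-concurrency oltp"}

def weighted_coverage(customers, covered_scenarios):
    total_weight = covered_weight = 0.0
    for issues, scenarios in customers.values():
        total_weight += issues
        covered_weight += issues * len(scenarios & covered_scenarios) / len(scenarios)
    return covered_weight / total_weight

print(f"scenario coverage: {weighted_coverage(key_customers, covered):.0%}")  # 50%
```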

Another fortunate discovery is that many of our current bugs come from new features being directly used by important customers, especially in North America. To some extent, this is a good thing, as it shows that many customers are willing to try our new versions directly. So, when developing new features in the future, we will deeply cooperate with these customers, understand their business scenarios, and add test coverage to ensure the quality of newly released features.

Conclusion

The above are just some of my personal, intuitive perspectives on evaluating product quality. In reality, the metrics we use at PingCAP to measure product quality go well beyond the points above, because our product is a database and the quality requirements are very high.

I have only discussed how I evaluate product quality from the perspectives of testing and bugs, without touching on code. In my understanding, code with high complexity is likely to have poor quality and is likely to have bugs. We can discuss the relationship between code complexity and quality later.

Regarding quality, I have another insight: even for the same TiDB version, we may hear different feedback from different customers, and even different feedback from the same customer. This is not surprising. Different customers may have different scenarios, and even the same customer may have different business scenarios. Currently, TiDB does not cover all possible scenarios, and we can only gradually supplement test cases for different scenarios.

Finally, let me mention a data point from Oracle. I once asked some colleagues from Oracle, “How many customer-scenario tests do you have at Oracle?” Most answers were over 200. The number surprised me greatly, although it might not be entirely accurate. But if it is true, those hundreds of business-scenario test cases have likely abstracted Oracle’s customer scenarios to a great extent and cover most of its customer base. This is the current gap between TiDB and Oracle in terms of test scenarios, and we can only accumulate experience gradually and strive to catch up.