Refine error messages to facilitate troubleshooting for operations and maintenance

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 细化报错信息,方便运维排错。

| username: OnTheRoad

Feedback on Requirements
Please clearly and accurately describe the problem scenario, required behavior, and background information to facilitate timely follow-up by the product team.

[Problem Scenario Involved in the Requirement]
During the operation of TiDB, various errors can occur, and the error messages are often only vague descriptions. Official developers may be able to pinpoint the main cause at a glance from the message, but users reading the same vague message are left confused and do not know where to start.
For example, during a BR restore test in the production environment over the past few days, this error occurred: "other error: Coprocessor task terminated due to exceeding the deadline". The "other error" part is confusing. What kind of error is it? Is it like Python's exception base class (Exception), where any error can be classified as "other error"? And what is this deadline? As far as I recall, the official TiKV video tutorial did not mention it. Is it related to the disk I/O scheduler's deadline or mq-deadline? I found the definition of "Coprocessor task terminated due to exceeding the deadline" on the official GitHub.


However, it is unclear when this error will be triggered.
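
To make the Python comparison above concrete, here is a minimal sketch of the difference between a catch-all label and a specific error type. The class name DeadlineExceeded and both report functions are hypothetical, made up for illustration; this is not how TiKV or BR actually classify errors.

```python
class DeadlineExceeded(Exception):
    """Hypothetical specific error: a task ran past its time limit."""


def report_vague(exc: Exception) -> str:
    # Catch-all style: every failure collapses into the same label,
    # so the reader cannot tell which category of error occurred.
    return f"other error: {exc}"


def report_specific(exc: Exception) -> str:
    # Keeps the concrete exception type in the message, so the reader
    # at least knows which category to look up.
    return f"{type(exc).__name__}: {exc}"


err = DeadlineExceeded("task terminated due to exceeding the deadline")
print(report_vague(err))
print(report_specific(err))
```

The second form carries no more detail about the cause, but it gives the user a name to search for.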

[Expected Requirement Behavior]
Following Oracle’s ORA-XXX error number approach, provide a detailed definition of each error number, its cause, and possible solutions. For example, ORA-00600 indicates an internal error, which may be caused by a bug, and most such issues can be resolved by looking up the error’s arguments on Oracle support.
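
As a rough sketch of what such an error catalog could look like, the snippet below maps an error number to its message, cause, and suggested action. The code "EX-0001", the ErrorEntry type, and the explain function are hypothetical names invented for this example, and the cause and action texts are placeholders rather than real diagnoses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEntry:
    code: str
    message: str
    cause: str
    action: str


# Hypothetical catalog: the code and the cause/action texts are placeholders.
CATALOG = {
    "EX-0001": ErrorEntry(
        code="EX-0001",
        message="Coprocessor task terminated due to exceeding the deadline",
        cause="(placeholder) to be documented by the maintainers",
        action="(placeholder) to be documented by the maintainers",
    ),
}


def explain(code: str) -> str:
    """Return a human-readable explanation for an error code."""
    entry = CATALOG.get(code)
    if entry is None:
        return f"{code}: unknown error code, no catalog entry"
    return (
        f"{entry.code}: {entry.message}\n"
        f"  Cause: {entry.cause}\n"
        f"  Action: {entry.action}"
    )


print(explain("EX-0001"))
```

With a stable code in every message, users could search documentation by that code instead of guessing from the wording.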

[Background Information]
Clearer error messages would allow users to resolve most common issues on their own based on the prompts.

| username: redgame | Original post link

Good suggestion.

| username: MrSylar | Original post link

The suggestion is good, indeed there are many error messages that are hard to understand.

| username: zhanggame1 | Original post link

Indeed, an error ID should be added.

| username: 像风一样的男子 | Original post link

It’s a good idea to refer to Oracle and create an error list for self-troubleshooting.

| username: WalterWj | Original post link

Like :blush:

| username: 春风十里 | Original post link

The suggestion to refine error reporting and optimize the code is a very good one, and so is refining wait events.

| username: OnTheRoad | Original post link

TiDB now seems like a runaway horse, with all kinds of attractive new features. On closer inspection, however, none of these features is fully polished, and each needs more optimization. Over time, the more features accumulate, the more technical debt is incurred.
Personally, I feel that the features addressing real user pain points should be refined and perfected, rather than building a product that tries to be large and all-encompassing.

| username: 有猫万事足 | Original post link

Speaking of this, I have to mention an article by PingCAP’s CTO that I came across yesterday.

I’ll share it here again; it contains a lot of critiques from the CTO about their own product. It’s quite interesting. :joy:

| username: ffeenn | Original post link

Indeed, this could be improved by introducing error codes with explanations.

| username: Jellybean | Original post link

The expert’s suggestion is very pertinent, I second it.

I hope the official team can allocate more resources to optimization in this area.

| username: OnTheRoad | Original post link

Great in-depth article! What it says about monitoring is something I have mentioned in earlier posts. TiDB Grafana has a huge number of monitoring metrics (at least a hundred, right?), and when something goes wrong with the database, the sheer number of charts is overwhelming and you do not know where to start. Then there are the cluster configuration adjustments, with separate settings for PD, TiDB, and TiKV, some of which can be adjusted dynamically while others require editing the configuration file and reloading. That is also very confusing, and there are far too many parameters to remember.

| username: MrSylar | Original post link

I admire TiDB for this: daring to criticize itself publicly and never disparaging competitors.

| username: zhanggame1 | Original post link

I asked a question about a parameter a few days ago because I was very confused. It took me several days to understand it, and now I can no longer remember it clearly. The parameters are persisted in etcd, so you cannot see them in any configuration file.

| username: OnTheRoad | Original post link

The deeper the love, the greater the expectation.