[TiDBer Chat Session 65] Win a Ti Red Camping Set, Share Common TiDB Errors & Solutions Without the Hassle!

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 【TiDBer 唠嗑茶话会 65 】赢取 Ti 红露营五件套,分享常见 TiDB 错误 & 解决不费脑!

| username: TiDB社区小助手

:fire: :fire: Ti Red Camping Gear Five-Piece Set Limited Time Drop in This Chat Session!! :fire:

Every TiDBer must be asking:
:person_raising_hand:: How can we get the Ti Red Camping Gear Five-Piece Set???
:raising_hand_man:: Don’t rush, everyone. Let the bot first introduce this session’s theme: the TiDB Common Errors & Solutions Sharing Conference.
(PS: If you really are in a hurry, feel free to jump straight to [Ti Red Camping Gear Limited Prize] at the end to check the conditions)

When developing on or operating TiDB, we often run into errors and issues. Understanding these errors and their solutions helps us manage the TiDB database better and work more efficiently. So this session is a TiDB Common Errors & Solutions Sharing Conference! Let’s share the paths we’ve already walked with more friends~

For example :chestnut::

  • “PD server timeout” error: This error usually means a request to the PD server (Placement Driver) timed out, which may be caused by network failures or high load on the PD cluster.
  • Solutions include adding PD cluster nodes, optimizing PD cluster performance, checking the status of PD nodes, and fixing any downtime or network issues.
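A quick first-pass health check of PD could look like the sketch below. The cluster name, PD address, and version tag are placeholders; match the ctl version to your own cluster.

```shell
# Confirm all PD nodes are Up and identify the current leader.
tiup cluster display <cluster-name>

# Per-member health and membership via pd-ctl.
tiup ctl:v6.5.0 pd -u http://<pd-host>:2379 health
tiup ctl:v6.5.0 pd -u http://<pd-host>:2379 member
```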

Another example :chestnut::

  • “Error 1009: invalid time zone” error: This error usually indicates an incorrect time zone setting, which may be caused by issues with the operating system or TiDB configuration.
  • Solutions include modifying the time zone settings and restarting the TiDB cluster.
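As a minimal check-and-fix sketch (the target zone 'Asia/Shanghai' is only an example value):

```sql
-- Compare TiDB's time zone settings with the OS-level zone.
SELECT @@global.time_zone, @@session.time_zone, @@system_time_zone;

-- Correct the global setting; new sessions pick it up.
SET GLOBAL time_zone = 'Asia/Shanghai';
```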

So, TiDBers, show off your experiences this session~ PS: When sharing, please follow the two-part format below~👇

Common Errors + Solutions

:sparkles: After the topic ends, everyone’s shares will be compiled into a [Community Wisdom] digest of practical tips~

This Session’s Topic:

Come and share common TiDB errors & their solutions~ (PS: both parts are required)

Activity Rewards:

Ti Red Camping Gear Limited Prize

  • The TiDBer who shares according to the format and receives the most likes💗 in this chat session will be rewarded with the Ti Red Camping Gear Five-Piece Set!
  • The TiDBer who shares according to the format and contributes the most common TiDB database errors and corresponding solutions in this chat session will be rewarded with the Ti Red Camping Gear Five-Piece Set!

Participation Award:

TiDBers who share in the [Common Errors + Solutions] format will receive a 30-point reward~

Activity Time:

2023.3.31-2023.4.7

Ti Red Camping Gear Five-Piece Set Display


| username: ShawnYan | Original post link

For example :chestnut:

| username: xfworld | Original post link

[Common Errors]
TiKV memory usage surges and does not recede


[Solutions]

  1. Check whether the TiKV binary is the officially compiled release. If not, check the Linux memory allocation scheme it was built with (an incorrect allocator choice can keep memory from being released in a timely manner).
  2. Disk IO is insufficient and writes are slow, so data accumulates in memory faster than it can be flushed to disk. Enable flow control to keep memory usage and disk flushing stable…
  3. Slow queries pushed down to TiKV occupy resources for a long time without releasing them. Locate these slow queries, optimize them, and reduce the long-term resource occupation.
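For point 2, a hedged sketch of turning on TiKV write flow control (available since v5.2) on a TiUP-managed cluster; the cluster name is a placeholder:

```shell
# Add the flow-control switch to the topology:
tiup cluster edit-config <cluster-name>
#   server_configs:
#     tikv:
#       storage.flow-control.enable: true

# Apply the change to the TiKV nodes.
tiup cluster reload <cluster-name> -R tikv
```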
| username: Jellybean | Original post link

  • When executing SQL, especially long-running SQL, sometimes an “Information schema is changed” error occurs.

    • Principle: TiDB processes each SQL statement against the schema in effect when the statement started executing, and it supports online asynchronous DDL changes. A DML statement may therefore run concurrently with a DDL statement, but every statement must execute against a single consistent schema, so DML that crosses an in-flight DDL can fail with an “Information schema is changed” error.
    • Reason 1: The table involved in the ongoing DML and the table in the cluster where the DDL is being executed are the same.
    • Reason 2: The DML execution time is very long, during which many DDL statements are executed in the cluster, or TiDB cannot load the schema information for a long time due to network issues, causing the number of intermediate schema version changes to exceed the value of tidb_max_delta_schema_count.
  • Solution:

    • Retry the SQL after it fails; in most cases, it can recover.
    • Check the network conditions between clusters to rule out network issues.
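For reason 2, you can inspect and, if long-running DML is routine in your workload, raise the number of intermediate schema versions TiDB tolerates (4096 below is only an example value):

```sql
-- Default is 1024.
SHOW VARIABLES LIKE 'tidb_max_delta_schema_count';

SET GLOBAL tidb_max_delta_schema_count = 4096;
```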
| username: TiDBer_pFFcXLgY | Original post link

[Common Errors]
The query uses a hint but still does not read from TiFlash, resulting in slow query performance.
However, after manually forcing TiFlash at the SESSION level, the query is very fast.

[Solution]
Add the hint to the subquery as well;
the table’s statistics might also be inaccurate. You can manually run ANALYZE TABLE to collect statistics, after which the optimizer may choose TiFlash by default without any hint.
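A hedged illustration with hypothetical tables t and s: a hint in the outer query does not cover the subquery’s query block, so the subquery needs its own.

```sql
SELECT /*+ READ_FROM_STORAGE(TIFLASH[t]) */ t.a
FROM t
WHERE t.a IN (
    SELECT /*+ READ_FROM_STORAGE(TIFLASH[s]) */ s.a FROM s
);

-- Refresh statistics so the optimizer can choose TiFlash on its own.
ANALYZE TABLE t;
ANALYZE TABLE s;
```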

| username: 清风明月 | Original post link

[Common Errors]

  1. When creating Pods for each component through TiDB Operator, they often remain in a Pending state, usually stuck on the creation of one specific component’s service.
    [Solution]
    This is usually due to insufficient resources. You can check the specifics by running kubectl describe po -n ${namespace} ${pod_name}. Most of the time the PV has not been bound properly; modifying or deleting the relevant bindings resolves it.

  2. TiDB service components often encounter CrashLoopBackOff.
    [Reason]
    The PV already contains data, or the process lacks permission to write to it.
    [Solution]

    1. Run kubectl describe pod/$pod -n $ns. You’ll see the initialization fail, restart, and fail again.

    2. Check the pod’s logs; the errors generally show leftover files causing data initialization to fail, so the new pod cannot write its data into the volume.

    3. Unbind the PV and PVC, delete the data on the PV, and then rebind; alternatively, directly delete the data on the bound PV. The kubelet will restart the container after about two minutes, and the pod’s status will change to Running. (A consolidated command sketch follows this list.)
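A hedged consolidation of that flow; the pod, PVC, and namespace names are placeholders, and the exact cleanup step depends on how your volumes are provisioned:

```shell
# Confirm the init failure and read the previous container's logs.
kubectl describe pod ${pod} -n ${ns}
kubectl logs ${pod} -n ${ns} --previous

# Release the claim, clean the data on the underlying volume,
# then let TiDB Operator recreate the pod and rebind.
kubectl delete pvc ${pvc} -n ${ns}
```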

| username: 天蓝色的小九 | Original post link

Error accessing PD: TiKV cluster is not bootstrapped

Most of PD’s APIs can only be used after the TiKV cluster has been initialized. If you only start PD when deploying a new cluster and have not yet started TiKV, you will encounter this error when accessing PD. To resolve this error, you should first start the TiKV that you want to deploy. TiKV will automatically complete the initialization process, after which you can access PD normally.
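With a manual deployment this simply means starting the planned tikv-server processes; on a TiUP-managed cluster, a hedged equivalent (cluster name and address are placeholders) is:

```shell
# Start only the TiKV role; bootstrapping completes automatically.
tiup cluster start <cluster-name> -R tikv

# PD should then respond normally; match ctl to your cluster version.
tiup ctl:v6.5.0 pd -u http://<pd-host>:2379 cluster
```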

| username: YuchongXU | Original post link

Query abnormally slow?

| username: TiDBer_AAeSUyTk | Original post link

Is there a quick way to resolve CrashLoopBackOff?

| username: 清风明月 | Original post link

Yes. When initializing the whole cluster, you need to start PD first, then TiKV, and finally TiDB. The cluster’s persistent data lives in TiKV, so once TiKV starts normally, the whole service can use PD normally.

| username: 清风明月 | Original post link

Check the StorageClass (SC) bound to the Persistent Volume (PV). You can set the PV’s reclaim policy to Delete, so the data inside is automatically removed when the volume is released.
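A hedged one-liner for an existing volume (the PV name is a placeholder); for new volumes, set reclaimPolicy: Delete on the StorageClass instead:

```shell
kubectl patch pv <pv-name> \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
```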

| username: 啦啦啦啦啦 | Original post link

【Common Error】
A common error during scale-in: in a 3-node TiKV cluster, if one node goes down and cannot be brought back up, the node being removed remains in the Offline state and the scale-in cannot proceed normally.

【Reason】
To ensure three replicas, having only 2 TiKV nodes will prevent region replicas from being scheduled, thus making it impossible to scale down normally.

【Solution】
First, scale up by adding a node to ensure the cluster meets the condition of having at least 3 nodes. Once the region replicas can be scheduled normally, proceed with scaling down the TiKV.
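A hedged sequence of that fix on a TiUP-managed cluster (cluster name, topology file, and addresses are placeholders):

```shell
# 1. Add a replacement TiKV node first.
tiup cluster scale-out <cluster-name> scale-out.yaml

# 2. Watch replicas reschedule off the down store.
tiup ctl:v6.5.0 pd -u http://<pd-host>:2379 store

# 3. Once regions are healthy again, remove the dead node.
tiup cluster scale-in <cluster-name> -N <down-host>:20160
```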

| username: DBRE | Original post link

[Common Errors]
Excessive TiKV metrics leading to huge Prometheus storage and multiple Prometheus restarts

[Cause]
Excessive TiKV metrics

[Solution]
Add the following lines to the Prometheus configuration file, under the tikv job:

metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: tikv_thread_nonvoluntary_context_switches|tikv_thread_voluntary_context_switches|tikv_threads_io_bytes_total
    action: drop
  - source_labels: [__name__, name]
    separator: ;
    regex: tikv_thread_cpu_seconds_total;(tokio|rocksdb).+
    action: drop

However, why TiKV emits so many metrics still needs to be identified by the official team. See the earlier post: “TiKV’s status endpoint outputs too many metrics, how can this be optimized?” (tikv状态接口输出metric过多,请问如何优化呢? - TiDB Q&A community)

| username: TiDBer_m6V1BalM | Original post link

Frequent crashes are often not a database issue: many database anomalies turn out to be caused by misconfiguration on the operating system, network, or application side!

| username: Soysauce520 | Original post link

Drainer component replication lag: this issue arises when the business write load exceeds what a single drainer process can handle.

Solution: split the load across multiple drainers by scaling out drainer, using the syncer.replicate-do-db parameter so each instance replicates a different set of databases.

Config (in the drainer configuration file):

[syncer]
replicate-do-db = ["database1", "database2", "database3"]

| username: ddhe9527 | Original post link

When executing SQL statements, an error occurs: 1105 - Out Of Memory Quota
Solution: You can set tidb_mem_quota_query to a sufficiently large value at the session level (the unit is bytes; 8589934592 = 8 GiB), for example:

set tidb_mem_quota_query=8589934592;
| username: tomxu | Original post link

[Common Errors]

  1. When using TiDB Lightning to import data, the following situations may cause the import to fail:

[Solutions]

  1. Data source error: TiDB Lightning requires the MySQL database dump file format to be compliant with specifications, otherwise, the import may fail. You can use the mydumper tool to generate compliant dump files or use other tools for data preprocessing to meet the requirements.
  2. Configuration error: The configuration file of TiDB Lightning needs to specify the correct parameters, such as MySQL database address, port, username, password, etc. If the configuration is incorrect, the import will also fail.
  3. Insufficient disk space: During the import process, TiDB Lightning temporarily stores the imported data on the disk and deletes it after the import is complete. If there is insufficient disk space, the import will also fail.
  4. TiDB cluster failure: During the import process, if the TiDB cluster fails, such as PD node crash, TiKV node crash, etc., the import will also fail.
  5. Data conflict: During the import process, if there is conflicting data in the original database with the data to be imported, such as primary key conflicts, the import will also fail.
  6. Permission issues: TiDB Lightning also requires permissions for the downstream database account. If sufficient permissions are not granted, permission errors will occur during the import process.
    In summary, there can be many reasons for TiDB Lightning import failures, and troubleshooting needs to be done based on the specific situation. You can check the error messages of TiDB Lightning and analyze the actual situation to gradually solve the problem.
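Before kicking off an import, a few hedged pre-flight checks covering causes 2, 3, and 6 above (paths, addresses, and the account are placeholders):

```shell
# Cause 3: enough temporary/sort space on disk?
df -h /data/sorted-kv

# Causes 2 and 6: can the configured account reach the downstream TiDB?
mysql -h <tidb-host> -P 4000 -u <lightning-user> -p -e 'SELECT 1'

# Then run the import with your config file.
tiup tidb-lightning -config tidb-lightning.toml
```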

[Common Errors]
2. The most common types of problems and their solutions during TiDB usage:

[Solutions]

  1. TiDB cluster cannot start normally:
    Possible situations include TiDB component configuration errors, physical machine hardware failures, TiDB version incompatibility, etc. The solution requires locating the problem based on specific error logs and troubleshooting step by step.
  2. Poor TiDB cluster performance:
    Possible reasons include insufficient TiDB performance tuning, data distribution imbalance caused by write hotspots, etc. The solution can be to use the TiUP Bench tool to perform benchmark testing on the cluster and then perform related optimizations.
  3. TiDB import/export data failure:
    Main issues include insufficient disk space, large data volume causing insufficient memory, unstable database connection, etc. The solution can be to add more disks or increase nodes in the TiDB cluster to improve resource capacity, or try splitting the data into multiple small batches for export.
  4. TiDB cluster data corruption:
    Possible reasons include disk failure, node crash, operating system anomalies, etc. The solution can be to solve the problem through backup and restore of the cluster.
  5. Insufficient TiDB SQL optimization:
    Possible reasons include unreasonable SQL writing, unreasonable table design, etc. The solution can be to use the TiDB Query Profiling tool to analyze SQL query performance and optimize it (a minimal example follows this list).
  6. TiDB security risks:
    Possible issues include privacy data leakage, insecure authentication, etc. The solution can be to strengthen the security mechanism of the TiDB cluster, restrict certain permissions and operations, and protect sensitive data security.
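For item 5, a minimal hedged example (table t and its query are hypothetical): TiDB’s EXPLAIN ANALYZE executes the statement and reports actual per-operator time and row counts, which helps spot plans hurt by poor SQL or table design.

```sql
-- Compare estRows vs. actRows and per-operator execution time.
EXPLAIN ANALYZE SELECT * FROM t WHERE k = 1;
```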