What should you check when inspecting a TiDB database cluster?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB数据库集群巡检有哪些注意项?

| username: 江湖故人

[TiDB Usage Environment] Production Environment
What areas do you focus on during daily and monthly inspections when maintaining TiDB?

| username: tidb菜鸟一只 | Original post link

Generally, these are the main areas:

  1. Node Status Monitoring: Check the status of each node in the TiDB cluster, including TiDB Server, TiKV, and PD nodes. Ensure that the nodes are running normally without any anomalies or error messages.
  2. Resource Utilization Check: Observe the resource utilization of the TiDB cluster, including CPU, memory, disk, and network. Ensure that resources are sufficient and there are no obvious performance bottlenecks.
  3. Log and Error Log Analysis: Check the logs and error logs of the TiDB cluster to identify potential issues or anomalies. Pay special attention to warning messages and abnormal events in the error logs.
  4. Performance Tuning and Optimization: Evaluate the performance of the TiDB cluster, identify potential performance bottlenecks, and perform targeted tuning and optimization to improve database performance and response speed.
  5. Data Backup and Recovery: Confirm that the data backup strategy of the TiDB cluster is being executed normally and test the recovery process to ensure that data can be restored promptly in case of a failure.

| username: TiDBer_小阿飞 | Original post link

| username: 春风十里 | Original post link

Monthly or quarterly inspections look at long-term trends in CPU, memory, disk space, and network bandwidth to judge whether capacity risks are approaching. They also check whether the execution time and frequency of key SQL statements show a sustained upward trend, which signals a risk that builds up gradually.
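The capacity-risk check described here can be approximated with a simple linear trend. This is only a minimal sketch, assuming daily disk-usage samples (in percent) have already been exported from monitoring; the function name `days_until_full`, the 80% limit, and the sample values in the usage note are hypothetical.

```python
from typing import Optional, Sequence


def days_until_full(usage_pct: Sequence[float], limit_pct: float = 80.0) -> Optional[float]:
    """Estimate days until usage reaches limit_pct, using a least-squares slope.

    usage_pct holds one sample per day, oldest first. Returns None when the
    trend is flat or decreasing, and 0.0 when the limit is already reached.
    """
    n = len(usage_pct)
    if n < 2:
        raise ValueError("need at least two daily samples")
    if usage_pct[-1] >= limit_pct:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(usage_pct) / n
    # Ordinary least-squares slope: growth in percentage points per day.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage_pct))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    # Extrapolate from the most recent sample at the fitted growth rate.
    return (limit_pct - usage_pct[-1]) / slope
```

For example, `days_until_full([50, 51, 52, 53, 54])` sees a growth of one percentage point per day and estimates 26 days until the 80% limit. A straight line is a crude model, so treat the result as a prompt to look closer, not as a forecast.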

| username: zxgaa | Original post link

View long-term resource usage trends, error logs, read/write hotspots, etc.

| username: 江湖故人 | Original post link

Thank you for the summary. It should be similar to the maintenance work of other databases.

| username: 江湖故人 | Original post link

  1. Node status, version, and start time in the instance panel
  2. Machine CPU/memory/disk in the host panel
  3. Top SQL panel, to review each day's SQL activity
  4. Region information panel, including lagging regions and regions lacking replicas
  5. KV request latency
  6. PD request TSO wait time
  7. Load, network, CPU, and I/O in the overview panel
  8. Number of abnormal requests
  9. `SELECT * FROM mysql.tidb;` to check GC (Garbage Collection) status. If GC fails, too much historical data may be retained, which degrades access efficiency.
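The GC check in the last item can be scripted. This is only a minimal sketch, assuming the `variable_name` / `variable_value` rows of `mysql.tidb` have already been fetched into a dict, and that `tikv_gc_last_run_time` looks like `20240102-15:04:05 +0800` (fractional seconds, if present, are dropped). The helper names and the 30-minute threshold are hypothetical; pass a timezone-aware `now`, or omit it to use the current UTC time.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


def parse_gc_time(value: str) -> datetime:
    """Parse a timestamp such as '20240102-15:04:05 +0800' (assumed format)."""
    # Drop optional fractional seconds, e.g. '15:04:05.000' -> '15:04:05'.
    cleaned = re.sub(r"\.\d+", "", value.strip())
    return datetime.strptime(cleaned, "%Y%m%d-%H:%M:%S %z")


def gc_is_stale(
    tidb_vars: Mapping[str, str],
    now: Optional[datetime] = None,
    max_lag: timedelta = timedelta(minutes=30),  # hypothetical threshold
) -> bool:
    """Return True when the last GC run is older than max_lag."""
    if now is None:
        now = datetime.now(timezone.utc)
    last_run = parse_gc_time(tidb_vars["tikv_gc_last_run_time"])
    return now - last_run > max_lag
```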

| username: Fly-bird | Original post link

  1. Server-related
  2. TiDB service-related
  3. Data-related (TiDB Dashboard)

| username: Kongdom | Original post link

Here are a few additional tasks from our side:

  1. Check table health
  2. Check TiUP backup status
  3. Check the error logs of each component via TiDB Dashboard
  4. Check the status of NTP and Nginx
  5. Check scheduled tasks such as data synchronization
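For the table health item, one possible reading is statistics health as reported by `SHOW STATS_HEALTHY`. This is only a minimal sketch under that assumption: it takes already-fetched rows as `(Db_name, Table_name, Partition_name, Healthy)` tuples and lists the tables that may need `ANALYZE TABLE`. The function name `unhealthy_tables` and the threshold of 80 are hypothetical.

```python
from typing import Iterable, List, Tuple

# One row of SHOW STATS_HEALTHY: (Db_name, Table_name, Partition_name, Healthy)
Row = Tuple[str, str, str, int]


def unhealthy_tables(rows: Iterable[Row], threshold: int = 80) -> List[str]:
    """Return table names whose health score is below threshold, worst first."""
    low = [
        (healthy, f"{db}.{table}" + (f" ({part})" if part else ""))
        for db, table, part, healthy in rows
        if healthy < threshold
    ]
    # Sorting the (score, name) pairs puts the lowest health score first.
    return [name for _, name in sorted(low)]
```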

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.