Suddenly unable to connect to TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb突然连接不上

| username: cheng

【TiDB Environment】Production
【TiDB Version】5.7.25-TiDB-v5.3.0
【Encountered Problem】Business cannot connect
【Reproduction Path】What operations were performed to cause the problem
【Problem Phenomenon and Impact】

The business side reported that they couldn’t connect to TiDB.
I checked the dashboard and saw that TiDB had restarted.
Then I checked the tidb.log, as follows:
[2022/08/10 14:32:44.867 +08:00] [INFO] [trackerRecorder.go:29] [“Mem Profile Tracker started”]
[2022/08/10 14:32:44.868 +08:00] [INFO] [printer.go:48] [“loaded config”] [config=“{"host":"0.0.0.0","advertise-address":"10.33.xx.xx","port":4000,"cors":"","store":"tikv","path":"10.33…xx.xx:2379,10.33…xx.xx:2379,10.33…xx.xx:2379","socket":"/tmp/tidb-4000.sock","lease":"45s","run-ddl":true,"split-table":true,"token-limit":1000,"oom-use-tmp-storage":true,"tmp-storage-path":"/tmp/1003_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage","oom-action":"cancel","mem-quota-query":1073741824,"tmp-storage-quota":-1,"enable-batch-dml":false,"lower-case-table-names":2,"server-version":"","log":{"level":"info","format":"text","disable-timestamp":null,"enable-timestamp":null,"disable-error-stack":null,"enable-error-stack":null,"file":{"filename":"/httx/run/tidb/deploy/tidb-4000/log/tidb.log","max-size":300,"max-days":0,"max-backups":0},"enable-slow-log":true,"slow-query-file":"/httx/run/tidb/deploy/tidb-4000/log/tidb_slow_query.log","slow-threshold":300,"expensive-threshold":10000,"query-log-max-len":4096,"record-plan-in-slow-log":1},"security":{"skip-grant-table":false,"ssl-ca":"","ssl-cert":"","ssl-key":"","require-secure-transport":false,"cluster-ssl-ca":"","cluster-ssl-cert":"","cluster-ssl-key":"","cluster-verify-cn":null,"spilled-file-encryption-method":"plaintext","enable-sem":false,"auto-tls":false,"tls-version":"","rsa-key-size":4096,"secure-bootstrap":false},"status":{"status-host":"0.0.0.0","metrics-addr":"","status-port":10080,"metrics-interval":15,"report-status":true,"record-db-qps":false},"performance":{"max-procs":0,"max-memory":0,"server-memory-quota":0,"memory-usage-alarm-ratio":0.8,"stats-lease":"3s","stmt-count-limit":5000,"feedback-probability":0,"query-feedback-limit":512,"pseudo-estimate-ratio":0.8,"force-priority":"NO_PRIORITY","bind-info-lease":"3s","txn-entry-size-limit":6291456,"txn-total-size-limit":104857600,"tcp-keep-alive":true,"tcp-no-delay":true,"cross-join":true,"run-auto-analyze":true,"distinct-agg-push-down":false,"committer-concurrency":128,"max-txn-ttl":3600000,"mem-profile-interval":"1m","index-usage-sync-lease":"0s","plan-replayer-gc-lease":"10m","gogc":100,"enforce-mpp":false},"prepared-plan-cache":{"enabled":false,"capacity":1000,"memory-guard-ratio":0.1},"opentracing":{"enable":false,"rpc-metrics":false,"sampler":{"type":"const","param":1,"sampling-server-url":"","max-operations":0,"sampling-refresh-interval":0},"reporter":{"queue-size":0,"buffer-flush-interval":0,"log-spans":false,"local-agent-host-port":""}},"proxy-protocol":{"networks":"","header-timeout":5},"pd-client":{"pd-server-timeout":3},"tikv-client":{"grpc-connection-count":4,"grpc-keepalive-time":10,"grpc-keepalive-timeout":3,"grpc-compression-type":"none","commit-timeout":"41s","async-commit":{"keys-limit":256,"total-key-size-limit":4096,"safe-window":2000000000,"allowed-clock-drift":500000000},"max-batch-size":128,"overload-threshold":200,"max-batch-wait-time":0,"batch-wait-size":8,"enable-chunk-rpc":true,"region-cache-ttl":600,"store-limit":0,"store-liveness-timeout":"1s","copr-cache":{"capacity-mb":1000},"ttl-refreshed-txn-size":33554432,"resolve-lock-lite-threshold":16},"binlog":{"enable":true,"ignore-error":true,"write-timeout":"15s","binlog-socket":"","strategy":"range"},"compatible-kill-query":false,"plugin":{"dir":"/data/deploy/plugin","load":""},"pessimistic-txn":{"max-retry-count":256,"deadlock-history-capacity":10,"deadlock-history-collect-retryable":false},"check-mb4-value-in-utf8":true,"max-index-length":3072,"index-limit":64,"table-column-count-limit":1017,"graceful-wait-before-sh
utdown":0,"alter-primary-key":false,"treat-old-version-utf8-as-utf8mb4":true,"enable-table-lock":false,"delay-clean-table-lock":0,"split-region-max-num":1000,"stmt-summary":{"enable":true,"enable-internal-query":false,"max-stmt-count":3000,"max-sql-length":4096,"refresh-interval":1800,"history-size":24},"top-sql":{"receiver-address":""},"repair-mode":false,"repair-table-list":,"isolation-read":{"engines":["tikv","tiflash","tidb"]},"max-server-connections":0,"new_collations_enabled_on_first-bootstrap":false,"experimental":{"allow-expression-index":false},"enable-collect-execution-info":true,"skip-register-to-dashboard":false,"enable-telemetry":true,"labels":{},"enable-global-index":false,"deprecate-integer-display-length":false,"enable-enum-length-limit":true,"stores-refresh-interval":60,"enable-tcp4-only":false,"enable-forwarding":false,"max-ballast-object-size":0,"ballast-object-size":0}”]
[2022/08/10 14:32:44.868 +08:00] [INFO] [client.go:352] [“[pd] create pd client with endpoints”] [pd-address=“[10.33…xx.xx:2379,10.33…xx.xx:2379,10.33…xx.xx:2379]”]
[2022/08/10 14:32:44.877 +08:00] [INFO] [base_client.go:349] [“[pd] switch leader”] [new-leader=http://10.33.xx.xx:2379] [old-leader=]
[2022/08/10 14:32:44.877 +08:00] [INFO] [base_client.go:104] [“[pd] init cluster id”] [cluster-id=7054449822183088374]
[2022/08/10 14:32:44.877 +08:00] [INFO] [client.go:648] [“[pd] tso dispatcher created”] [dc-location=global]

[2022/08/10 14:32:44.877 +08:00] [ERROR] [client.go:845] [“[pd] update connection contexts failed”] [dc=global] [error=“rpc error: code = Canceled desc = context canceled”]

[2022/08/10 14:32:44.877 +08:00] [INFO] [client.go:666] [“[pd] exit tso dispatcher”] [dc-location=global]
[2022/08/10 14:32:44.879 +08:00] [INFO] [main.go:340] [tidb-server] [“create pumps client success, ignore binlog error”=true]
[2022/08/10 14:32:44.879 +08:00] [INFO] [main.go:349] [“disable Prometheus push client”]

Grafana monitoring also shows TiDB CPU dropped to 0.

| username: cheng | Original post link

How do I troubleshoot the reason for TiDB restarting?

| username: h5n1 | Original post link

Search the logs for the keyword “Welcome” (each tidb-server start prints a Welcome banner) and see what was logged just before it; a grep sketch follows the list below.
Common reasons for OOM:

  1. Large data volume or high concurrency leading to excessive memory usage
  2. analyze_version=2
  3. Other bugs
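
A minimal sketch for locating the restart points; the log path is taken from log.file.filename in the config dump above, and the example line number is hypothetical:

    # each tidb-server start prints a "Welcome" banner; list the banners with line numbers
    grep -n "Welcome" /httx/run/tidb/deploy/tidb-4000/log/tidb.log
    # then inspect what was logged just before a given banner, e.g. around line 120000
    sed -n '119950,120000p' /httx/run/tidb/deploy/tidb-4000/log/tidb.log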
| username: cheng | Original post link

Hello, I searched for “Welcome” and didn’t see anything unusual in the logs before it!

| username: cheng | Original post link

In the tidb-server monitoring, I noticed that the memory usage of the node that restarted peaked at 19.8 (out of 23.3) just before the restart. Could that proportion cause an OOM?

| username: h5n1 | Original post link

  1. Check whether there were any slow SQL statements scanning large data volumes before the OOM. Look at the slow queries in the dashboard, and query STATEMENTS_SUMMARY and STATEMENTS_SUMMARY_HISTORY to see which statements used the most memory, sorting or summing on max_mem (see the SQL sketch after this list).
  2. Set analyze_version=1, and follow the steps in the document below to delete the existing version=2 statistics:
    Statistics | PingCAP Docs
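
A rough SQL sketch of both steps, assuming the information_schema summary tables named above; the time window and LIMIT are placeholders to adjust to the actual restart time:

    -- top memory consumers in the window before the restart
    SELECT SCHEMA_NAME, DIGEST_TEXT, EXEC_COUNT, AVG_MEM, MAX_MEM
    FROM information_schema.statements_summary_history
    WHERE SUMMARY_BEGIN_TIME BETWEEN '2022-08-10 13:30:00' AND '2022-08-10 14:35:00'
    ORDER BY MAX_MEM DESC
    LIMIT 20;

    -- fall back to version 1 statistics; existing version-2 statistics still
    -- need to be removed per the linked document (e.g. DROP STATS <table>)
    SET GLOBAL tidb_analyze_version = 1;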
| username: TiDBer_jYQINSnf | Original post link

This should be an OOM (Out of Memory) issue. You can execute dmesg on the corresponding physical machine, which will definitively tell you if it is an OOM kill. As for the reason for the OOM, it is likely due to a large number of table scans. Check the execution plan on the dashboard and make the necessary adjustments to the SQL. If you cannot modify the SQL, adjust the parameters, such as oom-action.
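
For reference, the memory-control parameters mentioned here already appear in the config dump above (oom-action and mem-quota-query). A sketch of where they would live in a tiup-managed cluster; the values shown are simply the ones from the dump, not a recommendation:

    # under server_configs.tidb in the topology
    # (tiup cluster edit-config <cluster-name>, then tiup cluster reload)
    oom-action: "cancel"           # cancel a statement that exceeds the quota instead of letting memory grow
    mem-quota-query: 1073741824    # per-statement memory quota in bytes (1 GiB)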

| username: cheng | Original post link

Before the incident there weren’t particularly many slow SQL statements, and the max_mem values I got from those two summary tables weren’t large for the period around the incident. Or do the values need to be summed up? I also confirmed that analyze_version=2. Could it have been caused by a lock?

| username: cheng | Original post link

How do I execute dmesg?

| username: TiDBer_jYQINSnf | Original post link

Run it the same way you would run ping: just type dmesg on the host. It prints a lot of output, and if a process was killed by the kernel it will show up there. Judging from the logs, this should be an OOM kill.
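
A minimal sketch of that check (standard util-linux dmesg and grep options, run on the host where the tidb-server runs):

    # human-readable timestamps, case-insensitive match on OOM-killer activity
    dmesg -T | grep -i -E "out of memory|oom-killer|killed process"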

| username: h5n1 | Original post link

It is probably caused by analyze.

| username: alfred | Original post link

Check the operating system logs (e.g. /var/log/messages on Linux) to confirm whether there was an OOM (Out of Memory) event, then analyze the TiDB logs along with the number of active connections and the slow SQL around the time the issue occurred.
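
A sketch of that OS-log check, assuming a RHEL/CentOS-style /var/log/messages (Debian-family hosts use /var/log/syslog instead):

    # kernel OOM-killer entries around the incident on Aug 10
    grep -i -E "out of memory|oom-killer" /var/log/messages | grep "Aug 10"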

| username: cheng | Original post link

Would adjusting analyze_version=1 solve this problem?

| username: cheng | Original post link

Okay, confirmed, it indeed OOMed. Thank you very much.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.