Suddenly unable to connect to TiDB

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb突然连接不上

| username: cheng

【TiDB Environment】Production
【TiDB Version】5.7.25-TiDB-v5.3.0
【Encountered Problem】Business cannot connect
【Reproduction Path】What operations were performed to cause the problem
【Problem Phenomenon and Impact】

The business side reported that they couldn’t connect to TiDB.
I checked the dashboard and saw that TiDB had restarted.
Then I checked the tidb.log, as follows:
[2022/08/10 14:32:44.867 +08:00] [INFO] [trackerRecorder.go:29] [“Mem Profile Tracker started”]
[2022/08/10 14:32:44.868 +08:00] [INFO] [printer.go:48] [“loaded config”] [config=“{"host":"0.0.0.0","advertise-address":"10.33.xx.xx","port":4000,"cors":"","store":"tikv","path":"10.33…xx.xx:2379,10.33…xx.xx:2379,10.33…xx.xx:2379","socket":"/tmp/tidb-4000.sock","lease":"45s","run-ddl":true,"split-table":true,"token-limit":1000,"oom-use-tmp-storage":true,"tmp-storage-path":"/tmp/1003_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage","oom-action":"cancel","mem-quota-query":1073741824,"tmp-storage-quota":-1,"enable-batch-dml":false,"lower-case-table-names":2,"server-version":"","log":{"level":"info","format":"text","disable-timestamp":null,"enable-timestamp":null,"disable-error-stack":null,"enable-error-stack":null,"file":{"filename":"/httx/run/tidb/deploy/tidb-4000/log/tidb.log","max-size":300,"max-days":0,"max-backups":0},"enable-slow-log":true,"slow-query-file":"/httx/run/tidb/deploy/tidb-4000/log/tidb_slow_query.log","slow-threshold":300,"expensive-threshold":10000,"query-log-max-len":4096,"record-plan-in-slow-log":1},"security":{"skip-grant-table":false,"ssl-ca":"","ssl-cert":"","ssl-key":"","require-secure-transport":false,"cluster-ssl-ca":"","cluster-ssl-cert":"","cluster-ssl-key":"","cluster-verify-cn":null,"spilled-file-encryption-method":"plaintext","enable-sem":false,"auto-tls":false,"tls-version":"","rsa-key-size":4096,"secure-bootstrap":false},"status":{"status-host":"0.0.0.0","metrics-addr":"","status-port":10080,"metrics-interval":15,"report-status":true,"record-db-qps":false},"performance":{"max-procs":0,"max-memory":0,"server-memory-quota":0,"memory-usage-alarm-ratio":0.8,"stats-lease":"3s","stmt-count-limit":5000,"feedback-probability":0,"query-feedback-limit":512,"pseudo-estimate-ratio":0.8,"force-priority":"NO_PRIORITY","bind-info-lease":"3s","txn-entry-size-limit":6291456,"txn-total-size-limit":104857600,"tcp-keep-alive":true,"tcp-no-delay":true,"cross-join":true,"run-auto-analyze":true,"distinct-agg-push-down":false,"committer-concurrency":128,"max-txn-ttl":3600000,"mem-profile-interval":"1m","index-usage-sync-lease":"0s","plan-replayer-gc-lease":"10m","gogc":100,"enforce-mpp":false},"prepared-plan-cache":{"enabled":false,"capacity":1000,"memory-guard-ratio":0.1},"opentracing":{"enable":false,"rpc-metrics":false,"sampler":{"type":"const","param":1,"sampling-server-url":"","max-operations":0,"sampling-refresh-interval":0},"reporter":{"queue-size":0,"buffer-flush-interval":0,"log-spans":false,"local-agent-host-port":""}},"proxy-protocol":{"networks":"","header-timeout":5},"pd-client":{"pd-server-timeout":3},"tikv-client":{"grpc-connection-count":4,"grpc-keepalive-time":10,"grpc-keepalive-timeout":3,"grpc-compression-type":"none","commit-timeout":"41s","async-commit":{"keys-limit":256,"total-key-size-limit":4096,"safe-window":2000000000,"allowed-clock-drift":500000000},"max-batch-size":128,"overload-threshold":200,"max-batch-wait-time":0,"batch-wait-size":8,"enable-chunk-rpc":true,"region-cache-ttl":600,"store-limit":0,"store-liveness-timeout":"1s","copr-cache":{"capacity-mb":1000},"ttl-refreshed-txn-size":33554432,"resolve-lock-lite-threshold":16},"binlog":{"enable":true,"ignore-error":true,"write-timeout":"15s","binlog-socket":"","strategy":"range"},"compatible-kill-query":false,"plugin":{"dir":"/data/deploy/plugin","load":""},"pessimistic-txn":{"max-retry-count":256,"deadlock-history-capacity":10,"deadlock-history-collect-retryable":false},"check-mb4-value-in-utf8":true,"max-index-length":3072,"index-limit":64,"table-column-count-limit":1017,"graceful-wait-before-sh
utdown":0,"alter-primary-key":false,"treat-old-version-utf8-as-utf8mb4":true,"enable-table-lock":false,"delay-clean-table-lock":0,"split-region-max-num":1000,"stmt-summary":{"enable":true,"enable-internal-query":false,"max-stmt-count":3000,"max-sql-length":4096,"refresh-interval":1800,"history-size":24},"top-sql":{"receiver-address":""},"repair-mode":false,"repair-table-list":,"isolation-read":{"engines":["tikv","tiflash","tidb"]},"max-server-connections":0,"new_collations_enabled_on_first-bootstrap":false,"experimental":{"allow-expression-index":false},"enable-collect-execution-info":true,"skip-register-to-dashboard":false,"enable-telemetry":true,"labels":{},"enable-global-index":false,"deprecate-integer-display-length":false,"enable-enum-length-limit":true,"stores-refresh-interval":60,"enable-tcp4-only":false,"enable-forwarding":false,"max-ballast-object-size":0,"ballast-object-size":0}”]
[2022/08/10 14:32:44.868 +08:00] [INFO] [client.go:352] [“[pd] create pd client with endpoints”] [pd-address=“[10.33…xx.xx:2379,10.33…xx.xx:2379,10.33…xx.xx:2379]”]
[2022/08/10 14:32:44.877 +08:00] [INFO] [base_client.go:349] [“[pd] switch leader”] [new-leader=http://10.33.xx.xx:2379] [old-leader=]
[2022/08/10 14:32:44.877 +08:00] [INFO] [base_client.go:104] [“[pd] init cluster id”] [cluster-id=7054449822183088374]
[2022/08/10 14:32:44.877 +08:00] [INFO] [client.go:648] [“[pd] tso dispatcher created”] [dc-location=global]

[2022/08/10 14:32:44.877 +08:00] [ERROR] [client.go:845] [“[pd] update connection contexts failed”] [dc=global] [error=“rpc error: code = Canceled desc = context canceled”]

[2022/08/10 14:32:44.877 +08:00] [INFO] [client.go:666] [“[pd] exit tso dispatcher”] [dc-location=global]
[2022/08/10 14:32:44.879 +08:00] [INFO] [main.go:340] [tidb-server] [“create pumps client success, ignore binlog error”=true]
[2022/08/10 14:32:44.879 +08:00] [INFO] [main.go:349] [“disable Prometheus push client”]

Grafana monitoring also shows TiDB CPU dropped to 0.

| username: cheng | Original post link

How do I troubleshoot the reason for TiDB restarting?

| username: h5n1 | Original post link

Search the logs for the keyword “Welcome” (each tidb-server start prints a Welcome banner) and see what was logged just before it; a grep sketch follows the list below.
Common reasons for OOM:

  1. Large data volume or high concurrency leading to excessive memory usage
  2. analyze_version=2
  3. Other bugs
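
A minimal sketch for locating the restart points; the log path is taken from log.file.filename in the config dump above, and the example line number is hypothetical:

    # each tidb-server start prints a "Welcome" banner; list the banners with line numbers
    grep -n "Welcome" /httx/run/tidb/deploy/tidb-4000/log/tidb.log
    # then inspect what was logged just before a given banner, e.g. around line 120000
    sed -n '119950,120000p' /httx/run/tidb/deploy/tidb-4000/log/tidb.log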
| username: cheng | Original post link

Hello, I searched for “Welcome” and didn’t see anything unusual in the logs before it!

| username: cheng | Original post link

In the tidb-server monitoring, I noticed that the memory usage of the node that restarted peaked at 19.8 (out of 23.3) just before the restart. Could that proportion cause an OOM?

| username: h5n1 | Original post link

  1. Check whether there were any slow SQL statements scanning large data volumes before the OOM. Look at the slow queries in the dashboard, and query STATEMENTS_SUMMARY and STATEMENTS_SUMMARY_HISTORY to see which statements used the most memory, sorting or summing on max_mem (see the SQL sketch after this list).
  2. Set analyze_version=1, and follow the steps in the document below to delete the existing version=2 statistics:
    Statistics | PingCAP Docs
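
A rough SQL sketch of both steps, assuming the information_schema summary tables named above; the time window and LIMIT are placeholders to adjust to the actual restart time:

    -- top memory consumers in the window before the restart
    SELECT SCHEMA_NAME, DIGEST_TEXT, EXEC_COUNT, AVG_MEM, MAX_MEM
    FROM information_schema.statements_summary_history
    WHERE SUMMARY_BEGIN_TIME BETWEEN '2022-08-10 13:30:00' AND '2022-08-10 14:35:00'
    ORDER BY MAX_MEM DESC
    LIMIT 20;

    -- fall back to version 1 statistics; existing version-2 statistics still
    -- need to be removed per the linked document (e.g. DROP STATS <table>)
    SET GLOBAL tidb_analyze_version = 1;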
| username: TiDBer_jYQINSnf | Original post link

This should be an OOM (Out of Memory) issue. You can execute dmesg on the corresponding physical machine, which will definitively tell you if it is an OOM kill. As for the reason for the OOM, it is likely due to a large number of table scans. Check the execution plan on the dashboard and make the necessary adjustments to the SQL. If you cannot modify the SQL, adjust the parameters, such as oom-action.
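
For reference, the memory-control parameters mentioned here already appear in the config dump above (oom-action and mem-quota-query). A sketch of where they would live in a tiup-managed cluster; the values shown are simply the ones from the dump, not a recommendation:

    # under server_configs.tidb in the topology
    # (tiup cluster edit-config <cluster-name>, then tiup cluster reload)
    oom-action: "cancel"           # cancel a statement that exceeds the quota instead of letting memory grow
    mem-quota-query: 1073741824    # per-statement memory quota in bytes (1 GiB)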

| username: cheng | Original post link

Before the incident there weren’t particularly many slow SQL statements, and the max_mem values I got from those two summary tables weren’t large for the period around the incident. Or do the values need to be summed up? I also confirmed that analyze_version=2. Could it have been caused by a lock?

| username: cheng | Original post link

How do I execute dmesg?

| username: TiDBer_jYQINSnf | Original post link

Run it the same way you would run ping: just type dmesg on the host. It prints a lot of output, and if a process was killed by the kernel it will show up there. Judging from the logs, this should be an OOM kill.
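
A minimal sketch of that check (standard util-linux dmesg and grep options, run on the host where the tidb-server runs):

    # human-readable timestamps, case-insensitive match on OOM-killer activity
    dmesg -T | grep -i -E "out of memory|oom-killer|killed process"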

| username: h5n1 | Original post link

It is probably caused by analyze.

| username: alfred | Original post link

Check the operating system logs (e.g. /var/log/messages on Linux) to confirm whether there was an OOM (Out of Memory) event, then analyze the TiDB logs along with the number of active connections and the slow SQL around the time the issue occurred.
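
A sketch of that OS-log check, assuming a RHEL/CentOS-style /var/log/messages (Debian-family hosts use /var/log/syslog instead):

    # kernel OOM-killer entries around the incident on Aug 10
    grep -i -E "out of memory|oom-killer" /var/log/messages | grep "Aug 10"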

| username: cheng | Original post link

Would adjusting analyze_version=1 solve this problem?

| username: cheng | Original post link

Okay, confirmed, it indeed OOMed. Thank you very much.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.