Timeout Error When Querying SQL Logs in the Dashboard

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 在dashboard中查询sql日志超时报错

| username: qufudcj

[TiDB Usage Environment] Production Environment
[TiDB Version] v4.0.4
[Reproduction Path]
Open the “Slow Query” or “SQL Statement Analysis” page in the Dashboard and run a query; the request times out with the error: error.api.other: invalid connection
[Problem Encountered: Phenomenon and Impact]
When querying “Slow Query” and “SQL Statement Analysis”, the page loads for a long time and then times out with the error: error.api.other: invalid connection
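
To take the Dashboard out of the picture, the same data can be queried directly. As far as I know, the Slow Query page is backed by information_schema.cluster_slow_query, which parses the slow log files of every TiDB instance on the fly, so its latency grows with the size of those files. The sketch below is a hypothetical reproduction in Go; the address, user, and time window are assumptions to adjust for your cluster:

```go
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL protocol
)

func main() {
	// Hypothetical connection details: default TiDB port 4000, root user without password.
	db, err := sql.Open("mysql", "root@tcp(172.16.14.3:4000)/information_schema?timeout=30s&readTimeout=120s")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	start := time.Now()
	var n int
	// cluster_slow_query reads the slow log files of all TiDB instances at query time,
	// which is roughly what the Dashboard's Slow Query page does behind the scenes.
	err = db.QueryRow(
		"SELECT COUNT(*) FROM cluster_slow_query WHERE time > NOW() - INTERVAL 30 MINUTE").Scan(&n)
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Printf("rows=%d elapsed=%s\n", n, time.Since(start))
}
```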


I observed that CPU usage on the TiDB nodes rises after clicking the query button.

Looking at the TiDB error logs, the server seems to have simply restarted:

goroutine 584 [running]:
github.com/pingcap/tidb/planner/core.GetStatsInfo(0x0, 0x0, 0x0)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/planner/core/util.go:220 +0x45
github.com/pingcap/tidb/util/expensivequery.genLogFields(0xdfa123b9f, 0xc002dcd550, 0x0, 0xc128aba8edf1ef63, 0x2172bd68fa)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:112 +0x377
github.com/pingcap/tidb/util/expensivequery.logExpensiveQuery(0xdfa123b9f, 0xc002dcd550)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:168 +0x41
github.com/pingcap/tidb/util/expensivequery.(*Handle).Run(0xc0009d1240)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:67 +0x1d9
created by main.createServer
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/tidb-server/main.go:641 +0x1ab
panic: interface conversion: interface is nil, not core.Plan

goroutine 566 [running]:
github.com/pingcap/tidb/planner/core.GetStatsInfo(0x0, 0x0, 0x9)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/planner/core/util.go:220 +0x45
github.com/pingcap/tidb/util/expensivequery.genLogFields(0xdf9d21bff, 0xc0009ee790, 0x0, 0xc128ac027a19bd74, 0x41dbe15d9d)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:112 +0x377
github.com/pingcap/tidb/util/expensivequery.logExpensiveQuery(0xdf9d21bff, 0xc0009ee790)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:168 +0x41
github.com/pingcap/tidb/util/expensivequery.(*Handle).Run(0xc001296900)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:67 +0x1d9
created by main.createServer
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/tidb-server/main.go:641 +0x1ab
panic: interface conversion: interface is nil, not core.Plan

goroutine 453 [running]:
github.com/pingcap/tidb/planner/core.GetStatsInfo(0x0, 0x0, 0x9)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/planner/core/util.go:220 +0x45
github.com/pingcap/tidb/util/expensivequery.genLogFields(0xdfbbd0340, 0xc003c42bb0, 0x0, 0xc128acf8763801fc, 0xd38998e6c7)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:112 +0x377
github.com/pingcap/tidb/util/expensivequery.logExpensiveQuery(0xdfbbd0340, 0xc003c42bb0)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:168 +0x41
github.com/pingcap/tidb/util/expensivequery.(*Handle).Run(0xc0011d6e60)
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/util/expensivequery/expensivequery.go:67 +0x1d9
created by main.createServer
        /home/jenkins/agent/workspace/tidb_v4.0.4/go/src/github.com/pingcap/tidb/tidb-server/main.go:641 +0x1ab
[2023/07/27 17:44:20.362 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=96549] [currIdx=0] [leaderStoreID=2]
[2023/07/27 17:44:20.362 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=86413] [currIdx=0] [leaderStoreID=2]
[2023/07/27 17:44:20.364 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=92513] [currIdx=2] [leaderStoreID=1]
[2023/07/27 17:44:31.945 +08:00] [INFO] [set.go:207] ["set session var"] [conn=472] [name=sql_select_limit] [val=50000]
[2023/07/27 17:44:32.824 +08:00] [INFO] [coprocessor.go:902] ["[TIME_COP_PROCESS] resp_time:30.91468386s txnStartTS:443141597919182873 region_id:0 store_addr:172.16.12.161:10080"] [conn=905]
[2023/07/27 17:44:32.824 +08:00] [ERROR] [select_result.go:242] ["invalid cop task execution summaries length"] [expected=2] [received=0]
[2023/07/27 17:44:39.323 +08:00] [INFO] [set.go:207] ["set session var"] [conn=472] [name=sql_select_limit] [val=18446744073709551615]
[2023/07/27 17:44:46.967 +08:00] [INFO] [set.go:207] ["set session var"] [conn=472] [name=sql_select_limit] [val=50000]
[2023/07/27 17:44:49.337 +08:00] [INFO] [set.go:207] ["set session var"] [conn=472] [name=sql_select_limit] [val=18446744073709551615]
[2023/07/27 17:45:17.141 +08:00] [INFO] [printer.go:33] ["Welcome to TiDB."] ["Release Version"=v4.0.4] [Edition=Community] ["Git Commit Hash"=c61fc7247e9f6bc773761946d5b5294d3f2699a5] ["Git Branch"=heads/refs/tags/v4.0.4] ["UTC Build Time"="2020-07-31 07:50:19"] [GoVersion=go1.13] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2023/07/27 17:45:17.142 +08:00] [INFO] [printer.go:47] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"172.16.14.3\",\"port\":4000,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"172.16.14.10:2379,172.16.12.162:2379,172.16.15.227:2379\",\"socket\":\"\",\"lease\":\"45s\",\"run-ddl\":true,\"split-table\":true,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/alidata/tidb/tmp\",\"oom-action\":\"log\",\"mem-quota-query\":1073741824,\"tmp-storage-quota\":-1,\"enable-streaming\":false,\"enable-batch-dml\":false,\"lower-case-table-names\":2,\"server-version\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/alidata/tidb/deploy/tidb-4000/log/tidb.log\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"/alidata/tidb/deploy/tidb-4000/log/tidb_slow_query.log\",\"slow-threshold\":300,\"expensive-threshold\":10000,\"query-log-max-len\":4096,\"record-plan-in-slow-log\":1},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"require-secure-transport\":false,\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10080,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"feedback-probability\":0.05,\"query-feedback-limit\":1024,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-total-size-limit\":2097152000,\"tcp-keep-alive\":true,\"cross-join\":true,\"run-auto-analyze\":true,\"agg-push-down-join\":false,\"committer-concurrency\":16,\"max-txn-ttl\":600000},\"prepared-plan-cache\":{\"enabled\":false,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"commit-timeout\":\"41s\",\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"5s\",\"copr-cache\":{\"enable\":false,\"capacity-mb\":1000,\"admission-max-result-mb\":10,\"admission-min-process-ms\":5}},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"\",\"load\":\"\"},\"pessimistic-txn\":{\"enable\":true,\"max-retry-count\":256},\"check-mb4-value-in-utf8\":true,\"max-index-length\":3072,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"stmt-summary\":{\"enable\":true,\"enable-internal-query\":false,\"max-stmt-count\":200,\"max-sql-length\":4096,\"refresh-interval\":1800,\"history-size\":24},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\
",\"tidb\"]},\"max-server-connections\":0,\"new_collations_enabled_on_first-bootstrap\":false,\"experimental\":{\"allow-expression-index\":false},\"enable-collect-execution-info\":true,\"skip-register-to-dashboard\":false,\"enable-telemetry\":true}"]
[2023/07/27 17:45:17.142 +08:00] [INFO] [main.go:341] ["disable Prometheus push client"]
[2023/07/27 17:45:17.142 +08:00] [INFO] [store.go:68] ["new store"] [path=tikv://172.16.14.10:2379,172.16.12.162:2379,172.16.15.227:2379]
[2023/07/27 17:45:17.143 +08:00] [INFO] [client.go:149] ["[pd] create pd client with endpoints"] [pd-address="[172.16.14.10:2379,172.16.12.162:2379,172.16.15.227:2379]"]
[2023/07/27 17:45:17.143 +08:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2023/07/27 17:45:17.146 +08:00] [INFO] [base_client.go:226] ["[pd] update member urls"] [old-urls="[http://172.16.14.10:2379,http://172.16.12.162:2379,http://172.16.15.227:2379]"] [new-urls="[http://172.16.12.162:2379,http://172.16.14.10:2379,http://172.16.15.227:2379]"]
[2023/07/27 17:45:17.146 +08:00] [INFO] [base_client.go:242] ["[pd] switch leader"] [new-leader=http://172.16.14.10:2379] [old-leader=]
[2023/07/27 17:45:17.146 +08:00] [INFO] [base_client.go:92] ["[pd] init cluster id"] [cluster-id=7222620707440961173]
[2023/07/27 17:45:17.146 +08:00] [INFO] [store.go:74] ["new store with retry success"]
[2023/07/27 17:45:17.155 +08:00] [INFO] [tidb.go:71] ["new domain"] [store=tikv-7222620707440961173] ["ddl lease"=45s] ["stats lease"=3s]
[2023/07/27 17:45:17.159 +08:00] [INFO] [ddl.go:316] ["[ddl] start DDL"] [ID=7b0dd95d-4435-40a1-be17-164fd805d90d] [runWorker=true]
[2023/07/27 17:45:17.159 +08:00] [INFO] [manager.go:185] ["start campaign owner"] [ownerInfo="[ddl] /tidb/ddl/fg/owner"]
[2023/07/27 17:45:17.162 +08:00] [INFO] [ddl.go:305] ["[ddl] start delRangeManager OK"] ["is a emulator"=false]
[2023/07/27 17:45:17.163 +08:00] [INFO] [ddl_worker.go:130] ["[ddl] start DDL worker"] [worker="worker 1, tp general"]
[2023/07/27 17:45:17.163 +08:00] [INFO] [ddl_worker.go:130] ["[ddl] start DDL worker"] [worker="worker 2, tp add index"]
[2023/07/27 17:45:17.334 +08:00] [INFO] [domain.go:145] ["full load InfoSchema success"] [usedSchemaVersion=0] [neededSchemaVersion=1045] ["start time"=134.432038ms]
[2023/07/27 17:45:17.337 +08:00
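
The repeated panic: interface conversion: interface is nil, not core.Plan lines suggest that the expensive-query logger handed a nil plan to planner/core.GetStatsInfo, which type-asserts it without a nil check. I have not verified the v4.0.4 source, so treat the following as a minimal, self-contained Go sketch (with hypothetical Plan and getStatsInfo names) of why such an assertion panics and how a guarded assertion would avoid it:

```go
package main

import "fmt"

// Plan is a stand-in for TiDB's core.Plan interface (hypothetical, for illustration).
type Plan interface {
	ExplainID() string
}

// getStatsInfo mimics a function that blindly asserts its argument to Plan.
// When i is nil, the assertion panics with
// "interface conversion: interface is nil, not main.Plan".
func getStatsInfo(i interface{}) string {
	p := i.(Plan) // panics if i is nil
	return p.ExplainID()
}

// getStatsInfoSafe uses the comma-ok form so a nil argument is handled gracefully.
func getStatsInfoSafe(i interface{}) string {
	p, ok := i.(Plan)
	if !ok || p == nil {
		return "<no plan>"
	}
	return p.ExplainID()
}

func main() {
	fmt.Println(getStatsInfoSafe(nil)) // prints "<no plan>"

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered from:", r)
		}
	}()
	getStatsInfo(nil) // panics; caught by the deferred recover above
}
```

If that is what happens here, any long-running Dashboard query that trips the expensive-query watcher at the wrong moment could bring the tidb-server down, which would match the repeated restarts described in this thread.
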
| username: tidb菜鸟一只 | Original post link

Check your slow log. Is the file large?

| username: WalterWj | Original post link

The cluster version is quite low…

| username: qufudcj | Original post link

Not too big or too many; each file is 300 MB.

| username: qufudcj | Original post link

The cluster was deployed with Ansible and has a Kafka-based replication link; the product is stable and the development team is fairly conservative, so they prefer not to upgrade.

| username: qufudcj | Original post link

Another TiDB cluster with the exact same configuration and version loads instantly, so I’m very puzzled.

| username: redgame | Original post link

Try moving a few of the slow log files out and leaving just one.

| username: 裤衩儿飞上天 | Original post link

  1. For a single TiDB node that is 300 MB × 17 / 1024 ≈ 4.98 GB, nearly 5 GB. With 2 TiDB nodes the slow log volume per node should be similar, so roughly 10 GB in total. Manually clean up the older slow logs, since keeping them doesn’t serve much purpose (a cleanup sketch follows this list).

  2. Regularly optimize the slow queries that need optimizing.
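
A minimal cleanup sketch under stated assumptions: the slow logs live in the log directory from the config shown earlier (/alidata/tidb/deploy/tidb-4000/log), rotated files keep the tidb_slow_query prefix, and a 7-day retention is enough. Adjust the path, prefix, and retention before using anything like this:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

func main() {
	logDir := "/alidata/tidb/deploy/tidb-4000/log" // path taken from the loaded config above; adjust as needed
	keepFor := 7 * 24 * time.Hour                  // retention window: delete rotated slow logs older than 7 days

	entries, err := os.ReadDir(logDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read dir:", err)
		os.Exit(1)
	}
	cutoff := time.Now().Add(-keepFor)
	for _, e := range entries {
		name := e.Name()
		// Only touch rotated slow-log files, never the active tidb_slow_query.log itself.
		if !strings.HasPrefix(name, "tidb_slow_query") || name == "tidb_slow_query.log" {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		if info.ModTime().Before(cutoff) {
			full := filepath.Join(logDir, name)
			if err := os.Remove(full); err != nil {
				fmt.Fprintln(os.Stderr, "remove", full, ":", err)
			} else {
				fmt.Println("removed", full)
			}
		}
	}
}
```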

| username: qufudcj | Original post link

I just tried moving all the log files on all nodes, leaving only a very small one, but the same error still occurs.

| username: 裤衩儿飞上天 | Original post link

All TiDB nodes need to be cleaned up~

Check the logs for any related errors and monitor them.

| username: qufudcj | Original post link

Yes, I have already cleared the logs on the three tidb-servers.

The situation is still the same: every time I click the query button, a random TiDB instance reports an error:

[2023/07/28 10:48:36.558 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=95789] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:48:36.560 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=96185] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:48:37.209 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=93397] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:48:37.229 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=54861] [currIdx=1] [leaderStoreID=2]
[2023/07/28 10:48:49.108 +08:00] [INFO] [server.go:388] ["new connection"] [conn=971] [remoteAddr=172.16.12.162:54160]
[2023/07/28 10:48:50.120 +08:00] [INFO] [coprocessor.go:902] ["[TIME_COP_PROCESS] resp_time:995.561644ms txnStartTS:443157716432846849 region_id:0 store_addr:172.16.15.225:10080"] [conn=971]
[2023/07/28 10:48:50.120 +08:00] [ERROR] [select_result.go:242] ["invalid cop task execution summaries length"] [expected=2] [received=0]
[2023/07/28 10:48:51.618 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=400] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:48:55.276 +08:00] [INFO] [coprocessor.go:902] ["[TIME_COP_PROCESS] resp_time:6.151780699s txnStartTS:443157716432846849 region_id:0 store_addr:172.16.14.3:10080"] [conn=971]
[2023/07/28 10:48:55.276 +08:00] [ERROR] [select_result.go:242] ["invalid cop task execution summaries length"] [expected=2] [received=0]
[2023/07/28 10:49:12.717 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=54861] [currIdx=2] [leaderStoreID=3]
[2023/07/28 10:50:04.501 +08:00] [INFO] [printer.go:33] ["Welcome to TiDB."] ["Release Version"=v4.0.4] [Edition=Community] ["Git Commit Hash"=c61fc7247e9f6bc773761946d5b5294d3f2699a5] ["Git Branch"=heads/refs/tags/v4.0.4] ["UTC Build Time"="2020-07-31 07:50:19"] [GoVersion=go1.13] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]

Then the tidb-server restarts, and the other two tidb-servers do not encounter this error and do not restart.

Each time it is a random tidb-server that reports the error and restarts, so this morning they have taken turns restarting many times.

| username: qufudcj | Original post link

The monitoring shows CPU usage surging on the TiDB server nodes: sometimes on all of them, sometimes on just one or two. Even accounting for the restarts, I can’t find a pattern. There are no other alerts or monitoring anomalies.

| username: 裤衩儿飞上天 | Original post link

It feels like you’ve hit a bug :smiling_imp:
There is a similar post, but their version is 5.x and their TiDB didn’t restart either…
invalid cop task execution summaries length

| username: qufudcj | Original post link

It doesn’t seem quite right… I’ll try restarting the PD component during the off-peak hours tonight and then turn on debug logging.
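
For reference, TiDB’s log level can usually be changed at runtime through the status port’s /settings endpoint (for example curl -X POST -d "log_level=debug" http://<tidb-ip>:10080/settings). The sketch below assumes the default status port 10080 and uses one of the instance IPs from the config above; verify the endpoint against your version before relying on it:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Assumes the default TiDB status port (10080); replace the host with one of your tidb-server IPs.
	endpoint := "http://172.16.14.3:10080/settings"
	resp, err := http.PostForm(endpoint, url.Values{"log_level": {"debug"}})
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```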

| username: foxchan | Original post link

First, please post the following monitoring screenshots:

  1. Connection count on the TiDB monitoring panel
  2. CPU and memory usage of the TiDB instance

| username: qufudcj | Original post link

Okay, I’ll run several tests after the evening business peak and post the results; there is too much interfering noise in the data during the day.

| username: qufudcj | Original post link

After restarting PD, everything recovered and none of the issues have recurred. Here is the monitoring data from 10:48, when the problem previously occurred.