[TiDB Usage Environment] Production Environment / Test / PoC
[TiDB Version] v7.1.1
[Reproduction Path] Operations performed that led to the issue
PD and TiDB are on the same machine. After reinstalling the operating system, I used pd-recover to restore PD and created a new cluster, then upgraded the version from 6.5.2 to 7.1.1.
[Encountered Issue: Problem Phenomenon and Impact]
Creating a new table and then executing SQL against it reports a PD server timeout error, while old tables work fine.
[2023/09/26 10:04:42.147 +08:00] [ERROR] [manager.go:328] [“task manager error”] [dist_task_manager=192.168.1.70:4000] [error=“[tikv:9001]PD server timeout: “]
[2023/09/26 10:04:42.330 +08:00] [WARN] [backoff.go:158] [“pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nregion not found for key \"748000000000002B0E5F720000000000000000\" at 2023-09-26T10:04:34.27196196+08:00\nregion not found for key \"748000000000002B0E5F720000000000000000\" at 2023-09-26T10:04:36.85500499+08:00\nregion not found for key \"748000000000002B0E5F720000000000000000\" at 2023-09-26T10:04:39.470411601+08:00\nlongest sleep type: pdRPC, time: 12058ms”]
[2023/09/26 10:04:42.331 +08:00] [INFO] [conn.go:1184] [“command dispatched failed”] [conn=5165622175524192701] [connInfo=“id:5165622175524192701, addr:192.168.1.1:55895 status:10, collation:utf8_general_ci, user:root”] [command=Query] [status=“inTxn:0, autocommit:1”] [sql=“SELECT * FROM yqc_data_count.kafka_date_check LIMIT 0,1000”] [txn_mode=PESSIMISTIC] [timestamp=444515973925699585] [err=”[tikv:9001]PD server timeout: \ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/normalize.go:164\ngithub.com/pingcap/tidb/store/driver/error.ToTiDBErr\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/driver/error/error.go:112\ngithub.com/pingcap/tidb/store/copr.(*RegionCache).SplitKeyRangesByLocations\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/region_cache.go:139\ngithub.com/pingcap/tidb/store/copr.(*RegionCache).SplitKeyRangesByBuckets\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/region_cache.go:187\ngithub.com/pingcap/tidb/store/copr.buildCopTasks\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/coprocessor.go:335\ngithub.com/pingcap/tidb/store/copr.(*CopClient).BuildCopIterator.func3\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/coprocessor.go:148\ngithub.com/pingcap/tidb/kv.(*KeyRanges).ForEachPartitionWithErr\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/kv/kv.go:456\ngithub.com/pingcap/tidb/store/copr.(*CopClient).BuildCopIterator\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/coprocessor.go:162\ngithub.com/pingcap/tidb/store/copr.(*CopClient).Send\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/coprocessor.go:92\ngithub.com/pingcap/tidb/distsql.Select\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/distsql/distsql.go:99\ngithub.com/pingcap/tidb/distsql.SelectWithRuntimeStats\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/distsql/distsql.go:149\ngithub.com/pingcap/tidb/executor.selectResultHook.SelectResult\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/table_reader.go:59\ngithub.com/pingcap/tidb/executor.(*TableReaderExecutor).buildResp\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/table_reader.go:347\ngithub.com/pingcap/tidb/executor.(*TableReaderExecutor).Open\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/table_reader.go:224\ngithub.com/pingcap/tidb/executor.(*baseExecutor).Open\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:204\ngithub.com/pingcap/tidb/executor.(*LimitExec).Open\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:1492\ngithub.com/pingcap/tidb/executor.(*ExecStmt).openExecutor\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/adapter.go:1195\ngithub.com/pingcap/tidb/executor.(*ExecStmt).Exec\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/adapter.go:531\ngithub.com/pingcap/tidb/session.runStmt\n\t/home/jenkins/agent
/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2394\ngithub.com/pingcap/tidb/session.(*session).ExecuteStmt\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2251\ngithub.com/pingcap/tidb/server.(*TiDBContext).ExecuteStmt\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/driver_tidb.go:294\ngithub.com/pingcap/tidb/server.(*clientConn).handleStmt\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:2094\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1885\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1372\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1153\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:677\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598”]
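The "region not found for key" messages point at one specific table: the key 748000000000002B0E5F72... looks like TiDB's record-key encoding t{table_id}_r, and 0x2B0E decodes to table ID 11022. As a rough sanity check (the table ID here is only read off this log line, so adjust it for your cluster; the table name below is the one from the failing query), something like this can show which table it is and whether any Region actually covers it:
-- which table does table ID 11022 belong to?
SELECT table_schema, table_name FROM information_schema.tables WHERE tidb_table_id = 11022;
-- does PD/TiKV report any Region for that table?
SELECT region_id, start_key, end_key FROM information_schema.tikv_region_status WHERE table_id = 11022;
-- or, for the table named in the failing query:
SHOW TABLE yqc_data_count.kafka_date_check REGIONS;
If these return nothing for the new table while old tables do return Regions, that could suggest PD is missing Region metadata for the new table's key range, which would be consistent with the pd-recover history.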
Do you mean that your original cluster was on 6.5.2, and that you then used tiup cluster deploy yqc v7.1.1 ./prod.yaml to deploy 7.1.1 components managed by tiup, rather than upgrading through tiup cluster upgrade?
tiup cluster start yqc --init
When starting the cluster, I found that PD and TiKV could not connect, so I checked the official documentation and found that pd-recover could restore it. After following that procedure, the cluster started normally. Today, when I created a new table and executed SQL against it, a PD server timeout occurred.
Isn’t it written in your TiDB log?
Insufficient disk space for /ext/tmp/tmp_ddl-4000:
[error=“the available disk space(45788008448) in /ext/tmp/tmp_ddl-4000 should be greater than @@tidb_ddl_disk_quota(107374182400)”]
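That works out to roughly 42.6 GiB free (45788008448 bytes) in /ext/tmp/tmp_ddl-4000, which is below the 100 GiB (107374182400 bytes) default of tidb_ddl_disk_quota, so the fast-reorg temp-space check fails.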
Your SQL
[sql=“select pid,count(id) dim_num,‘certificationsummary’ as dim_type,DATE_FORMAT(CURDATE(), ‘%Y-%m-%d’) as pdate\r\nfrom bu_cert_info\r\nwhere is_deleted = 0\r\ngroup by pid”]
Error: insufficient memory
[err="Your query has been cancelled due to exceeding the allowed memory limit for a single SQL query. Please try narrowing your query scope or increase the tidb_mem_quota_query limit and try again.
Complete output log
[2023/09/26 09:59:17.196 +08:00] [INFO] [conn.go:1184] [“command dispatched failed”] [conn=5165622175524192659] [connInfo=“id:5165622175524192659, addr:192.168.1.243:57240 status:10, collation:utf8_general_ci, user:root”] [command=Query] [status=“inTxn:0, autocommit:1”] [sql=“select pid,count(id) dim_num,‘certificationsummary’ as dim_type,DATE_FORMAT(CURDATE(), ‘%Y-%m-%d’) as pdate\r\nfrom bu_cert_info\r\nwhere is_deleted = 0\r\ngroup by pid”] [txn_mode=PESSIMISTIC] [timestamp=444515888217980929] [err="Your query has been cancelled due to exceeding the allowed memory limit for a single SQL query. Please try narrowing your query scope or increase the tidb_mem_quota_query limit and try again.
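That memory error is a separate issue from the PD timeout: the GROUP BY query simply exceeded the per-query memory quota. A minimal sketch for checking and raising it (the 4 GiB value is only an example):
SHOW VARIABLES LIKE 'tidb_mem_quota_query';   -- per-query memory limit, 1 GiB by default
SET GLOBAL tidb_mem_quota_query = 4294967296; -- e.g. 4 GiB; new connections pick this up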
Doing it this way is prone to issues. It is better to upgrade the cluster through tiup, as officially recommended; the cluster version and the versions of its components should not be inconsistent.
tiup list --verbose
Check if there are inconsistencies in the version numbers of your components.
You can also look at the information_schema.cluster_info view!
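For example, something like the following lists the version and git hash that every component reports, so any mismatch stands out:
SELECT type, instance, version, git_hash FROM information_schema.cluster_info;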
Run set global tidb_ddl_enable_fast_reorg=off; to turn off the index acceleration feature first. This feature requires the tidb_ddl_disk_quota parameter to be at least 100 GiB, and I see that your available space is nowhere near enough…
Is it still giving an error? Check the logs when the DDL statement is executed to see what error is reported. Also, after changing the parameter above with global, you need to reconnect so that a new session picks up the new value. If you want it to take effect in the current session, use set tidb_ddl_enable_fast_reorg=off; without global.
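Putting that together, a minimal sketch (statement forms only, nothing here is specific to your cluster):
SHOW VARIABLES LIKE 'tidb_ddl_disk_quota';        -- default and minimum are both 100 GiB
SHOW VARIABLES LIKE 'tidb_ddl_enable_fast_reorg';
SET GLOBAL tidb_ddl_enable_fast_reorg = OFF;      -- new sessions pick this up after reconnecting
SET tidb_ddl_enable_fast_reorg = OFF;             -- current session only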
[2023/09/27 08:42:16.115 +08:00] [WARN] [backoff.go:158] [“pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nregion not found for key \"748000000000002B1A5F720000000000000000\" at 2023-09-27T08:42:09.556344803+08:00\nregion not found for key \"748000000000002B1A5F720000000000000000\" at 2023-09-27T08:42:11.663201037+08:00\nregion not found for key \"748000000000002B1A5F720000000000000000\" at 2023-09-27T08:42:14.544404707+08:00\nlongest sleep type: pdRPC, time: 10712ms”]
[2023/09/27 08:42:16.116 +08:00] [INFO] [conn.go:1184] [“command dispatched failed”] [conn=5165622175524193315] [connInfo=“id:5165622175524193315, addr:192.168.1.1:61524 status:10, collation:utf8_general_ci, user:root”] [command=Query] [status=“inTxn:0, autocommit:1”] [sql=“SELECT * FROM yqc_data_count.kafka_date_check LIMIT 0,1000”] [txn_mode=PESSIMISTIC] [timestamp=444537326891696131] [err=“[tikv:9001]PD server timeout: \ngithub.com/pingcap/errors.AddStack\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/normalize.go:164\ngithub.com/pingcap/tidb/store/driver/error.ToTiDBErr\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/driver/error/error.go:112\ngithub.com/pingcap/tidb/store/copr.(*RegionCache).SplitKeyRangesByLocations\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/region_cache.go:139\ngithub.com/pingcap/tidb/store/copr.(*RegionCache).SplitKeyRangesByBuckets\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/region_cache.go:187\ngithub.com/pingcap/tidb/store/copr.buildCopTasks\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/coprocessor.go:335\ngithub.com/pingcap/tidb/store/copr.(*CopClient).BuildCopIterator.func3\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/coprocessor.go:148\ngithub.com/pingcap/tidb/kv.(*KeyRanges).ForEachPartitionWithErr\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/kv/kv.go:456\ngithub.com/pingcap/tidb/store/copr.(*CopClient).BuildCopIterator\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/coprocessor.go:162\ngithub.com/pingcap/tidb/store/copr.(*CopClient).Send\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/copr/coprocessor.go:92\ngithub.com/pingcap/tidb/distsql.Select\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/distsql/distsql.go:99\ngithub.com/pingcap/tidb/distsql.SelectWithRuntimeStats\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/distsql/distsql.go:149\ngithub.com/pingcap/tidb/executor.selectResultHook.SelectResult\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/table_reader.go:59\ngithub.com/pingcap/tidb/executor.(*TableReaderExecutor).buildResp\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/table_reader.go:347\ngithub.com/pingcap/tidb/executor.(*TableReaderExecutor).Open\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/table_reader.go:224\ngithub.com/pingcap/tidb/executor.(*baseExecutor).Open\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:204\ngithub.com/pingcap/tidb/executor.(*LimitExec).Open\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:1492\ngithub.com/pingcap/tidb/executor.(*ExecStmt).openExecutor\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/adapter.go:1195\ngithub.com/pingcap/tidb/executor.(*ExecStmt).Exec\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/adapter.go:531\ngithub.com/pingcap/tidb/session.runStmt\n\t/home/jenkins/agent
/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2394\ngithub.com/pingcap/tidb/session.(*session).ExecuteStmt\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2251\ngithub.com/pingcap/tidb/server.(*TiDBContext).ExecuteStmt\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/driver_tidb.go:294\ngithub.com/pingcap/tidb/server.(*clientConn).handleStmt\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:2094\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1885\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1372\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1153\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:677\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598”]
[2023/09/27 08:42:16.383 +08:00] [WARN] [backoff.go:158] [“pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:09.952832129+08:00\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:11.926583009+08:00\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:14.759066156+08:00\nlongest sleep type: pdRPC, time: 11427ms”]
[2023/09/27 08:42:16.384 +08:00] [ERROR] [manager.go:328] [“task manager error”] [dist_task_manager=192.168.1.70:4000] [error=“[tikv:9001]PD server timeout: “]
[2023/09/27 08:42:16.594 +08:00] [WARN] [backoff.go:158] [“pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nregion not found for key \"748000000000001D6C5F6980000000000000010163616E63656C6C69FF6E67000000000000F9\" at 2023-09-27T08:42:09.711039983+08:00\nregion not found for key \"748000000000001D6C5F6980000000000000010163616E63656C6C69FF6E67000000000000F9\" at 2023-09-27T08:42:11.279578991+08:00\nregion not found for key \"748000000000001D6C5F6980000000000000010163616E63656C6C69FF6E67000000000000F9\" at 2023-09-27T08:42:14.048524485+08:00\nlongest sleep type: pdRPC, time: 12035ms”]
[2023/09/27 08:42:16.594 +08:00] [WARN] [dispatcher.go:160] [“get unfinished(pending, running or reverting) tasks failed”] [error=”[tikv:9001]PD server timeout: “]
[2023/09/27 08:42:16.948 +08:00] [WARN] [backoff.go:158] [“pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:10.879404659+08:00\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:13.18534525+08:00\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:14.743313155+08:00\nlongest sleep type: pdRPC, time: 10965ms”]
[2023/09/27 08:42:16.949 +08:00] [ERROR] [manager.go:328] [“task manager error”] [dist_task_manager=192.168.1.70:4000] [error=”[tikv:9001]PD server timeout: “]
[2023/09/27 08:42:23.022 +08:00] [INFO] [gc_worker.go:442] [”[gc worker] starts the whole job”] [uuid=62b3cd5e3e00005] [safePoint=444537174219030528] [concurrency=3]
[2023/09/27 08:42:23.025 +08:00] [INFO] [gc_worker.go:1250] [“[gc worker] start resolve locks”] [uuid=62b3cd5e3e00005] [safePoint=444537174219030528] [try-resolve-locks-ts=444537174219030528] [concurrency=3]
[2023/09/27 08:42:23.025 +08:00] [INFO] [range_task.go:137] [“range task started”] [name=resolve-locks-runner] [startKey=] [endKey=] [concurrency=3]
[2023/09/27 08:42:27.640 +08:00] [WARN] [backoff.go:158] [“pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:20.591541654+08:00\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:22.6222926+08:00\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:25.547932947+08:00\nlongest sleep type: pdRPC, time: 11246ms”]
[2023/09/27 08:42:27.640 +08:00] [ERROR] [manager.go:328] [“task manager error”] [dist_task_manager=192.168.1.70:4000] [error="[tikv:9001]PD server timeout: “]
[2023/09/27 08:42:27.762 +08:00] [WARN] [backoff.go:158] [“pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:20.212388224+08:00\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:23.17796901+08:00\nregion not found for key \"748000000000001D6C5F69800000000000000101726576657274696EFF6700000000000000F8\" at 2023-09-27T08:42:26.146820825+08:00\nlongest sleep type: pdRPC, time: 10805ms”]
[2023/09/27 08:42:27.762 +08:00] [ERROR] [manager.go:328] [“task manager error”] [dist_task_manager=192.168.1.70:4000] [error=”[tikv:9001]PD server timeout: “]
[2023/09/27 08:42:29.351 +08:00] [WARN] [backoff.go:158] [“pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nregion not found for key \"748000000000001D6C5F6980000000000000010163616E63656C6C69FF6E67000000000000F9\" at 2023-09-27T08:42:20.580868359+08:00\nregion not found for key \"748000000000001D6C5F6980000000000000010163616E63656C6C69FF6E67000000000000F9\" at 2023-09-27T08:42:23.509420725+08:00\nregion not found for key \"748000000000001D6C5F6980000000000000010163616E63656C6C69FF6E67000000000000F9\" at 2023-09-27T08:42:26.399567688+08:00\nlongest sleep type: pdRPC, time: 12748ms”]
[2023/09/27 08:42:29.352 +08:00] [WARN] [dispatcher.go:160] [“get unfinished(pending, running or reverting) tasks failed”] [error=”[tikv:9001]PD server timeout: "]
[2023/09/27 08:42:33.584 +08:00] [WARN] [backoff.go:158] [“pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nregion not found for key \"748000000000001D665F720000000000000000\" at 2023-09-27T08:42:26.891839985+08:00\nregion not found for key \"748000000000001D665F720000000000000000\" at 2023-09-27T08:42:28.764991961+08:00\nregion not found for key \"748000000000001D665F720000000000000000\" at 2023-09-27T08:42:30.607323161+08:00\nlongest sleep type: pdRPC, time: 10842ms”]
[2023/09/27 08:42:33.584 +08:00] [WARN] [task_manager.go:280] [“fail to peek scan task”] [ttl-worker=job-manager] [ttl-worker=task-manager] [error="execute sql: SELECT LOW_PRIORITY\n\tjob_id,\n\ttable_id,\n\tscan_id,\n\tscan_range_start,\n\tscan_range_end,\n\texpire_time,\n\towner_id,\n\towner
There may be an issue with your cluster's version. I suggest exporting the data, recreating the cluster, and then importing the data back in.