TIKV_REGION_STATUS can only be queried using the first PD

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TIKV_REGION_STATUS只能使用第一个PD进行查询

| username: guoyanliang

Bug Report
There are 3 nodes in the PD cluster. When a test data center went down, one PD node was lost. Querying data tables still works normally, but when checking the region distribution by executing the SQL select * from TIKV_REGION_STATUS;, the query fails (error screenshot omitted). Does querying this table only get routed to the first PD node?
[TiDB Version]
v5.4.0
[Impact of the Bug]
When the first PD node is lost, select * from TIKV_REGION_STATUS; can no longer return data.
[Possible Steps to Reproduce the Issue]

[Observed Unexpected Behavior]

[Expected Behavior]

[Related Components and Specific Versions]

[Other Background Information or Screenshots]

| username: xfworld | Original post link

Before executing this query, did the PD cluster already have a leader, or was an election still in progress?

| username: Hacker007 | Original post link

Still synchronizing metadata? In the middle of an election?

| username: ddhe9527 | Original post link

There is indeed a problem, and it can be reproduced regardless of whether the down PD node is the Leader or not.

| username: yilong | Original post link

  1. I tested it, and the issue can be reproduced. The following error stack is printed:
    [2022/06/24 10:27:47.896 +08:00] [INFO] [conn.go:1115] ["command dispatched failed"] [conn=7] [connInfo="id:7, addr:172.xxx.xx.136:55676 status:10, collation:utf8_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select * from TIKV_REGION_STATUS"] [txn_mode=PESSIMISTIC] [err="Get \"http://172.xxx.xx.162:18279/pd/api/v1/regions\": dial tcp 172.xxx.xx.162:18279: connect: connection refused
    github.com/pingcap/errors.AddStack
    \t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174
    github.com/pingcap/errors.Trace
    \t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/juju_adaptor.go:15
    github.com/pingcap/tidb/store/helper.(*Helper).requestPD
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/helper/helper.go:813
    github.com/pingcap/tidb/store/helper.(*Helper).GetRegionsInfo
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/store/helper/helper.go:771
    github.com/pingcap/tidb/executor.(*memtableRetriever).setDataForTiKVRegionStatus
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/infoschema_reader.go:1449
    github.com/pingcap/tidb/executor.(*memtableRetriever).retrieve
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/infoschema_reader.go:141
    github.com/pingcap/tidb/executor.(*MemTableReaderExec).Next
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/memtable_reader.go:118
    github.com/pingcap/tidb/executor.Next
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/executor.go:286
    github.com/pingcap/tidb/executor.(*recordSet).Next
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/executor/adapter.go:149
    github.com/pingcap/tidb/server.(*tidbResultSet).Next
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/driver_tidb.go:312
    github.com/pingcap/tidb/server.(*clientConn).writeChunks
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:2165
    github.com/pingcap/tidb/server.(*clientConn).writeResultset
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:2116
    github.com/pingcap/tidb/server.(*clientConn).handleStmt
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1994
    github.com/pingcap/tidb/server.(*clientConn).handleQuery
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1841
    github.com/pingcap/tidb/server.(*clientConn).dispatch
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1336
    github.com/pingcap/tidb/server.(*clientConn).Run
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/conn.go:1091
    github.com/pingcap/tidb/server.(*Server).onConn
    \t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:548
    runtime.goexit
    \t/usr/local/go/src/runtime/asm_amd64.s:1371"]
  2. Initially, it seems that the code here accesses the information of the down PD node: the retry only covers building the request, while the actual HTTP call is not retried against another member (a fix sketch follows this list).
    for _, host := range pdHosts {
        req, err = http.NewRequest(method, util.InternalHTTPSchema()+"://"+host+uri, body)
        if err != nil {
            // Try to request from another PD node when some nodes may down.
            if strings.Contains(err.Error(), "connection refused") {
                continue
            }
            return errors.Trace(err)
        }
    }
    if err != nil {
        return err
    }
    start = time.Now()
    resp, err := util.InternalHTTPClient().Do(req)
    if err != nil {
        return errors.Trace(err)
    }
  3. Submitted an issue: When PD node down, can not query TIKV_REGION_STATUS · Issue #35708 · pingcap/tidb · GitHub
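
For reference, the general direction of a fix is to retry the whole HTTP round trip against every PD member, not just the request construction. Below is a minimal, self-contained sketch of that idea; it is not the actual patch linked in the issue above, and requestFirstAvailablePD plus the placeholder addresses are made up for illustration.

    package main

    import (
        "fmt"
        "net/http"
        "strings"
        "time"
    )

    // requestFirstAvailablePD tries each PD host in turn and returns the first
    // successful response. Unlike the snippet above, the retry wraps the actual
    // HTTP call, so a member that refuses connections is skipped instead of
    // failing the whole query.
    func requestFirstAvailablePD(schema, uri string, pdHosts []string) (*http.Response, error) {
        client := &http.Client{Timeout: 5 * time.Second}
        var lastErr error
        for _, host := range pdHosts {
            req, err := http.NewRequest(http.MethodGet, schema+"://"+host+uri, nil)
            if err != nil {
                lastErr = err
                continue
            }
            resp, err := client.Do(req)
            if err != nil {
                // A down PD node typically surfaces as "connection refused":
                // move on to the next member instead of giving up.
                if strings.Contains(err.Error(), "connection refused") {
                    lastErr = err
                    continue
                }
                return nil, err
            }
            return resp, nil
        }
        return nil, fmt.Errorf("all PD hosts unreachable, last error: %v", lastErr)
    }

    func main() {
        // Placeholder addresses; in TiDB these would come from the PD member list.
        hosts := []string{"172.16.0.1:2379", "172.16.0.2:2379"}
        resp, err := requestFirstAvailablePD("http", "/pd/api/v1/regions", hosts)
        if err != nil {
            fmt.Println("query failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }
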
| username: yilong | Original post link

Additionally, if you actually run into this issue, the workaround is:

  1. Use the command tiup ctl:v5.4.0 pd -u xxxxxxx to check the member information of the down node with pd-ctl.
  2. Use member delete id or member delete name to delete the down member; after that, the table can be queried again.
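
A possible command sequence looks like the following; the PD address and member name are placeholders, and the actual name or ID should be taken from the member listing first.

    # List PD members and identify the down node
    tiup ctl:v5.4.0 pd -u http://<alive-pd-address>:2379 member

    # Remove the down member by name (or by ID with: member delete id <id>)
    tiup ctl:v5.4.0 pd -u http://<alive-pd-address>:2379 member delete name <down-member-name>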

| username: ddhe9527 | Original post link

Yes, deleting the faulty PD node can solve the problem.
