After an abnormal restart of the TiKV node, TiKV appears normal, but the region is missing, and TiDB cannot start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv 节点异常重启后,tikv显示正常,但region is missing,tidb启动不了

| username: 开心大河马

[TiDB Usage Environment] Test
[TiDB Version]
[Reproduction Path] Only 2 TiKV hosts, running 6 TiKV instances in total, with labels applied; the hosts were manually shut down.
[Encountered Problem: Problem Phenomenon and Impact]
[Resource Configuration] Virtual machine, 1 PD node, 2 TiDB nodes (PD and TiDB on the same node), 2 TiKV nodes
[Attachment: Screenshot/Log/Monitoring]

  1. Overall cluster status: Cluster installed using tiup

  2. TiDB startup error:

[2023/07/04 14:01:28.669 +08:00] [WARN] [backoff.go:158] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nmessage:\"region 781 is missing\" region_not_found:<region_id:781 > at 2023-07-04T14:01:27.138043403+08:00\nmessage:\"region 781 is missing\" region_not_found:<region_id:781 > at 2023-07-04T14:01:27.646902214+08:00\nmessage:\"region 781 is missing\" region_not_found:<region_id:781 > at 2023-07-04T14:01:28.158281069+08:00\nlongest sleep type: regionMiss, time: 40010ms"]

[2023/07/04 14:01:28.669 +08:00] [FATAL] [terror.go:309] ["unexpected error"] [error="[tikv:9005]Region is unavailable"] [stack="github.com/pingcap/tidb/parser/terror.MustNil\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:309\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:348\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:241\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"] [stack="github.com/pingcap/tidb/parser/terror.MustNil\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:309\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:348\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:241\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

[2023/07/04 14:01:43.930 +08:00] [INFO] [cpuprofile.go:113] ["parallel cpu profiler started"]

[2023/07/04 14:01:43.931 +08:00] [INFO] [printer.go:48] ["Welcome to TiDB."] ["Release Version"=v7.1.0] [Edition=Community] ["Git Commit Hash"=635a4362235e8a3c0043542e629532e3c7bb2756] ["Git Branch"=heads/refs/tags/v7.1.0] ["UTC Build Time"="2023-05-30 10:58:57"] [GoVersion=go1.20.3] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=6.2.0-alpha]

[2023/07/04 14:01:43.932 +08:00] [INFO] [printer.go:53] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"10.xxx.xxx.181\",\"port\":4000,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"10.xxx.xxx.181:2379\",\"socket\":\"/tmp/tidb-4000.sock\",\"lease\":\"45s\",\"split-table\":true,\"token-limit\":1000,\"temp-dir\":\"/opt/tidb/tmp\",\"tmp-storage-path\":\"/tmp/1000_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage\",\"tmp-storage-quota\":-1,\"server-version\":\"\",\"version-comment\":\"\",\"tidb-edition\":\"\",\"tidb-release-version\":\"\",\"keyspace-name\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/opt/tidb/tidb-deploy/tidb-4000/log/tidb.log\",\"max-size\":300,\"max-days\":30,\"max-backups\":0},\"slow-query-file\":\"/opt/tidb/tidb-deploy/tidb-4000/log/tidb_slow_query.log\",\"expensive-threshold\":10000,\"query-log-max-len\":4096,\"enable-slow-log\":true,\"slow-threshold\":300,\"record-plan-in-slow-log\":1,\"timeout\":0},\"instance\":{\"tidb_general_log\":false,\"tidb_pprof_sql_cpu\":false,\"ddl_slow_threshold\":300,\"tidb_expensive_query_time_threshold\":60,\"tidb_stmt_summary_enable_persistent\":false,\"tidb_stmt_summary_filename\":\"tidb-statements.log\",\"tidb_stmt_summary_file_max_days\":3,\"tidb_stmt_summary_file_max_size\":64,\"tidb_stmt_summary_file_max_backups\":0,\"tidb_enable_slow_log\":true,\"tidb_slow_log_threshold\":1000,\"tidb_record_plan_in_slow_log\":1,\"tidb_check_mb4_value_in_utf8\":true,\"tidb_force_priority\":\"NO_PRIORITY\",\"tidb_memory_usage_alarm_ratio\":0.8,\"tidb_enable_collect_execution_info\":true,\"plugin_dir\":\"/data/deploy/plugin\",\"plugin_load\":\"\",\"max_connections\":2000,\"tidb_enable_ddl\":true,\"tidb_rc_read_check_ts\":false},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\
"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null,\"session-token-signing-cert\":\"\",\"session-token-signing-key\":\"\",\"spilled-file-encryption-method\":\"plaintext\",\"enable-sem\":false,\"auto-tls\":false,\"tls-version\":\"\",\"rsa-key-size\":4096,\"secure-bootstrap\":false,\"auth-token-jwks\":\"\",\"auth-token-refresh-interval\":\"1h0m0s\",\"disconnect-on-expired-password\":true},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10080,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false,\"record-db-label\":false,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-concurrent-streams\":1024,\"grpc-initial-window-size\":2097152,\"grpc-max-send-msg-size\":2147483647},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"server-memory-quota\":0,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"pseudo-estimate-ratio\":0.8,\"bind-info-lease\":\"3s\",\"txn-entry-size-limit\":6291456,\"txn-total-size-limit\":104857600,\"tcp-keep-alive\":true,\"tcp-no-delay\":true,\"cross-join\":true,\"distinct-agg-push-down\":false,\"projection-push-down\":false,\"max-txn-ttl\":3600000,\"index-usage-sync-lease\":\"0s\",\"plan-replayer-gc-lease\":\"10m\",\"gogc\":100,\"enforce-mpp\":false,\"stats-load-concurrency\":5,\"stats-load-queue-size\":1000,\"analyze-partition-concurrency-quota\":16,\"plan-replayer-dump-worker-concurrency\":1,\"enable-stats-cache-mem-quota\":false,\"committer-concurrency\":128,\"run-auto-analyze\":true,\"force-priority\":\"NO_PRIORITY\",\"memory-usage-alarm-ratio\":0.8,\"enable-load-fmsketch\":false,\"lite-init-stats\":false,\"force-init-stats\":false},\"prepared-plan-cache\":{\"enabled\":true,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local
-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5,\"fallbackable\":false},\"pd-client\":{\"pd-server-timeout\":3},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-compression-type\":\"none\",\"commit-timeout\":\"41s\",\"async-commit\":{\"keys-limit\":256,\"total-key-size-limit\":4096,\"safe-window\":2000000000,\"allowed-clock-drift\":500000000},\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"1s\",\"copr-cache\":{\"capacity-mb\":1000},\"ttl-refreshed-txn-size\":33554432,\"resolve-lock-lite-threshold\":16},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"pessimistic-txn\":{\"max-retry-count\":256,\"deadlock-history-capacity\":10,\"deadlock-history-collect-retryable\":false,\"pessimistic-auto-commit\":false,\"constraint-check-in-place-pessimistic\":true},\"max-index-length\":3072,\"index-limit\":64,\"table-column-count-limit\":1017,\"graceful-wait-before-shutdown\":0,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"top-sql\":{\"receiver-address\":\"\"},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"new_collations_enabled_on_first-bootstrap\":true,\"experimental\":{\"allow-expression-index\":false},\"skip-register-to-dashboard\":false,\"enable-telemetry\":false,\"labels\":{},\"enable-global-index\":false,\"deprecate-integer-display-length\":false,\"enable-enum-length-limit\":true,\"stores-refresh-interval\":60,\"enable-tcp4-only\":false,\"enable-forwarding\":false,\"max-ballast-object-size\":0,\"ballast-object-size\":0,\"transaction-summary\":{\"tran
saction-summary-capacity\":500,\"transaction-id-digest-min-duration\":2147483647},\"enable-global-kill\":true,\"initialize-sql-file\":\"\",\"enable-batch-dml\":false,\"mem-quota-query\":1073741824,\"oom-action\":\"log\",\"oom-use-tmp-storage\":true,\"check-mb4-value-in-utf8\":true,\"enable-collect-execution-info\":true,\"plugin\":{\"dir\":\"/data/deploy/plugin\",\"load\":\"\"},\"max-server-connections\":0,\"run-ddl\":true,\"disaggregated-tiflash\":false,\"autoscaler-type\":\"aws\",\"autoscaler-addr\":\"tiflash-autoscale-lb.tiflash-autoscale.svc.cluster.local:8081\",\"is-tiflashcompute-fixed-pool\":false,\"autoscaler-cluster-id\":\"\",\"use-autoscaler\":false,\"tidb-max-reuse-chunk\":64,\"tidb-max-reuse-column\":256,\"tidb-enable-exit-check\":false}"]

[2023/07/04 14:01:43.932 +08:00] [INFO] [main.go:394] ["disable Prometheus push client"]

[2023/07/04 14:01:43.932 +08:00] [INFO] [store.go:76] ["new store"] [path=tikv://10.xxx.xxx.181:2379]

[2023/07/04 14:01:43.932 +08:00] [INFO] [client.go:311] ["[pd] create pd client with endpoints and keyspace"] [pd-address="[10.xxx.xxx.181:2379]"] [keyspace-id=0]

[2023/07/04 14:01:43.932 +08:00] [INFO] [systime_mon.go:26] ["start system time monitor"]

[2023/07/04 14:01:43.932 +08:00] [ERROR] [cpu.go:65] [GetCgroupCPU] [error="no cpu controller detected"]

[2023/07/04 14:01:43.936 +08:00] [INFO] [pd_service_discovery.go:543] ["[pd] switch leader"] [new-leader=http://10.xxx.xxx.181:2379] [old-leader=]

[2023/07/04 14:01:43.936 +08:00] [INFO] [pd_service_discovery.go:175] ["[pd] init cluster id"] [cluster-id=7232570346524813659]

[2023/07/04 14:01:43.936 +08:00] [INFO] [client.go:386] ["[pd] changing service mode"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]

[2023/07/04 14:01:43.936 +08:00] [INFO] [tso_client.go:230] ["[tso] switch dc tso allocator serving address"] [dc-location=global] [new-address=http://10.xxx.xxx.181:2379]

[2023/07/04 14:01:43.937 +08:00] [INFO] [tso_dispatcher.go:290] ["[tso] tso dispatcher created"] [dc-location=global]

[2023/07/04 14:01:43.937 +08:00] [INFO] [client.go:428] ["[pd] service mode changed"] [old-mode=PD_SVC_MODE] [new-mode=PD_SVC_MODE]

[2023/07/04 14:01:43.938 +08:00] [INFO] [tikv_driver.go:221] ["using API V1."]

[2023/07/04 14:01:43.939 +08:00] [INFO] [store.go:82] ["new store with retry success"]
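
The WARN entry in the log above repeats the ID of the region that TiDB could not reach. As a minimal sketch (the helper name `missing_region_ids` is hypothetical, and the pattern assumes the message text keeps the form "region N is missing" shown above), the distinct IDs can be pulled out of a tidb.log excerpt like this:

```python
import re

# Matches the message text as it appears inside the WARN entry.
MISSING_RE = re.compile(r"region (\d+) is missing")


def missing_region_ids(log_text):
    """Return the distinct region IDs reported as missing, in ascending order."""
    return sorted({int(m.group(1)) for m in MISSING_RE.finditer(log_text)})


# Shortened copy of the WARN entry from the log above.
sample = (
    '[2023/07/04 14:01:28.669 +08:00] [WARN] [backoff.go:158] '
    '["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\\n'
    'message:\\"region 781 is missing\\" region_not_found:<region_id:781 > '
    'at 2023-07-04T14:01:27.138043403+08:00"]'
)
print(missing_region_ids(sample))  # [781]
```
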
  3. Specific situation of the missing region:
    » region 781
    {
      "id": 781,
      "start_key": "6D00000000000000F8",
      "end_key": "6E00000000000000F8",
      "epoch": {
        "conf_ver": 299951,
        "version": 145
      },
      "peers": [
        {
          "id": 1112034,
          "store_id": 586630,
          "role_name": "Voter"
        },
        {
          "id": 1112132,
          "store_id": 586632,
          "role_name": "Voter"
        },
        {
          "id": 1112141,
          "store_id": 586634,
          "role_name": "Voter"
        }
      ],
      "leader": {
        "role_name": ""
      },
      "cpu_usage": 0,
      "written_bytes": 0,
      "read_bytes": 0,
      "written_keys": 0,
      "read_keys": 0,
      "approximate_size": 0,
      "approximate_keys": 0
    }
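
The start_key and end_key in the output above are hex strings. The sketch below decodes them under the assumption that they use TiKV's memcomparable byte encoding (8-byte groups, each followed by a marker byte equal to 0xFF minus the number of padding bytes). The helper name `decode_memcomparable` is hypothetical and not part of any TiDB tool.

```python
def decode_memcomparable(hex_key):
    """Decode a hex string of 9-byte groups (8 data bytes + 1 marker byte)."""
    raw = bytes.fromhex(hex_key)
    if len(raw) % 9 != 0:
        raise ValueError("encoded key length must be a multiple of 9 bytes")
    out = bytearray()
    for i in range(0, len(raw), 9):
        group, marker = raw[i:i + 8], raw[i + 8]
        pad = 0xFF - marker          # number of zero padding bytes in this group
        if pad > 8:
            raise ValueError("invalid marker byte")
        out += group[:8 - pad]
        if pad > 0:                  # a padded group is always the last one
            break
    return bytes(out)


print(decode_memcomparable("6D00000000000000F8"))  # b'm'
print(decode_memcomparable("6E00000000000000F8"))  # b'n'
```

If that assumption holds, region 781 covers the key range from "m" to "n". TiDB keeps its metadata under the "m" prefix, which would be consistent with TiDB failing at startup while this one region is unavailable.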
| username: 开心大河马 | Original post link

Help, how should I handle this? It looks as though the region simply hasn't elected a leader, rather than actually being lost?

| username: h5n1 | Original post link

A majority has to be available. With 6 TiKVs, losing 1 machine takes out 3 TiKVs, so what remains is no longer more than half. First scale out one more TiKV on another node.

| username: 开心大河马 | Original post link

The host maintenance did not take the lower layers into account, and all the hosts were shut down directly. They are all back up now and TiKV shows as normal, but TiDB reports an error when starting.

| username: h5n1 | Original post link

At least 3 hosts are needed to ensure high availability.
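
The three-host recommendation follows from Raft's majority rule. The arithmetic below is a sketch under stated assumptions: 3 replicas per region (region 781 above lists three Voter peers) and replicas spread as evenly as possible across hosts. The function names are made up for illustration.

```python
import math


def quorum(replicas):
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def worst_case_survivors(replicas, hosts):
    """Replicas of one region left after losing the host that holds the most of them,
    assuming the replicas are spread as evenly as possible across hosts."""
    most_on_one_host = math.ceil(replicas / hosts)
    return replicas - most_on_one_host


for hosts in (2, 3):
    left = worst_case_survivors(3, hosts)
    print(hosts, left, left >= quorum(3))
# 2 1 False
# 3 2 True
```

With 2 hosts, one host necessarily holds 2 of the 3 replicas of a region, so losing that host leaves 1 replica, which is below the quorum of 2. With 3 hosts, each host holds 1 replica and a single host failure still leaves a majority.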

| username: 开心大河马 | Original post link

These are test machines; normally it wouldn't be set up this way.

| username: tidb狂热爱好者 | Original post link

Find ucan in the group.

| username: redgame | Original post link

Troubleshooting direction: check the IP address, port number, PD address, etc., and make sure that TiDB can connect to TiKV normally.

| username: Timber | Original post link

You can try to rebuild the region on a TiKV node. First, shut down a TiKV instance; then, on the machine hosting that instance, run ./tikv-ctl --data-dir /data --config=/tikv.toml recreate-region -p {PD_ADDR} -r {region_id}. After it succeeds, restart the instance.

| username: 我是咖啡哥 | Original post link

  1. Find regions without a leader

Run "region" in pd-ctl. Use a JSON formatting tool to format the result, then find the region id and the store ids of each region whose leader shows "role_name":"".

  2. Install the tikv-ctl package on the TiKV node
    Find the installation package ctl-v7.1.0-linux-amd64.tar.gz, extract it, locate tikv-ctl, and scp it to the TiKV node.

  3. Stop the TiKV node

tiup cluster stop test -N 192.168.1.184:20164

  4. Use the tikv-ctl tool to recreate the corresponding region, paying attention to the data directory
./tikv-ctl --data-dir /opt/tidb/tidb-data/tikv-20165 --config=/opt/tidb/tidb-deploy/tikv-20165/conf/tikv.toml recreate-region -p 192.168.1.181:2379 -r 781

If the output is "success", the region was recreated. If it fails, try executing the command on another store.

  5. Start the TiKV node
tiup cluster start test -N 192.168.1.184:20164

  6. After all regions are recreated, start TiDB, and it's done.
./tikv-ctl --data-dir /opt/tidb/tidb-data/tikv-20164 --config=/opt/tidb/tidb-deploy/tikv-20164/conf/tikv.toml recreate-region -p 192.168.1.181:2379 -r 781
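
The steps above can be sketched as a small dry-run helper. It only parses JSON and prints the commands; it does not execute anything. The function names are hypothetical, the input is assumed to be the JSON printed by the pd-ctl "region" command with a top-level "regions" list, and the cluster name, addresses, and paths are the ones quoted in this reply.

```python
import json
import shlex


def leaderless_regions(pd_region_output):
    """Return (region_id, [store_ids]) for every region whose leader has no store_id."""
    data = json.loads(pd_region_output)
    result = []
    for region in data.get("regions", []):
        if not (region.get("leader") or {}).get("store_id"):
            stores = [peer["store_id"] for peer in region.get("peers", [])]
            result.append((region["id"], stores))
    return result


def recovery_commands(cluster, node, data_dir, config, pd_addr, region_id):
    """Build the stop / recreate-region / start sequence as shell command strings."""
    recreate = ["./tikv-ctl", "--data-dir", data_dir, f"--config={config}",
                "recreate-region", "-p", pd_addr, "-r", str(region_id)]
    return [
        shlex.join(["tiup", "cluster", "stop", cluster, "-N", node]),
        shlex.join(recreate),
        shlex.join(["tiup", "cluster", "start", cluster, "-N", node]),
    ]


# Trimmed sample shaped like the region 781 output from the original post.
sample = json.dumps({"count": 1, "regions": [{
    "id": 781,
    "peers": [{"id": 1112034, "store_id": 586630, "role_name": "Voter"},
              {"id": 1112132, "store_id": 586632, "role_name": "Voter"},
              {"id": 1112141, "store_id": 586634, "role_name": "Voter"}],
    "leader": {"role_name": ""},
}]})

for region_id, stores in leaderless_regions(sample):
    print("region", region_id, "has no leader; peers on stores", stores)
    for cmd in recovery_commands(
            "test", "192.168.1.184:20164",
            "/opt/tidb/tidb-data/tikv-20164",
            "/opt/tidb/tidb-deploy/tikv-20164/conf/tikv.toml",
            "192.168.1.181:2379", region_id):
        print("  " + cmd)
```

Printing the commands rather than running them keeps the final decision manual, since recreate-region rewrites region data on the stopped store.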
| username: cassblanca | Original post link

Learned something new.
