TiDB Server Fails to Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB Server拉不起来

| username: robert233

【TiDB Usage Environment】

  • Testing

【TiDB Version】

  • v5.1.4

【Encountered Issue】

  • 3 servers:
    3 TiDB + 3 PD + 3 TiKV
  • Background:
    One of the data disks was wiped, and it happened to be on the TiUP control machine (that machine went down completely).
  • Recovery process: I manually rebuilt topology.yaml from the cluster information and deployed the cluster by hand without starting it. Some PD nodes would not start, so I recovered them with the pd-recover tool. I then scaled out TiKV (from 2 healthy nodes to 3) and tried to bring TiDB back, but it failed to start with the following error:
[2022/10/28 17:45:02.538 +08:00] [FATAL] [terror.go:276] ["unexpected error"] [error="[privilege:8049]mysql.user"] [stack="github.com/pingcap/parser/terror.MustNil\n\t/root/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20210618053735-57843e8185c4/terror/terror.go:276\nmain.createStoreAndDomain\n\t/var/lib/docker/jenkins/workspace/build-common@4/go/src/github.com/pingcap/tidb/tidb-server/main.go:276\nmain.main\n\t/var/lib/docker/jenkins/workspace/build-common@4/go/src/github.com/pingcap/tidb/tidb-server/main.go:182\nruntime.main\n\t/usr/local/go1.16.4/src/runtime/proc.go:225"] [stack="github.com/pingcap/parser/terror.MustNil\n\t/root/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20210618053735-57843e8185c4/terror/terror.go:276\nmain.createStoreAndDomain\n\t/var/lib/docker/jenkins/workspace/build-common@4/go/src/github.com/pingcap/tidb/tidb-server/main.go:276\nmain.main\n\t/var/lib/docker/jenkins/workspace/build-common@4/go/src/github.com/pingcap/tidb/tidb-server/main.go:182\nruntime.main\n\t/usr/local/go1.16.4/src/runtime/proc.go:225"]

Can any experts help take a look?
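
For the pd-recover step mentioned above, the tool needs the original cluster ID, which PD, TiKV, and TiDB print in their logs as an `init cluster id` entry. Below is a minimal Python sketch for pulling that ID out of log text. The function name `find_cluster_id` is hypothetical, and the sample line only assumes the log shape that TiDB v5.1 writes.

```python
import re

# Matches the "init cluster id" entry and captures the numeric ID field.
CLUSTER_ID_RE = re.compile(r"init cluster id.*?\[cluster-id=(\d+)\]")


def find_cluster_id(log_text):
    """Return the first cluster ID found in the log text, or None."""
    match = CLUSTER_ID_RE.search(log_text)
    return int(match.group(1)) if match else None


# Sample entry in the TiDB log format (assumed shape, for illustration).
sample = (
    '[2022/10/29 14:33:41.330 +08:00] [INFO] [base_client.go:126] '
    '["[pd] init cluster id"] [cluster-id=7137515427758526124]'
)
print(find_cluster_id(sample))
```

The extracted value is what pd-recover takes as `-cluster-id`. Its `-alloc-id` value must be larger than any ID allocated before the failure, which this sketch does not try to determine.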

| username: xfworld | Original post link

It seems like the metadata is lost. Might as well redeploy a new set.

| username: robert233 | Original post link

Are there any other ways?

| username: xfworld | Original post link

It’s quite troublesome: you need to check each node one by one for damage. If you find any, all you can do is repair it node by node.

You can try, but I think reinstalling would be the better option.

| username: robert233 | Original post link

I checked the regions, and the ones that were missing replicas are all back to normal. Recovering the data some other way would also be acceptable. Is there any documentation?

| username: xfworld | Original post link

Yes. First restore PD, then check TiKV. If the TiDB node is down, you can discard it and deploy a new one.

Wish you success~

| username: robert233 | Original post link

PD has already been restored, and so has TiKV, but in the end TiDB still will not start.

| username: robert233 | Original post link

[Image: not available]

| username: xfworld | Original post link

Try adding a TiDB node.

| username: robert233 | Original post link

I scaled in the old TiDB node and added a new one, but it still doesn’t work.

| username: xfworld | Original post link

What error? Check the logs.

| username: robert233 | Original post link

tidb.log:

[2022/10/29 14:33:41.324 +08:00] [INFO] [trackerRecorder.go:28] ["Mem Profile Tracker started"]
[2022/10/29 14:33:41.325 +08:00] [INFO] [printer.go:47] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"10.246.177.103\",\"port\":4122,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"10.246.177.103:2400,10.246.250.135:2400,10.246.177.102:2400\",\"socket\":\"\",\"lease\":\"45s\",\"run-ddl\":true,\"split-table\":true,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/tmp/1000_tidb/MC4wLjAuMDo0MTIyLzAuMC4wLjA6MTAxMDI=/tmp-storage\",\"oom-action\":\"cancel\",\"mem-quota-query\":1073741824,\"tmp-storage-quota\":-1,\"enable-streaming\":false,\"enable-batch-dml\":false,\"lower-case-table-names\":2,\"server-version\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/home/tidb/tidb_deploy/tidb1/log/tidb.log\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"/home/tidb/tidb_deploy/tidb1/log/tidb_slow_query.log\",\"slow-threshold\":300,\"expensive-threshold\":10000,\"query-log-max-len\":4096,\"record-plan-in-slow-log\":1},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"require-secure-transport\":false,\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null,\"spilled-file-encryption-method\":\"plaintext\",\"enable-sem\":false},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10102,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"server-memory-quota\":0,\"memory-usage-alarm-ratio\":0.8,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"feedback-probability\":0,\"query-feedback-limit\":512,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-entry-size-limit\":6291456,\"txn-total-size-limit\":104857600,\"tcp-keep-al
ive\":true,\"tcp-no-delay\":true,\"cross-join\":true,\"run-auto-analyze\":true,\"distinct-agg-push-down\":false,\"committer-concurrency\":128,\"max-txn-ttl\":3600000,\"mem-profile-interval\":\"1m\",\"index-usage-sync-lease\":\"0s\",\"gogc\":100,\"enforce-mpp\":false},\"prepared-plan-cache\":{\"enabled\":false,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"pd-client\":{\"pd-server-timeout\":3},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-compression-type\":\"none\",\"commit-timeout\":\"41s\",\"async-commit\":{\"keys-limit\":256,\"total-key-size-limit\":4096,\"safe-window\":2000000000,\"allowed-clock-drift\":500000000},\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"1s\",\"copr-cache\":{\"capacity-mb\":1000},\"ttl-refreshed-txn-size\":33554432},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"\",\"load\":\"\"},\"pessimistic-txn\":{\"max-retry-count\":256,\"deadlock-history-capacity\":10},\"check-mb4-value-in-utf8\":true,\"max-index-length\":3072,\"index-limit\":64,\"table-column-count-limit\":1017,\"graceful-wait-before-shutdown\":0,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"stmt-summary\":{\"enable\":true,\"enable-internal-query\":false,\"max-stmt-count\":3000,\"max-sql
-length\":4096,\"refresh-interval\":1800,\"history-size\":24},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"max-server-connections\":0,\"new_collations_enabled_on_first-bootstrap\":false,\"experimental\":{\"allow-expression-index\":false},\"enable-collect-execution-info\":true,\"skip-register-to-dashboard\":false,\"enable-telemetry\":true,\"labels\":{},\"enable-global-index\":false,\"deprecate-integer-display-length\":false,\"enable-enum-length-limit\":true,\"stores-refresh-interval\":60,\"enable-tcp4-only\":false,\"enable-forwarding\":false}"]
[2022/10/29 14:33:41.325 +08:00] [INFO] [main.go:322] ["disable Prometheus push client"]
[2022/10/29 14:33:41.325 +08:00] [INFO] [store.go:68] ["new store"] [path=tikv://10.246.177.103:2400,10.246.250.135:2400,10.246.177.102:2400]
[2022/10/29 14:33:41.325 +08:00] [INFO] [client.go:214] ["[pd] create pd client with endpoints"] [pd-address="[10.246.177.103:2400,10.246.250.135:2400,10.246.177.102:2400]"]
[2022/10/29 14:33:41.325 +08:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2022/10/29 14:33:41.330 +08:00] [INFO] [base_client.go:334] ["[pd] update member urls"] [old-urls="[http://10.246.177.103:2400,http://10.246.250.135:2400,http://10.246.177.102:2400]"] [new-urls="[http://10.246.177.102:2400,http://10.246.177.103:2400,http://10.246.250.135:2400]"]
[2022/10/29 14:33:41.330 +08:00] [INFO] [base_client.go:346] ["[pd] switch leader"] [new-leader=http://10.246.177.102:2400] [old-leader=]
[2022/10/29 14:33:41.330 +08:00] [INFO] [base_client.go:126] ["[pd] init cluster id"] [cluster-id=7137515427758526124]
[2022/10/29 14:33:41.330 +08:00] [INFO] [client.go:238] ["[pd] create tso dispatcher"] [dc-location=global]
[2022/10/29 14:33:41.333 +08:00] [INFO] [store.go:74] ["new store with retry success"]
[2022/10/29 14:33:41.340 +08:00] [INFO] [tidb.go:70] ["new domain"] [store=tikv-7137515427758526124] ["ddl lease"=45s] ["stats lease"=3s] ["index usage sync lease"=0s]
[2022/10/29 14:33:41.349 +08:00] [INFO] [ddl.go:342] ["[ddl] start DDL"] [ID=7d713e88-0239-40e4-a584-6277590defa0] [runWorker=true]
[2022/10/29 14:33:41.349 +08:00] [INFO] [manager.go:188] ["start campaign owner"] [ownerInfo="[ddl] /tidb/ddl/fg/owner"]
[2022/10/29 14:33:41.353 +08:00] [INFO] [ddl.go:331] ["[ddl] start delRangeManager OK"] ["is a emulator"=false]
[2022/10/29 14:33:41.353 +08:00] [INFO] [ddl_worker.go:134] ["[ddl] start DDL worker"] [worker="worker 1, tp general"]
[2022/10/29 14:33:41.354 +08:00] [INFO] [ddl_worker.go:134] ["[ddl] start DDL worker"] [worker="worker 2, tp add index"]
[2022/10/29 14:33:41.785 +08:00] [INFO] [domain.go:155] ["full load InfoSchema success"] [currentSchemaVersion=0] [neededSchemaVersion=5293] ["start time"=416.187081ms]
[2022/10/29 14:33:41.788 +08:00] [INFO] [domain.go:370] ["full load and reset schema validator"]
[2022/10/29 14:33:41.798 +08:00] [INFO] [manager.go:188] ["start campaign owner"] [ownerInfo="[bindinfo] /tidb/bindinfo/owner"]
[2022/10/29 14:33:41.798 +08:00] [WARN] [sysvar_cache.go:52] ["sysvar cache is empty, triggering rebuild"]
[2022/10/29 14:33:41.803 +08:00] [WARN] [cache.go:309] ["load mysql.user fail"] [error="[planner:1054]Unknown column 'create_role_priv' in 'field list'"]
[2022/10/29 14:33:41.803 +08:00] [FATAL] [terror.go:276] ["unexpected error"] [error="[privilege:8049]mysql.user"] [stack="github.com/pingcap/parser/terror.MustNil\n\t/root/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20210618053735-57843e8185c4/terror/terror.go:276\nmain.createStoreAndDomain\n\t/var/lib/docker/jenkins/workspace/build-common@4/go/src/github.com/pingcap/tidb/tidb-server/main.go:276\nmain.main\n\t/var/lib/docker/jenkins/workspace/build-common@4/go/src/github.com/pingcap/tidb/tidb-server/main.go:182\nruntime.main\n\t/usr/local/go1.16.4/src/runtime/proc.go:225"] [stack="github.com/pingcap/parser/terror.MustNil\n\t/root/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20210618053735-57843e8185c4/terror/terror.go:276\nmain.createStoreAndDomain\n\t/var/lib/docker/jenkins/workspace/build-common@4/go/src/github.com/pingcap/tidb/tidb-server/main.go:276\nmain.main\n\t/var/lib/docker/jenkins/workspace/build-common@4/go/src/github.com/pingcap/tidb/tidb-server/main.go:182\nruntime.main\n\t/usr/local/go1.16.4/src/runtime/proc.go:225"]
[2022/10/29 14:33:56.876 +08:00] [INFO] [printer.go:33] ["Welcome to TiDB."] ["Release Version"=v5.1.4] [Edition=Community] ["Git Commit Hash"=094b3e5e69d0921e2abe6907d217478bb5a7082d] ["Git Branch"=heads/refs/tags/v5.1.4] ["UTC Build Time"="2022-02-10 10:09:15"] [GoVersion=go1.16.4] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]

| username: xfworld | Original post link

["load mysql.user fail"] [error="[planner:1054]Unknown column 'create_role_priv' in 'field list'"]

It seems the metadata is lost~ Rebuild it… :joy:
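
The WARN entry quoted above is easy to miss next to the long config dump in tidb.log. Below is a minimal Python sketch that keeps only WARN, ERROR, and FATAL entries. The function name `problem_entries` is hypothetical, the regular expression assumes the `[time] [LEVEL] [source] ["message"]` layout seen in this thread, and the FATAL sample line has its stack fields trimmed for brevity.

```python
import re

# Header of the log layout: [time] [LEVEL] [source] ["message"] ...
LINE_RE = re.compile(
    r'^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<source>[^\]]+)\] '
    r'\["(?P<message>[^"]*)"\]'
)


def problem_entries(lines, levels=("WARN", "ERROR", "FATAL")):
    """Yield (time, level, source, message) for entries at the given levels."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") in levels:
            yield (m.group("time"), m.group("level"),
                   m.group("source"), m.group("message"))


sample = """\
[2022/10/29 14:33:41.798 +08:00] [INFO] [manager.go:188] ["start campaign owner"] [ownerInfo="[bindinfo] /tidb/bindinfo/owner"]
[2022/10/29 14:33:41.798 +08:00] [WARN] [sysvar_cache.go:52] ["sysvar cache is empty, triggering rebuild"]
[2022/10/29 14:33:41.803 +08:00] [WARN] [cache.go:309] ["load mysql.user fail"] [error="[planner:1054]Unknown column 'create_role_priv' in 'field list'"]
[2022/10/29 14:33:41.803 +08:00] [FATAL] [terror.go:276] ["unexpected error"] [error="[privilege:8049]mysql.user"]
""".splitlines()

found = list(problem_entries(sample))
for time, level, source, message in found:
    print(level, source, message)
```

Lines that do not match the assumed header layout, such as the wrapped continuation lines of the config dump, are skipped rather than reported.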

| username: robert233 | Original post link

The data is quite important; otherwise I would simply rebuild. :joy:

| username: xfworld | Original post link

The system-level metadata is missing, so it’s basically impossible to start the service normally. :joy:

| username: robert233 | Original post link

:joy: I’ll take a look.

| username: xfworld | Original post link

If you use a new TiKV node, it might start, but the new node won’t have the previous data, and you won’t be able to retrieve that data.

For testing, try to use several VMs or physical machines. At least you can isolate them, and if one node fails, you can still scale and recover.

| username: robert233 | Original post link

These are three physical machines, each running its own set of components. Unfortunately, the one that went down is the TiUP control machine.

| username: xfworld | Original post link

Three machines are a bit too few. It would be better to set up a VM cluster and run TiDB on the VMs, as long as the hardware configuration and network speed are sufficient.

| username: robert233 | Original post link

I used another approach:

  1. Created a new cluster.
  2. PD and TiKV in the old cluster were both healthy, so I used BR to take a full backup of it and restored that backup into the new cluster.
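
The two BR steps above can be sketched as a pair of command lines, built here in Python without being executed. This is only an illustration: the function name `br_commands`, the PD addresses, and the storage path are all hypothetical, and the real invocation in this thread was not posted. The sketch uses only BR's `backup full` and `restore full` subcommands with the `--pd` and `--storage` options.

```python
def br_commands(old_pd, new_pd, storage):
    """Build argv lists for a full BR backup and the matching restore."""
    # Backup reads from the old cluster through its PD endpoint.
    backup = ["br", "backup", "full", "--pd", old_pd, "--storage", storage]
    # Restore writes into the new cluster through its PD endpoint.
    restore = ["br", "restore", "full", "--pd", new_pd, "--storage", storage]
    return backup, restore


# Hypothetical addresses and path, for illustration only.
backup, restore = br_commands(
    old_pd="10.0.0.1:2379",
    new_pd="10.0.0.2:2379",
    storage="local:///data/br-backup",
)
print(" ".join(backup))
print(" ".join(restore))
```

With `local://` storage, each TiKV node writes its own backup files to that path, so the files have to be made available to every TiKV node of the new cluster before restoring. A shared location such as NFS avoids that copying step.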