Error Adding New TiDB Instance with TiUP

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiup add new tidb instance error

| username: Doslin

[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed to cause the issue

  1. tiup cluster scale-out xx tidb_flash_scaleout.yaml
  2. YAML content (a pre-check sketch follows below):
[root@xx install_dir]# cat tidb_flash_scaleout.yaml 
tidb_servers:
  - host: 10.29.0.20
tiflash_servers:
  - host: 10.29.0.20
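
Before applying it, the topology can be pre-checked against the cluster. This is only a sketch, reusing the cluster name xx and the YAML file name from the command above:

tiup cluster check xx tidb_flash_scaleout.yaml --cluster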

[Encountered Issue: Issue Phenomenon and Impact]
tiup cluster display xx

10.29.0.19:9000    tiflash       10.29.0.19   9000/8123/3930/20170/20292/8234  linux/x86_64  Tombstone  /data/tidb-data/tiflash-9000       /data/tidb-deploy/tiflash-9000
10.29.0.20:9000    tiflash       10.29.0.20   9000/8123/3930/20170/20292/8234  linux/x86_64  N/A        /data/tidb-data/tiflash-9000       /data/tidb-deploy/tiflash-9000

Error: failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s

Verbose debug logs have been written to /root/.tiup/logs/tiup-cluster-debug-2022-11-15-17-36-57.log.

[2022/11/15 17:35:58.383 +08:00] [INFO] [client.go:687] ["[pd] tso dispatcher created"] [dc-location=global]
[2022/11/15 17:35:58.383 +08:00] [INFO] [store.go:80] ["new store with retry success"]
[2022/11/15 17:35:58.384 +08:00] [FATAL] [session.go:3052] ["check bootstrapped failed"] [error="failed to decode region range key, key: \"6D426F6F7473747261FF704B657900000000FB0000000000000073\", err: invalid marker byte, group bytes \"9645__spl\""] [stack="github.com/pingcap/tidb/session.getStoreBootstrapVersion\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:3052\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2827\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:296\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:202\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]

| username: Christophe | Original post link

What version?

| username: Doslin | Original post link

V6.1.0

| username: Kongdom | Original post link

Fill in the complete scale-out configuration and try again.
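
For example, a fuller scale-out topology might look like this; the ports and directories below are assumptions that mirror the defaults visible in the display output, not values taken from the original file:

tidb_servers:
  - host: 10.29.0.20
    port: 4000
    status_port: 10080
    deploy_dir: /data/tidb-deploy/tidb-4000
    log_dir: /data/tidb-deploy/tidb-4000/log
tiflash_servers:
  - host: 10.29.0.20
    tcp_port: 9000
    http_port: 8123
    flash_service_port: 3930
    flash_proxy_port: 20170
    flash_proxy_status_port: 20292
    metrics_port: 8234
    deploy_dir: /data/tidb-deploy/tiflash-9000
    data_dir: /data/tidb-data/tiflash-9000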

| username: Doslin | Original post link

Even after completing the configuration, the same issue persists:

+ Initialize target host environments
+ Deploy TiDB instance
  - Deploy instance tidb -> 10.29.0.20:4000 ... Done
  - Deploy instance tiflash -> 10.29.0.20:9000 ... Done
+ Copy certificate to remote host
+ Generate scale-out config
  - Generate scale-out config tidb -> 10.29.0.20:4000 ... Done
  - Generate scale-out config tiflash -> 10.29.0.20:9000 ... Done
+ Init monitor config
+ Check status
Enabling component tidb
        Enabling instance 10.29.0.20:4000
        Enable instance 10.29.0.20:4000 success
Enabling component tiflash
        Enabling instance 10.29.0.20:9000
        Enable instance 10.29.0.20:9000 success
Enabling component node_exporter
        Enabling instance 10.29.0.20
        Enable 10.29.0.20 success
Enabling component blackbox_exporter
        Enabling instance 10.29.0.20
        Enable 10.29.0.20 success
+ [ Serial ] - Save meta
+ [ Serial ] - Start new instances
Starting component tidb
        Starting instance 10.29.0.20:4000

Error: failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s

LISTEN     0      32768     [::]:12020                 [::]:*                  \n", "stderr": "", "__hash__": "1a4714d7146fa85240a1ff4ef7451df719e0b4f0", "__func__": "github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute", "hit": false}
2022-11-16T08:55:58.852+0800    DEBUG   retry error     {"error": "operation timed out after 2m0s"}
2022-11-16T08:55:58.852+0800    DEBUG   TaskFinish      {"task": "Start new instances", "error": "failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s", "errorVerbose": "timed out waiting for port 4000 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:119\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:151\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.\nfailed to start tidb"}
2022-11-16T08:55:58.852+0800    INFO    Execute command finished        {"code": 1, "error": "failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s", "errorVerbose": "timed out waiting for port 4000 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:119\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:151\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.\nfailed to start tidb"}

| username: Kongdom | Original post link

Please provide the log under this path:
/data/tidb-deploy/tidb-4000/log
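
For example (assuming the default log file name tidb.log in that directory):

tail -n 200 /data/tidb-deploy/tidb-4000/log/tidb.log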

| username: Doslin | Original post link

This is the log I posted earlier:

[2022/11/15 17:35:58.383 +08:00] [INFO] [client.go:687] ["[pd] tso dispatcher created"] [dc-location=global]
[2022/11/15 17:35:58.383 +08:00] [INFO] [store.go:80] ["new store with retry success"]
[2022/11/15 17:35:58.384 +08:00] [FATAL] [session.go:3052] ["check bootstrapped failed"] [error="failed to decode region range key, key: \"6D426F6F7473747261FF704B657900000000FB0000000000000073\", err: invalid marker byte, group bytes \"9645__spl\""] [stack="github.com/pingcap/tidb/session.getStoreBootstrapVersion\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:3052\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2827\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:296\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:202\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

| username: WalterWj | Original post link

It looks like the TiKV Region data might be corrupted.
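
If so, one way to see which Region covers the key from the log, and whether any Regions are abnormal, is pd-ctl. This is only a sketch: the PD address is a placeholder, and it assumes this pd-ctl version accepts hex-formatted keys:

tiup ctl:v6.1.0 pd -u http://<pd-ip>:2379 region key --format=hex 6D426F6F7473747261FF704B657900000000FB0000000000000073
tiup ctl:v6.1.0 pd -u http://<pd-ip>:2379 region check miss-peer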

| username: Jiawei | Original post link

Before scaling out, did you check whether there are any risks in the cluster and fix them?
Alternatively, you could try scaling out TiFlash and TiDB separately.
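
A minimal sketch of the separate scale-out, with hypothetical topology file names split out of the original one:

tiup cluster scale-out xx tidb_scaleout.yaml
tiup cluster scale-out xx tiflash_scaleout.yaml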

| username: WalterWj | Original post link

Based on your display output, the TiFlash node on .19 should be offline, right? Are there still any TiFlash replicas or TiFlash nodes left in the cluster?

| username: Doslin | Original post link

Before scaling out, did you check whether there are any risks in the cluster and fix them?

I checked; all of the reported risks are already known:

Node        Check         Result  Message
----        -----         ------  -------
10.29.0.20  memory        Pass    memory size is 32768MB
10.29.0.20  limits        Fail    soft limit of 'stack' for user 'root' is not set or too low
10.29.0.20  limits        Fail    soft limit of 'nofile' for user 'root' is not set or too low
10.29.0.20  limits        Fail    hard limit of 'nofile' for user 'root' is not set or too low
10.29.0.20  thp           Pass    THP is disabled
10.29.0.20  service       Fail    service irqbalance is not running
10.29.0.20  command       Pass    numactl: policy: default
10.29.0.20  timezone      Pass    time zone is the same as the first PD machine: Asia/Shanghai
10.29.0.20  os-version    Fail    os vendor alinux not supported
10.29.0.20  cpu-cores     Pass    number of CPU cores / threads: 8
10.29.0.20  cpu-governor  Warn    Unable to determine current CPU frequency governor policy
10.29.0.20  selinux       Pass    SELinux is disabled
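
For reference, some of the Fail items (such as the limits settings) can usually be auto-repaired before scaling out; a sketch reusing the cluster name and topology file from earlier:

tiup cluster check xx tidb_flash_scaleout.yaml --cluster --apply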

| username: Doslin | Original post link

The TiFlash node on .19 has been taken offline, and there are no TiFlash replicas in the entire cluster.

| username: WalterWj | Original post link

Run select * from information_schema.tiflash_replica to confirm that there are no TiFlash replicas left.
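
For instance (db_name.table_name below is a placeholder; the ALTER is only needed if the query returns rows):

SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE FROM information_schema.tiflash_replica;
ALTER TABLE db_name.table_name SET TIFLASH REPLICA 0;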

After TiFlash is taken offline, it should in theory no longer show up in the display output. Is it possible that tiup cluster prune was not run? See: tiup cluster prune | PingCAP Docs
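
If that step was skipped, the Tombstone node can be cleaned up with (a sketch, reusing the cluster name xx from earlier):

tiup cluster prune xx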

| username: WalterWj | Original post link

Moreover, the key in the log indeed cannot be decoded:

(root@127.0.0.1) [(none)]>select tidb_decode_key("6D426F6F7473747261FF704B657900000000FB0000000000000073");
+---------------------------------------------------------------------------+
| tidb_decode_key("6D426F6F7473747261FF704B657900000000FB0000000000000073") |
+---------------------------------------------------------------------------+
| 6D426F6F7473747261FF704B657900000000FB0000000000000073                    |
+---------------------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)

(root@127.0.0.1) [(none)]>show warnings;
+---------+------+---------------------------------------------------------------------+
| Level   | Code | Message                                                             |
+---------+------+---------------------------------------------------------------------+
| Warning | 1105 | invalid key: 6D426F6F7473747261FF704B657900000000FB0000000000000073 |
+---------+------+---------------------------------------------------------------------+
1 row in set (0.00 sec)
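
For what it's worth, the raw bytes of that hex string can also be inspected outside TiDB; a sketch using xxd:

echo 6D426F6F7473747261FF704B657900000000FB0000000000000073 | xxd -r -p | xxd

The ASCII column shows the meta key m + BootstrapKey (the 0xFF/0xFB bytes are group-encoding padding markers), which tidb-server reads at startup in getStoreBootstrapVersion, so the decode failure is on that meta key rather than on user data.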

| username: Doslin | Original post link

Let me explain my operation process again.
Attempt 1: I had already removed all TiFlash and TiDB instances from the cluster, so tiup cluster display xx showed no TiDB or TiFlash nodes. On that basis I tried to scale out TiDB and TiFlash and got the above error.
Attempt 2: tiup cluster display xx showed one TiDB and one TiFlash instance, both in Up status. On that basis I tried to scale out TiDB and TiFlash again and got the same error.

Even if I fill in the complete configuration, I still get the above error.

| username: Minorli-PingCAP | Original post link

Is there production data in the cluster?