Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tiup add new tidb instance error
[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] What operations were performed to cause the issue
- tiup cluster scale-out xx tidb_flash_scaleout.yaml
- yaml content
[root@xx install_dir]# cat tidb_flash_scaleout.yaml
tidb_servers:
  - host: 10.29.0.20
tiflash_servers:
  - host: 10.29.0.20
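For reference, the target host can be pre-checked against this topology before scaling out; a rough sketch (cluster name and file name as above):
tiup cluster check xx tidb_flash_scaleout.yaml --cluster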
[Encountered Issue: Issue Phenomenon and Impact]
tiup cluster display xx
10.29.0.19:9000 tiflash 10.29.0.19 9000/8123/3930/20170/20292/8234 linux/x86_64 Tombstone /data/tidb-data/tiflash-9000 /data/tidb-deploy/tiflash-9000
10.29.0.20:9000 tiflash 10.29.0.20 9000/8123/3930/20170/20292/8234 linux/x86_64 N/A /data/tidb-data/tiflash-9000 /data/tidb-deploy/tiflash-9000
Error: failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s
Verbose debug logs have been written to /root/.tiup/logs/tiup-cluster-debug-2022-11-15-17-36-57.log.
[2022/11/15 17:35:58.383 +08:00] [INFO] [client.go:687] ["[pd] tso dispatcher created"] [dc-location=global]
[2022/11/15 17:35:58.383 +08:00] [INFO] [store.go:80] ["new store with retry success"]
[2022/11/15 17:35:58.384 +08:00] [FATAL] [session.go:3052] ["check bootstrapped failed"] [error="failed to decode region range key, key: \"6D426F6F7473747261FF704B657900000000FB0000000000000073\", err: invalid marker byte, group bytes \"9645__spl\""] [stack="github.com/pingcap/tidb/session.getStoreBootstrapVersion\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:3052\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2827\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:296\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:202\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
Try filling in the complete scale-out configuration and running it again.
Even with the complete configuration, the same issue persists.
+ Initialize target host environments
+ Deploy TiDB instance
- Deploy instance tidb -> 10.29.0.20:4000 ... Done
- Deploy instance tiflash -> 10.29.0.20:9000 ... Done
+ Copy certificate to remote host
+ Generate scale-out config
- Generate scale-out config tidb -> 10.29.0.20:4000 ... Done
- Generate scale-out config tiflash -> 10.29.0.20:9000 ... Done
+ Init monitor config
+ Check status
Enabling component tidb
Enabling instance 10.29.0.20:4000
Enable instance 10.29.0.20:4000 success
Enabling component tiflash
Enabling instance 10.29.0.20:9000
Enable instance 10.29.0.20:9000 success
Enabling component node_exporter
Enabling instance 10.29.0.20
Enable 10.29.0.20 success
Enabling component blackbox_exporter
Enabling instance 10.29.0.20
Enable 10.29.0.20 success
+ [ Serial ] - Save meta
+ [ Serial ] - Start new instances
Starting component tidb
Starting instance 10.29.0.20:4000
Error: failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s
LISTEN 0 32768 [::]:12020 [::]:* \n", "stderr": "", "__hash__": "1a4714d7146fa85240a1ff4ef7451df719e0b4f0", "__func__": "github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute", "hit": false}
2022-11-16T08:55:58.852+0800 DEBUG retry error {"error": "operation timed out after 2m0s"}
2022-11-16T08:55:58.852+0800 DEBUG TaskFinish {"task": "Start new instances", "error": "failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s", "errorVerbose": "timed out waiting for port 4000 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:119\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:151\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.\nfailed to start tidb"}
2022-11-16T08:55:58.852+0800 INFO Execute command finished {"code": 1, "error": "failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s", "errorVerbose": "timed out waiting for port 4000 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:119\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:151\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.\nfailed to start tidb"}
Please provide the log under this path:
/data/tidb-deploy/tidb-4000/log
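For example (assuming the default log file name under that directory):
tail -n 200 /data/tidb-deploy/tidb-4000/log/tidb.log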
This is the log I posted above:
[2022/11/15 17:35:58.383 +08:00] [INFO] [client.go:687] ["[pd] tso dispatcher created"] [dc-location=global]
[2022/11/15 17:35:58.383 +08:00] [INFO] [store.go:80] ["new store with retry success"]
[2022/11/15 17:35:58.384 +08:00] [FATAL] [session.go:3052] ["check bootstrapped failed"] [error="failed to decode region range key, key: \"6D426F6F7473747261FF704B657900000000FB0000000000000073\", err: invalid marker byte, group bytes \"9645__spl\""] [stack="github.com/pingcap/tidb/session.getStoreBootstrapVersion\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:3052\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2827\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:296\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:202\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
Why does it feel like a TiKV Region is corrupted?
Before scaling out, did you check the cluster for risks and fix them?
Or could you try scaling out TiFlash and TiDB separately?
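For example, a rough sketch (the two topology file names are just placeholders, each containing only the corresponding section from your yaml):
# scale out TiDB first
tiup cluster scale-out xx tidb_scaleout.yaml
# once TiDB is confirmed Up, scale out TiFlash
tiup cluster scale-out xx tiflash_scaleout.yaml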
Based on your display output, the TiFlash node on 10.29.0.19 should be offline, right? Are there still any TiFlash replicas or TiFlash nodes left in the cluster?
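To double-check, the store list in PD shows whether the old TiFlash store on .19 is still hanging around as Tombstone; a sketch, where <version> and <pd-host> are placeholders for your cluster version and a PD address:
tiup ctl:v<version> pd -u http://<pd-host>:2379 store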
Before scaling out, did you check the cluster for risks and fix them?
I checked; all the reported risks are already known.
Node Check Result Message
---- ----- ------ -------
10.29.0.20 memory Pass memory size is 32768MB
10.29.0.20 limits Fail soft limit of 'stack' for user 'root' is not set or too low
10.29.0.20 limits Fail soft limit of 'nofile' for user 'root' is not set or too low
10.29.0.20 limits Fail hard limit of 'nofile' for user 'root' is not set or too low
10.29.0.20 thp Pass THP is disabled
10.29.0.20 service Fail service irqbalance is not running
10.29.0.20 command Pass numactl: policy: default
10.29.0.20 timezone Pass time zone is the same as the first PD machine: Asia/Shanghai
10.29.0.20 os-version Fail os vendor alinux not supported
10.29.0.20 cpu-cores Pass number of CPU cores / threads: 8
10.29.0.20 cpu-governor Warn Unable to determine current CPU frequency governor policy
10.29.0.20 selinux Pass SELinux is disabled
The TiFlash on node .19 has already been taken offline, and there are no TiFlash replicas in the entire cluster.
Run select * from information_schema.tiflash_replica to confirm that there are no TiFlash replicas.
After TiFlash is taken offline, in theory it should no longer show up in display. Is it possible that tiup cluster prune was not run? See: tiup cluster prune | PingCAP Documentation
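A rough sketch of both checks (the TiDB host is a placeholder, assuming the mysql client is available):
# confirm no tables still have TiFlash replicas configured
mysql -h <tidb-host> -P 4000 -u root -e "select table_schema, table_name, replica_count from information_schema.tiflash_replica"
# clean up the Tombstone TiFlash node so it no longer shows in display
tiup cluster prune xx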
Moreover, the key in the log indeed cannot be decoded:
(root@127.0.0.1) [(none)]>select tidb_decode_key("6D426F6F7473747261FF704B657900000000FB0000000000000073");
+---------------------------------------------------------------------------+
| tidb_decode_key("6D426F6F7473747261FF704B657900000000FB0000000000000073") |
+---------------------------------------------------------------------------+
| 6D426F6F7473747261FF704B657900000000FB0000000000000073 |
+---------------------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)
(root@127.0.0.1) [(none)]>show warnings;
+---------+------+---------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------------------------------+
| Warning | 1105 | invalid key: 6D426F6F7473747261FF704B657900000000FB0000000000000073 |
+---------+------+---------------------------------------------------------------------+
1 row in set (0.00 sec)
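As a side note, the raw bytes of that key can be hex-dumped locally just to see what they contain (assuming xxd is available):
echo 6D426F6F7473747261FF704B657900000000FB0000000000000073 | xxd -r -p | xxd
# the ASCII column shows fragments like "mBootstra" and "pKey", i.e. it looks like
# the encoded bootstrap meta key that TiDB tries to read at startup (see the FATAL stack above)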
Let me explain my operation process again.
Attempt 1: I had already removed (scaled in) all TiFlash and TiDB instances in the cluster, so tiup cluster display xx showed no TiDB or TiFlash. On that basis I tried to scale out TiDB and TiFlash and got the error above.
Attempt 2: tiup cluster display xx showed one TiDB and one TiFlash, both in Up status. On that basis I tried to scale out TiDB and TiFlash and got the error above.
Even with a complete configuration, I still get the same error.
Is there production data in the cluster?