Failed to Scale Out TiFlash

translator_bot · June 23, 2024, 8:35am

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 扩容tiflash失败

| username: kwongping2020

[TiDB Usage Environment] Production Environment
[TiDB Version] V6.1.0
[Problem Encountered] After the cluster was upgraded from V5.2.2 to V6.1.0, TiFlash generated an 11 GB core.* file every two minutes, so all TiFlash nodes were scaled in and then scaled out again. The scale-out now fails with a directory conflict: Error: Deploy directory overlaps to another instance (spec.deploy.dir_overlap)
[Reproduction Path]
[Problem Phenomenon and Impact]

Scaling in

SELECT * FROM information_schema.tiflash_replica; and ALTER TABLE DB.TABLE SET TIFLASH REPLICA 0;
tiup cluster scale-in hrdb --node X.X.X.X:X and tiup cluster prune cluster
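
When many tables carry TiFlash replicas, the replica-removal step can be scripted. The following is a minimal Python sketch and not part of the original post. It assumes the rows were already fetched from information_schema.tiflash_replica as (TABLE_SCHEMA, TABLE_NAME) pairs, and the helper names and sample rows are hypothetical. It only builds the SQL text; executing it is left to whichever client is in use.

```python
def quote_ident(name: str) -> str:
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_drop_replica_statements(rows):
    """Build one ALTER TABLE ... SET TIFLASH REPLICA 0 statement per table.

    rows: iterable of (table_schema, table_name) pairs, assumed to come from
    SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.tiflash_replica;
    """
    statements = []
    for schema, table in rows:
        statements.append(
            f"ALTER TABLE {quote_ident(schema)}.{quote_ident(table)} "
            "SET TIFLASH REPLICA 0;"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only.
    sample = [("db1", "orders"), ("db1", "users")]
    for stmt in build_drop_replica_statements(sample):
        print(stmt)
```

Reviewing the generated statements before running them is advisable, since each one drops every TiFlash replica of its table.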

Scaling out
[root@tidb01 config-tidb]# cat scale-out-20220728-51.yaml
global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/home/tidb/tidb-deploy"
  data_dir: "/home/tidb/tidb-data"
  log_dir: "/home/tidb/tidb-logs"

tiflash_servers:
  - host: X.X.X.X

[root@tidb01 config-tidb]# tiup cluster scale-out XXXX scale-out-20220728-51.yaml

[Attachment] Relevant logs and monitoring (https://metricstool.pingcap.com/)

| username: xfworld

Does a forced scale-in work? Try adding the --force parameter.

If not, try the following:
Manual scale-in…

| username: kwongping2020

The output of the commands above shows no TiFlash information, so the scale-in should have succeeded.

| username: xfworld

Congratulations~

| username: kwongping2020

But the scale-out is now failing.

| username: xfworld

Just scale out again. If it fails, you already know how to handle it.

| username: kwongping2020

The scale-in is complete, but now the scale-out fails.

| username: xfworld

You need to clean up all the data directories on the original TiFlash node. Every initialization step must be redone, without exception…

Otherwise, TiFlash will be assumed to be already running on the node, and the scale-out will not be possible…

| username: kwongping2020

The TiFlash-related data in the deploy_dir, data_dir, and log_dir directories has been cleaned up, but the scale-out still failed.

| username: banana_jian

You need to modify your configuration file. In the tiflash_servers section, specify separate, previously unused directories for deploy_dir, data_dir, and log_dir.

| username: banana_jian

For example:

tiflash_servers:
  - host: 192.168.135.148
    tcp_port: 9001
    http_port: 8124
    flash_service_port: 3931
    flash_proxy_port: 20171
    flash_proxy_status_port: 20293
    metrics_port: 8235
    deploy_dir: "/tidb-deploy/tiflash-9001"
    data_dir: "/tidb-data/tiflash-9001"
    log_dir: "/tidb-deploy/tiflash-9001/log"

| username: kwongping2020

Specifying the directories also failed. After checking, it appears that in the initial deployment every component was configured to write its logs to the same directory, log_dir: /home/tidb/tidb-logs.

| username: banana_jian

Could you please remove the global part and try again, leaving only the tiflash_servers part like in my example?

| username: banana_jian

Additionally, it seems that the log directories for all components in your current configuration are the same: log_dir: /home/tidb/tidb-logs. I’m not sure if this has any impact.
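
To make the suspicion concrete, here is a minimal Python sketch of how a directory-overlap check could flag a shared log_dir. This is not TiUP's actual implementation, and the thread does not confirm that the shared log directory is the cause. The function names, instance names, and host address are hypothetical, and a real check may apply exemptions that this sketch ignores.

```python
from pathlib import PurePosixPath


def is_same_or_nested(a: str, b: str) -> bool:
    """Return True if the two paths are equal or one lies inside the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_overlaps(instances):
    """Report directory pairs that overlap between different instances on one host.

    instances: list of dicts with keys "name", "host", and "dirs" (a list of
    absolute paths such as deploy_dir, data_dir, and log_dir).
    """
    conflicts = []
    for i, first in enumerate(instances):
        for second in instances[i + 1:]:
            if first["host"] != second["host"]:
                continue  # directories on different hosts cannot collide
            for d1 in first["dirs"]:
                for d2 in second["dirs"]:
                    if is_same_or_nested(d1, d2):
                        conflicts.append((first["name"], d1, second["name"], d2))
    return conflicts


if __name__ == "__main__":
    # Hypothetical topology: an existing instance and a new TiFlash instance
    # that share the global log_dir from the original post.
    existing = {
        "name": "tikv-20160",
        "host": "10.0.0.1",
        "dirs": ["/home/tidb/tidb-deploy/tikv-20160",
                 "/home/tidb/tidb-data/tikv-20160",
                 "/home/tidb/tidb-logs"],
    }
    new = {
        "name": "tiflash-9000",
        "host": "10.0.0.1",
        "dirs": ["/home/tidb/tidb-deploy/tiflash-9000",
                 "/home/tidb/tidb-data/tiflash-9000",
                 "/home/tidb/tidb-logs"],
    }
    for conflict in find_overlaps([existing, new]):
        print(conflict)
```

Under these assumptions, the only pair reported is the shared /home/tidb/tidb-logs directory, which is the kind of collision a per-instance log_dir would avoid.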

| username: kwongping2020

I have already tried it, but it still doesn’t work.

| username: kwongping2020

Hmm, I also suspect this is the cause.
However, this is a production environment, and changing the setting with tiup cluster edit-config ${cluster-name} would require restarting the cluster, which is not feasible at the moment. It is also not certain that this is the cause.

| username: 长安是只喵

Wasn't it stated that TiFlash cannot be upgraded online from versions earlier than 5.3 to 5.3 or later?

| username: kwongping2020

[Image not available]

| username: 长安是只喵

Yes, that’s exactly what I meant. It seems a bit complicated, so it’s better to just delete the TiFlash replica and add it back after the upgrade.

| username: kwongping2020

I have already upgraded and removed TiFlash, and now I cannot add TiFlash or any other component. Every attempt reports a directory conflict.