Failed to Scale Out TiFlash

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 扩容tiflash失败

| username: kwongping2020

[TiDB Usage Environment] Production Environment
[TiDB Version] V6.1.0
[Problem Encountered] After upgrading the cluster from V5.2.2 to V6.1.0, TiFlash generated an 11 GB core.* file every two minutes, so all TiFlash nodes were scaled in and then scaled out again. The scale-out fails with a directory conflict: Error: Deploy directory overlaps to another instance (spec.deploy.dir_overlap)
[Reproduction Path]
[Problem Phenomenon and Impact]

  • Scaling in
  1. SELECT * FROM information_schema.tiflash_replica; and ALTER TABLE DB.TABLE SET TIFLASH REPLICA 0;
  2. tiup cluster scale-in hrdb --node X.X.X.X:X and tiup cluster prune cluster
  • Scaling out
    [root@tidb01 config-tidb]# cat scale-out-20220728-51.yaml
    global:
      user: "tidb"
      ssh_port: 22
      deploy_dir: "/home/tidb/tidb-deploy"
      data_dir: "/home/tidb/tidb-data"
      log_dir: "/home/tidb/tidb-logs"

    tiflash_servers:
      - host: X.X.X.X

    [root@tidb01 config-tidb]# tiup cluster scale-out XXXX scale-out-20220728-51.yaml

[Attachment] Relevant logs and monitoring (https://metricstool.pingcap.com/)



| username: xfworld | Original post link

Does a forced scale-in work? Add the --force parameter.

If not, try the following:
Manual scaling down…

| username: kwongping2020 | Original post link

The output of the above commands does not contain any information about TiFlash, so the scale-in should have been successful.

| username: xfworld | Original post link

Congratulations~ :cowboy_hat_face:

| username: kwongping2020 | Original post link

:sweat_smile: But the scale-out is what is failing now.

| username: xfworld | Original post link

Just scale out again. If it fails, you now know how to handle it.

| username: kwongping2020 | Original post link

The scale-in is complete, but now it’s impossible to scale out :skull_and_crossbones:

| username: xfworld | Original post link

You need to clean up all the data directories on the original TiFlash node. All the initialization tasks must be done without exception…

Otherwise, TiFlash will be assumed to be already running on the node, and the scale-out will not be possible…

| username: kwongping2020 | Original post link

The TiFlash-related data in the deploy_dir, data_dir, and log_dir directories has been cleaned up, but the scale-out still failed.

| username: banana_jian | Original post link

You need to modify your configuration file. In the tiflash_servers section, give deploy_dir, data_dir, and log_dir their own directories that have not been used before.

| username: banana_jian | Original post link

For example:

tiflash_servers:
  - host: 192.168.135.148
    tcp_port: 9001
    http_port: 8124
    flash_service_port: 3931
    flash_proxy_port: 20171
    flash_proxy_status_port: 20293
    metrics_port: 8235
    deploy_dir: "/tidb-deploy/tiflash-9001"
    data_dir: "/tidb-data/tiflash-9001"
    log_dir: "/tidb-deploy/tiflash-9001/log"
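
If several TiFlash instances need entries like the one above, the port-suffixed directory pattern can be generated rather than typed by hand. The following is only a minimal sketch: the helper names `tiflash_entry` and `to_yaml` are hypothetical, the default roots `/tidb-deploy` and `/tidb-data` are taken from the example, and only a subset of the keys is emitted.

```python
def tiflash_entry(host, tcp_port, deploy_root="/tidb-deploy", data_root="/tidb-data"):
    """Build one tiflash_servers entry whose directories are suffixed with the TCP port.

    Hypothetical helper: it mirrors the naming pattern of the example above
    and does not call or reproduce anything from tiup itself.
    """
    tag = f"tiflash-{tcp_port}"
    deploy_dir = f"{deploy_root}/{tag}"
    return {
        "host": host,
        "tcp_port": tcp_port,
        "deploy_dir": deploy_dir,
        "data_dir": f"{data_root}/{tag}",
        "log_dir": f"{deploy_dir}/log",  # logs kept under this instance's own deploy dir
    }


def to_yaml(entries):
    """Render the entries as a tiflash_servers YAML fragment (plain string building)."""
    lines = ["tiflash_servers:"]
    for entry in entries:
        for i, (key, value) in enumerate(entry.items()):
            prefix = "  - " if i == 0 else "    "
            rendered = f'"{value}"' if key.endswith("_dir") else value
            lines.append(f"{prefix}{key}: {rendered}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_yaml([tiflash_entry("192.168.135.148", 9001)]))
```

The remaining ports (http_port, flash_service_port, and so on) still have to be added by hand or left to their defaults.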
| username: kwongping2020 | Original post link

Specifying the directory also failed. After checking, it seems that at the very beginning of the deployment, the logs of each component were stored in the same directory log_dir: /home/tidb/tidb-logs.

| username: banana_jian | Original post link

Could you please remove the global part and try again, leaving only the tiflash_servers part like in my example?

| username: banana_jian | Original post link

Additionally, it seems that the log directories for all components in your current configuration are the same: log_dir: /home/tidb/tidb-logs. I’m not sure if this has any impact.
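
One way to test this suspicion offline is to compare the directories of the existing instances with those of the new TiFlash instance. The sketch below is not tiup's actual check. It assumes that "overlap" means two different instances on the same host have directories that are identical or nested inside one another. The instance names, the host 10.0.0.1, and the TiKV paths are made up for illustration.

```python
from itertools import combinations
from pathlib import PurePosixPath

DIR_KEYS = ("deploy_dir", "data_dir", "log_dir")


def overlaps(a, b):
    """True if the two paths are equal or one is nested inside the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_overlaps(instances):
    """Return (name1, key1, name2, key2) for each overlapping directory pair
    between two different instances on the same host.

    Assumed rule, for illustration only; tiup's real check may differ.
    """
    hits = []
    for x, y in combinations(instances, 2):
        if x["host"] != y["host"]:
            continue  # directories on different hosts cannot collide
        for kx in DIR_KEYS:
            for ky in DIR_KEYS:
                if kx in x and ky in y and overlaps(x[kx], y[ky]):
                    hits.append((x["name"], kx, y["name"], ky))
    return hits


if __name__ == "__main__":
    existing_tikv = {  # hypothetical existing instance sharing the global log_dir
        "name": "tikv-20160",
        "host": "10.0.0.1",
        "deploy_dir": "/home/tidb/tidb-deploy/tikv-20160",
        "data_dir": "/home/tidb/tidb-data/tikv-20160",
        "log_dir": "/home/tidb/tidb-logs",
    }
    new_tiflash = {  # hypothetical new instance inheriting the same log_dir
        "name": "tiflash-9000",
        "host": "10.0.0.1",
        "deploy_dir": "/home/tidb/tidb-deploy/tiflash-9000",
        "data_dir": "/home/tidb/tidb-data/tiflash-9000",
        "log_dir": "/home/tidb/tidb-logs",
    }
    for hit in find_overlaps([existing_tikv, new_tiflash]):
        print(hit)
```

Under this assumed rule, the shared log_dir is the only pair reported for the two instances above. Giving the new instance its own log_dir makes the list empty.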

| username: kwongping2020 | Original post link

I have already tried it, but it still doesn’t work.

| username: kwongping2020 | Original post link

Hmm, I also suspect it might be due to this reason.
However, since it’s a production environment, using tiup cluster edit-config ${cluster-name} requires restarting the cluster, which is not feasible at the moment. Additionally, it’s not certain if this is the cause.

| username: 长安是只喵 | Original post link

Doesn’t the documentation say that TiFlash cannot be upgraded online from a version earlier than 5.3 to 5.3 or later?

| username: kwongping2020 | Original post link

[Image; not available in this translation]

| username: 长安是只喵 | Original post link

Yes, that’s exactly what I meant. It seems a bit complicated, so it’s better to just delete the TiFlash replica and add it back after the upgrade.

| username: kwongping2020 | Original post link

I have now upgraded and deleted TiFlash, and I can no longer add TiFlash or other components. It keeps indicating a directory conflict.