Online Cluster TiKV Label Change

This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 在线集群tikv变更label

| username: 普罗米修斯

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.2.4
[Reproduction Path]
There are two production clusters, each deploying two TiKV instances per host. One cluster has its labels correctly set at the host level, while the other's labels are wrong: the two instances on the same host carry different host labels, so a single server failure could lose multiple replicas of the same Region. The cluster has 4 servers running 8 TiKV nodes. I now want to re-label the problematic cluster so that both instances on the same host share the same label.
[Encountered Issues: Symptoms and Impact]

  1. Can labels be reset in the production environment? The current cluster has a total capacity of 8T, with 4.8T already used. Will reconfiguring the labels affect the online business due to region scheduling on TiKV, or will there be other issues with setting labels online?
  2. If labels can be changed online, are the following steps correct? (Taking one server as an example: store 16 is labeled host:tikv4 and store 15 is labeled host:tikv5, although both stores run on the same machine.)
    a. tiup cluster edit-config xx
    b. Reload the cluster to refresh the configuration
    c. Use pd-ctl to set the label: store label 15 host tikv4
    [Resource Configuration]
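Step (a) above would edit the per-instance labels in the cluster topology. A minimal sketch of what the corrected section might look like after tiup cluster edit-config, assuming hypothetical addresses and ports, and assuming PD's replication.location-labels already includes host:

```yaml
# Hypothetical addresses/ports. Both TiKV instances on the same machine
# get the same host label, so PD treats them as one failure domain and
# will not place two replicas of a Region on that machine.
tikv_servers:
  - host: 10.0.0.4
    port: 20160
    status_port: 20180
    config:
      server.labels: { host: "tikv4" }
  - host: 10.0.0.4
    port: 20161
    status_port: 20181
    config:
      server.labels: { host: "tikv4" }
```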

| username: TI表弟 | Original post link

For stability, this deployment is not recommended; TiKV should have at least three separate nodes.

| username: TI表弟 | Original post link

Do not change the host; doing so will mess up all the metadata.

| username: TI表弟 | Original post link

It is recommended to solve the problem based on the current situation.

| username: 路在何chu | Original post link

Can it really be done this way? Wouldn't you have to scale out new TiKV nodes first, then take the old ones offline to make a change like this?

| username: 普罗米修斯 | Original post link

Scaling out and then in is the safe approach. I'm asking the community first; if online modification isn't feasible, I'll fall back to scaling out and in.

| username: 普罗米修斯 | Original post link

Before we scaled in the day before yesterday, we had 10 nodes; now we have 8.

| username: TI表弟 | Original post link

I thought this was all one cluster.

| username: 普罗米修斯 | Original post link

It is a cluster.

| username: 随缘天空 | Original post link

Refer to the following link: TiUP 常见运维操作 (TiUP Common Operations) | PingCAP 文档中心
First, run tiup cluster edit-config ${cluster-name} to open the configuration file. After modifying the parameters and saving, execute the reload command to apply the new configuration.

| username: 普罗米修斯 | Original post link

Bro, that's not the point of my question. I already know how to edit the configuration.

| username: Jellybean | Original post link

Issue 1:

  1. First, yes, labels can be reset on a production cluster. But since the label configuration of TiKV and PD is modified through tiup cluster edit-config, you must reload the cluster afterward so that each component restarts with the new configuration. After the restart, PD will generate scheduling to rebalance Regions (including leaders) according to the new labels.
  2. Modifying these configuration parameters requires restarting the affected components; labels cannot take effect online without a restart.
  3. If you do not want to restart the entire cluster, add -R pd,tikv to the reload command to restrict the restart to the PD and TiKV nodes only.
  4. The rolling restart will be noticeable to the business, causing access jitter, increased latency, and so on. Communicate with the business side in advance and operate during off-peak hours.
  5. After the restart, you can let the default scheduling settings rebalance the data automatically. The defaults are relatively conservative, so rebalancing may be slow and take a long time; if you need to speed it up, adjust the scheduling configuration.
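If the post-restart rebalancing in point 5 is too slow, the scheduling limits can be raised through pd-ctl. A hedged sketch, assuming pd-ctl is invoked via tiup ctl and the PD endpoint is a placeholder; the values are illustrative and should be tuned while watching the impact on online traffic:

```shell
# Raise PD scheduling concurrency (placeholder PD endpoint).
tiup ctl:v5.2.4 pd -u http://<pd-host>:2379 config set leader-schedule-limit 8
tiup ctl:v5.2.4 pd -u http://<pd-host>:2379 config set region-schedule-limit 4096
# Allow every store to send/receive region snapshots at a higher rate.
tiup ctl:v5.2.4 pd -u http://<pd-host>:2379 store limit all 30
```

Remember to restore the defaults once the cluster is balanced, so routine scheduling stays gentle during business hours.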

Issue 2:

  1. No, the change cannot take effect online; a reload (restart) is required.
  2. For TiKV nodes on the same machine, configure the same host label.

For specific operations, please refer to the official documentation:

| username: 普罗米修斯 | Original post link

As I recall, reload rolls the new configuration out to the nodes one by one, and the cluster does not stop serving during the process.

| username: Jellybean | Original post link

That's correct: the reload is rolling, with each node refreshed and restarted one at a time. The cluster service will not stop, and overall operation is fine.

The main consideration is that if the node being restarted is the PD leader or the Region leader on TiKV, there will be some jitter in the cluster’s external services, which the business might perceive. In short, this operation is not absolutely transparent to the business, and the reload plan should be formulated based on the business’s sensitivity to latency.
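After the reload, it is worth confirming that every store actually carries the intended label before trusting the new placement. A sketch using pd-ctl via tiup ctl, with a placeholder PD endpoint:

```shell
# Show replication settings; location-labels should include "host".
tiup ctl:v5.2.4 pd -u http://<pd-host>:2379 config show replication
# List all stores with their labels; verify that both instances
# on the same machine now report the same host label.
tiup ctl:v5.2.4 pd -u http://<pd-host>:2379 store
```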

| username: 普罗米修斯 | Original post link

Let’s evaluate and perform the operation during a weekend low-peak period.

| username: TiDBer_gxUpi9Ct | Original post link

Learned it.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.