TiKV went offline and cannot start

translator_bot July 2, 2024, 4:28am 1

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv离线起不来

| username: 今天不想写代码

[TiDB Usage Environment] Production Environment
[TiDB Version] 6.2.0
[Encountered Problem: Symptoms and Impact]

A TiKV node in the production environment went offline and could not be started with tiup cluster.

translator_bot July 11, 2024, 9:35am 2

| username: TiDBer_jYQINSnf | Original post link

Rebuild this node, it’s the simplest and most reliable method.

translator_bot July 11, 2024, 9:35am 3

| username: zhanggame1 | Original post link

Scale in and then scale out; there’s no better way.

translator_bot July 11, 2024, 9:35am 4

| username: 像风一样的男子 | Original post link

Version 6.2 is a DMR (Development Milestone Release) and is not suitable for production use.

translator_bot July 11, 2024, 9:35am 5

| username: ffeenn | Original post link

Check whether the node’s disk is full or whether write permissions are missing.
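The two checks suggested above can be sketched in Python using only the standard library. This is a rough sketch, not a TiKV tool: the directory you pass in, the `check_data_dir` name, and the 5% free-space threshold are all assumptions for illustration, and the permission check applies to the user who runs the script, so it should be run as the user that owns the TiKV process.

```python
import os
import shutil
import tempfile


def check_data_dir(path, min_free_ratio=0.05):
    """Return a list of problems found for the directory at `path`.

    `min_free_ratio` is an arbitrary illustrative threshold, not a TiKV setting.
    """
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]

    problems = []

    # Disk usage of the filesystem that contains `path`.
    usage = shutil.disk_usage(path)
    if usage.total > 0 and usage.free / usage.total < min_free_ratio:
        problems.append(
            f"low disk space: {usage.free} of {usage.total} bytes free"
        )

    # W_OK: files can be created; X_OK: the directory can be entered.
    # Both are evaluated for the user running this script.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"no write permission on {path}")

    return problems


# Example run against a scratch directory (a stand-in for the real data directory).
scratch = tempfile.mkdtemp()
print(check_data_dir(scratch) or "no problems found")
```

An empty list means neither problem was detected; it does not rule out other causes of the failed start.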

translator_bot July 11, 2024, 9:35am 6

| username: Kongdom | Original post link

Doesn’t the appearance of “welcome” mean that it has started up?

translator_bot July 11, 2024, 9:35am 7

| username: TiDBer_jYQINSnf | Original post link

“Welcome” is only the first line printed; “ready to serve” is what indicates that it has actually started up.
In this case, it is restarting continuously.
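The distinction between the two messages can be turned into a small log check. The sketch below is only an illustration: it assumes that each start attempt logs a line containing “welcome” and that a successful start later logs “ready to serve” (matched case-insensitively), and the sample lines are made up rather than copied from a real TiKV log.

```python
def summarize_startup(lines):
    """Count start attempts and successful starts in an iterable of log lines."""
    starts = 0
    ready = 0
    ready_after_last_start = False

    for line in lines:
        lower = line.lower()
        if "welcome" in lower:
            # A new start attempt resets the "ready" state.
            starts += 1
            ready_after_last_start = False
        elif "ready to serve" in lower:
            ready += 1
            ready_after_last_start = True

    if starts == 0:
        status = "no start attempt found"
    elif ready_after_last_start:
        status = "started"
    elif starts > 1:
        status = "restarting repeatedly"
    else:
        status = "starting or failed before ready"

    return {"starts": starts, "ready": ready, "status": status}


# Hypothetical log excerpt: three start attempts, none of which became ready.
sample = [
    "[INFO] Welcome ...",
    "[INFO] Welcome ...",
    "[INFO] Welcome ...",
]
print(summarize_startup(sample))
```

To use it on a real log, pass an open file object, for example `summarize_startup(open(path, encoding="utf-8"))`, where `path` is wherever your deployment writes the TiKV log.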

translator_bot July 11, 2024, 9:35am 8

| username: xfworld | Original post link

It doesn’t have much impact; just scale the node in and then scale it out again.

translator_bot July 11, 2024, 9:35am 9

| username: TiDBer_QKDdYGfz | Original post link

Following. Scaling in on production is really scary.

translator_bot July 11, 2024, 9:35am 10

| username: Kongdom | Original post link

I hadn’t noticed this detail before~

translator_bot July 11, 2024, 9:35am 11

| username: 希希希望啊 | Original post link

No big deal, scale the node in and then scale it out again.

translator_bot July 11, 2024, 9:35am 12

| username: lemonade010 | Original post link

Was it offline for too long? Did that cause the log to be overwritten?

translator_bot July 11, 2024, 9:35am 13

| username: zhaokede | Original post link

Focus on solving the problem.
You can try this: first take a hard backup of the operating system, then do the scale-in and scale-out with that backup in place. It’s safer this way.

translator_bot July 11, 2024, 9:35am 14

| username: TiDBer_ZxWlj6A1 | Original post link

Is it really that fragile? Scaling in and out seems quite troublesome.

translator_bot July 11, 2024, 9:35am 15

| username: 呢莫不爱吃鱼 | Original post link

First scale in, then scale out.

translator_bot July 11, 2024, 9:35am 16

| username: 小于同学 | Original post link

Upgrade the version.

translator_bot July 11, 2024, 9:35am 17

| username: 今天不想写代码 | Original post link

I tried to scale it in, but it’s been two hours and it still hasn’t finished. Is the node so broken that it can’t be scaled in?

translator_bot July 11, 2024, 9:35am 18

| username: 像风一样的男子 | Original post link

How many nodes do you have in total? Don’t tell me you only have three nodes and one of them is down?

translator_bot July 11, 2024, 9:35am 19

| username: xfworld | Original post link

No. Before you scale in, you need to check whether the number of nodes is sufficient. If it isn’t, you need to scale out first and then scale in.
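The rule above comes down to simple arithmetic: after the node is removed, there must still be at least as many TiKV stores as there are replicas of each Region. A minimal sketch, assuming the default of 3 replicas; `can_scale_in` and its parameter names are made up for illustration and are not part of tiup or PD.

```python
def can_scale_in(total_stores, stores_to_remove, max_replicas=3):
    """Return True if enough stores remain to hold every replica.

    total_stores:     TiKV stores currently in the cluster, including
                      the ones that are about to be removed.
    stores_to_remove: number of stores being scaled in.
    max_replicas:     replicas kept per Region (assumed to be 3 here).
    """
    remaining = total_stores - stores_to_remove
    return remaining >= max_replicas


# Three stores, one being removed: only two would remain for three replicas.
print(can_scale_in(3, 1))  # False

# Scale out to four stores first, then remove the broken one.
print(can_scale_in(4, 1))  # True
```

With only three stores, the replicas on the removed store have nowhere to go, which is consistent with a scale-in that never finishes.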

translator_bot July 11, 2024, 9:35am 20

| username: TiDBer_jYQINSnf | Original post link

Even if it’s broken, it can still be scaled in. When you run the store command, you will see the regions on it gradually decrease as the other TiKV nodes replenish the replicas.
If it still holds a leader, then your cluster has a problem.
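The check described above can be written down as a small helper that interprets a few readings of the store’s leader and region counts taken some minutes apart. This is a sketch under assumptions: the counts are typed in by hand from whatever the store command reports, and `describe_scale_in` and its messages are hypothetical, not output of any TiDB tool.

```python
def describe_scale_in(samples):
    """Interpret (leader_count, region_count) readings, oldest first."""
    if not samples:
        raise ValueError("need at least one sample")

    first_regions = samples[0][1]
    last_leaders, last_regions = samples[-1]

    if last_regions == 0:
        return "finished: no regions left on the store"
    if last_leaders > 0:
        # For a store that is already down, leaders should have moved away.
        return "check the cluster: the store still holds leaders"
    if last_regions < first_regions:
        return "progressing: region count is decreasing"
    return "stalled: region count is not decreasing"


# Hypothetical readings taken over time for the store being removed.
print(describe_scale_in([(0, 100), (0, 70), (0, 40)]))
print(describe_scale_in([(0, 100), (0, 100)]))
```

A “stalled” result after a long wait is a hint to go back and check whether enough other stores exist to receive the replicas.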