TiKV Scaling Down Causes TiDB Connection Interruptions

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tikv缩容,影响tidb连接中断

| username: TiDB_C罗

[TiDB Usage Environment] Production Environment
[TiDB Version] v5.4.0
[Reproduction Path] TiKV Disk Shrinkage
[Encountered Problem: Phenomenon and Impact]
[Attachment: Screenshot/Log/Monitoring]
Steps to shrink the TiKV disk: stop TiKV, copy the data, then restart TiKV.

  1. Stop TiKV first

    tiup cluster stop tidb-risk -N 10.0.0.17:20160

Check the cluster status; the stopped node shows Disconnected:

ID               Role          Host       Ports        OS/Arch       Status        Data Dir                           Deploy Dir
--               ----          ----       -----        -------       ------        --------                           ----------
10.0.0.10:9093   alertmanager  10.0.0.10  9093/9094    linux/x86_64  Up            /data/tidb-data/alertmanager-9093  /data/tidb-deploy/alertmanager-9093
10.0.0.10:3000   grafana       10.0.0.10  3000         linux/x86_64  Up            -                                  /data/tidb-deploy/grafana-3000
10.0.0.11:2379   pd            10.0.0.11  2379/2380    linux/x86_64  Up|L          /data/tidb-data/pd-2379            /data/tidb-deploy/pd-2379
10.0.0.12:2379   pd            10.0.0.12  2379/2380    linux/x86_64  Up            /data/tidb-data/pd-2379            /data/tidb-deploy/pd-2379
10.0.0.13:2379   pd            10.0.0.13  2379/2380    linux/x86_64  Up|UI         /data/tidb-data/pd-2379            /data/tidb-deploy/pd-2379
10.0.0.10:9090   prometheus    10.0.0.10  9090/12020   linux/x86_64  Up            /data/tidb-data/prometheus-9090    /data/tidb-deploy/prometheus-9090
10.0.0.14:4000   tidb          10.0.0.14  4000/10080   linux/x86_64  Up            -                                  /data/tidb-deploy/tidb-4000
10.0.0.15:4000   tidb          10.0.0.15  4000/10080   linux/x86_64  Up            -                                  /data/tidb-deploy/tidb-4000
10.0.0.16:4000   tidb          10.0.0.16  4000/10080   linux/x86_64  Up            -                                  /data/tidb-deploy/tidb-4000
10.0.0.17:20160  tikv          10.0.0.17  20160/20180  linux/x86_64  Disconnected  /data/tidb-data/tikv-20160         /data/tidb-deploy/tikv-20160
10.0.0.18:20160  tikv          10.0.0.18  20160/20180  linux/x86_64  Up            /data/tidb-data/tikv-20160         /data/tidb-deploy/tikv-20160
10.0.0.19:20160  tikv          10.0.0.19  20160/20180  linux/x86_64  Up            /data/tidb-data/tikv-20160         /data/tidb-deploy/tikv-20160
  2. Copy data
    cp -a /data /data1

  3. Start TiKV
    tiup cluster start tidb-risk -N 10.0.0.17:20160

Check the TiKV details:


When one TiKV node is shut down, a leader election occurs, and after the node recovers, leaders are rebalanced across the nodes.
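
For reference, the leader distribution per store can be watched from pd-ctl while this happens; a minimal sketch (the PD endpoint is taken from the topology above, and the grep filter is only illustrative):

# Show each store's address and current leader_count from PD.
tiup ctl:v5.4.0 pd -u http://10.0.0.11:2379 store | grep -E '"address"|"leader_count"'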

At this time, the business receives an alert:
com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

The last packet successfully received from the server was 10,674 milliseconds ago. The last packet sent successfully to the server was 10,674 milliseconds ago.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:953)
at com.mysql.cj.jdbc.ClientPreparedStatement.execute(ClientPreparedStatement.java:370)
at com.opay.realtime.etl.util.JdbcUtil.riskSink(JdbcUtil.scala:142)
at com.opay.risk.features.sink.TableSinkMappingBroadcastProcessExt$TableSinkMappingProcess.doProcess(TableSinkMappingBroadcastProcessExt.scala:34)
at com.opay.risk.features.sink.TableSinkMappingBroadcastProcessFunction$$anonfun$processElement$1.apply(TableSinkMappingBroadcastProcessFunction.scala:64)
at com.opay.risk.features.sink.TableSinkMappingBroadcastProcessFunction$$anonfun$processElement$1.apply(TableSinkMappingBroadcastProcessFunction.scala:63)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
at com.opay.risk.features.sink.TableSinkMappingBroadcastProcessFunction.processElement(TableSinkMappingBroadcastProcessFunction.scala:63)
at com.opay.risk.features.sink.TableSinkMappingBroadcastProcessFunction.processElement(TableSinkMappingBroadcastProcessFunction.scala:24)
at org.apache.flink.streaming.api.operators.co.CoBroadcastWithNonKeyedOperator.processElement1(CoBroadcastWithNonKeyedOperator.java:110)
at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessorFactory.processRecord1(StreamTwoInputProcessorFactory.java:213)
at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessorFactory.lambda$create$0(StreamTwoInputProcessorFactory.java:178)
at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessorFactory$StreamTaskNetworkOutput.emitRecord(StreamTwoInputProcessorFactory.java:291)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:96)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:423)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure

The last packet successfully received from the server was 10,674 milliseconds ago. The last packet sent successfully to the server was 10,674 milliseconds ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105)
at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151)
at com.mysql.cj.exceptions.ExceptionFactory.createCommunicationsException(ExceptionFactory.java:167)
at com.mysql.cj.protocol.a.NativeProtocol.readMessage(NativeProtocol.java:546)
at com.mysql.cj.protocol.a.NativeProtocol.checkErrorMessage(NativeProtocol.java:710)
at com.mysql.cj.protocol.a.NativeProtocol.sendCommand(NativeProtocol.java:649)
at com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:948)
at com.mysql.cj.NativeSession.execSQL(NativeSession.java:1075)
at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:930)
… 25 more
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:67)
at com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:63)
at com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:45)
at com.mysql.cj.protocol.a.TimeTrackingPacketReader.readHeader(TimeTrackingPacketReader.java:52)
at com.mysql.cj.protocol.a.TimeTrackingPacketReader.readHeader(TimeTrackingPacketReader.java:41)
at com.mysql.cj.protocol.a.MultiPacketReader.readHeader(MultiPacketReader.java:54)
at com.mysql.cj.protocol.a.MultiPacketReader.readHeader(MultiPacketReader.java:44)
at com.mysql.cj.protocol.a.NativeProtocol.readMessage(NativeProtocol.java:540)
… 30 more

Connection parameters: jdbc:mysql://10.0.0.11:3306/orders?useSSL=false&rewriteBatchedStatements=true&autoReconnect=true
Port 3306 is HAProxy running on the PD node, proxying the three tidb-4000 instances behind it.
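
For context, a minimal sketch of what such an HAProxy section might look like (the file name, balance policy, and health-check values below are assumptions, not taken from the actual deployment):

# Hypothetical HAProxy config proxying the three tidb-4000 instances on port 3306.
cat <<'EOF' > haproxy-tidb.cfg.example
listen tidb-cluster
    bind 0.0.0.0:3306
    mode tcp
    balance roundrobin
    server tidb-1 10.0.0.14:4000 check inter 2000 rise 2 fall 3
    server tidb-2 10.0.0.15:4000 check inter 2000 rise 2 fall 3
    server tidb-3 10.0.0.16:4000 check inter 2000 rise 2 fall 3
EOF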

My understanding is that the TiKV operation triggered a leader election and requests were then routed to the new leader stores, so it should not cause connection errors on TiDB. Why does it interrupt the connections to TiDB?

| username: tidb菜鸟一只 | Original post link

Is the scaling down related to storage?

| username: 有猫万事足 | Original post link

It should be that your TiKV temporarily does not meet the minimum requirement of 3 replicas.

| username: xingzhenxiang | Original post link

TiDB defaults to three replicas, but you have reduced it to two TiKV nodes.

| username: h5n1 | Original post link

First, check the logs of those three TiDB servers when the error occurred.

| username: TiDB_C罗 | Original post link

Just stop the instance and replace its disk with a smaller one.

| username: TiDB_C罗 | Original post link

The deployment requires at least three nodes and tolerates one node being unavailable, which is exactly my situation.

| username: TiDB_C罗 | Original post link

TiKV logs at the time:

[2023/07/04 01:33:23.989 +00:00] [INFO] [signal_handler.rs:19] ["receive signal 15, stopping server..."]
[2023/07/04 01:33:24.043 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.14:39746: Sending goaway err={\"created\":\"@1688434404.043286523\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]
[2023/07/04 01:33:24.043 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.14:39742: Sending goaway err={\"created\":\"@1688434404.043293155\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]
[2023/07/04 01:33:24.044 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.14:39740: Sending goaway err={\"created\":\"@1688434404.043294237\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]
[2023/07/04 01:33:24.044 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.14:39744: Sending goaway err={\"created\":\"@1688434404.043296301\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]
[2023/07/04 01:33:24.044 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.15:60918: Sending goaway err={\"created\":\"@1688434404.043297724\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]
[2023/07/04 01:33:24.044 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.15:60912: Sending goaway err={\"created\":\"@1688434404.043299056\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]
[2023/07/04 01:33:24.044 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.15:60916: Sending goaway err={\"created\":\"@1688434404.043299958\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]
[2023/07/04 01:33:24.044 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.15:60914: Sending goaway err={\"created\":\"@1688434404.043311610\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]
[2023/07/04 01:33:24.044 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.16:41874: Sending goaway err={\"created\":\"@1688434404.043312832\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]
[2023/07/04 01:33:24.044 +00:00] [INFO] [<unknown>] ["ipv4:10.0.0.16:41869: Sending goaway err={\"created\":\"@1688434404.043313994\",\"description\":\"Server shutdown\",\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/surface/server.cc\",\"file_line\":480,\"grpc_status\":0}"]

TiDB logs at the time:

[2023/07/04 01:33:24.057 +00:00] [INFO] [client_batch.go:609] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.0.17:20160] [forwardedHost=] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/04 01:33:24.057 +00:00] [INFO] [client_batch.go:609] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.0.17:20160] [forwardedHost=] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/04 01:33:24.057 +00:00] [INFO] [client_batch.go:609] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.0.17:20160] [forwardedHost=] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/04 01:33:24.057 +00:00] [INFO] [client_batch.go:609] ["batchRecvLoop fails when receiving, needs to reconnect"] [target=10.0.0.17:20160] [forwardedHost=] [error="rpc error: code = Unavailable desc = transport is closing"]
[2023/07/04 01:33:24.364 +00:00] [WARN] [client_batch.go:365] ["no available connections"] [target=10.0.0.17:20160]
[2023/07/04 01:33:24.366 +00:00] [INFO] [region_cache.go:2199] ["[health check] check health error"] [store=10.0.0.17:20160] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.0.0.17:20160: connect: connection refused\""]
[2023/07/04 01:33:24.366 +00:00] [INFO] [region_request.go:785] ["mark store's regions need be refill"] [id=297684] [addr=10.0.0.17:20160] [error="no available connections"]
| username: tidb菜鸟一只 | Original post link

If one of the three replicas goes down, TiDB won't suffer a complete outage, but it also won't keep every connection uninterrupted. For example, suppose you are connected to the TiDB node at 10.0.0.14:4000 executing SQL, and that SQL is reading leader data from the TiKV node at 10.0.0.17:20160. If 10.0.0.17:20160 suddenly goes offline, the SQL will inevitably report an error and require a reconnection. At that point the followers on the other nodes are elected leaders and continue serving TiDB, so the cluster can still provide service, but a reconnection is definitely needed.

| username: redgame | Original post link

Lost connection… trying again.

| username: buptzhoutian | Original post link

This method is a bit brute-force; it should only be done with the entire cluster stopped.

If you want the disk replacement process to be transparent to the client, the official recommended approach is to use TiUP’s scale-in/scale-out.
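
For reference, a rough sketch of that route (the cluster name comes from this thread; the scale-out topology file name is hypothetical):

# Scale in the TiKV node; PD migrates all regions off the store before it is removed.
tiup cluster scale-in tidb-risk -N 10.0.0.17:20160
# Wait until the store has gone from Pending Offline to Tombstone, then clean it up.
tiup cluster display tidb-risk
tiup cluster prune tidb-risk
# After replacing the disk, add the node back with a scale-out topology file.
tiup cluster scale-out tidb-risk scale-out-tikv.yaml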

| username: TiDB_C罗 | Original post link

So why do the client connections to TiDB have to be broken? My understanding is that TiDB only needs to reconnect to the other TiKV nodes, and at worst requests would just be slower. Also, for this situation, is there a way to manually transfer the leaders on this TiKV node to the other TiKV nodes first?

| username: tidb菜鸟一只 | Original post link

You can first evict all leaders on that TiKV node to the other TiKV nodes with the pd-ctl command:
scheduler add evict-leader-scheduler 1 (adds a scheduler that moves all leaders off Store 1).
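
A rough sketch of that flow through tiup's bundled pd-ctl (the store ID is a placeholder; look it up first with the store command):

# Find the store ID of 10.0.0.17:20160 and its current leader_count.
tiup ctl:v5.4.0 pd -u http://10.0.0.11:2379 store
# Evict all leaders from that store (replace <store-id> with the real ID).
tiup ctl:v5.4.0 pd -u http://10.0.0.11:2379 scheduler add evict-leader-scheduler <store-id>
# Once leader_count reaches 0, stop the node, copy the data, and start it again.
# Then remove the scheduler so leaders can balance back onto the node.
tiup ctl:v5.4.0 pd -u http://10.0.0.11:2379 scheduler remove evict-leader-scheduler-<store-id>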

| username: TiDB_C罗 | Original post link

The scale-in/scale-out method is feasible, but the drawback is that it takes too long.

| username: TiDB_C罗 | Original post link

This is indeed the best practice, but I still feel the connections to TiDB should not need to be interrupted.

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.