Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: tikv查看netstat存在大量僵尸连接 (TiKV: netstat shows a large number of zombie connections)
[TiDB Usage Environment] Production Environment / Testing / PoC
[TiDB Version]
[Reproduction Path] Execute netstat -nop | grep -i "tikv-serv"
[Encountered Problem: Problem Phenomenon and Impact]
When executing netstat -nop | grep -i "tikv-serv", many connections from client hosts that have already gone offline still show as ESTABLISHED.
The system has already set
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=5
net.ipv4.tcp_keepalive_probes=3
However, TiKV has not enabled TCP keepalive (SO_KEEPALIVE) on these sockets, so the kernel settings above never take effect and the dead connections are never reclaimed (see the sketch below).
- How can TiKV enable the tcp_keepalive function?
- How can these connections be closed or reclaimed? (Preferably without restarting or affecting the business)
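For context on why the sysctls alone do not help: the kernel's net.ipv4.tcp_keepalive_* values only apply to sockets whose owning process has opted in with SO_KEEPALIVE. A minimal Go sketch of what that opt-in looks like at the socket level (the address is a placeholder; this is an illustration, not TiKV's actual code):

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Placeholder TiKV address, for illustration only.
	conn, err := net.Dial("tcp", "10.0.0.1:20160")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	tcpConn := conn.(*net.TCPConn)
	// The net.ipv4.tcp_keepalive_* sysctls only take effect on sockets
	// that have SO_KEEPALIVE set, which is what this call does.
	if err := tcpConn.SetKeepAlive(true); err != nil {
		log.Fatal(err)
	}
	// Optionally override the probe interval for this particular socket.
	if err := tcpConn.SetKeepAlivePeriod(60 * time.Second); err != nil {
		log.Fatal(err)
	}
}
```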
First determine the zombie process's parent: in the ps output, the PPID (parent process ID) column shows each process's parent.
Then analyze the parent's behavior and why it has not reaped its child. It may be waiting on some condition before collecting the child, or it may itself have crashed or hung.
In what scenario is this being used? Is it bare TiKV used directly as a key-value store? The client side normally has a connection-recycling mechanism; could the leftover connections have been caused by a forced shutdown?
Are these the connections between TiKV and the TiDB server?
There are no zombie processes; the client host has already gone offline, but the connections still show as ESTABLISHED on the TiKV side.
The use case is TiKV as the metadata engine for JuiceFS, using TiKV's txn mode. The leftover connections were most likely caused by forcibly killing the client process.
No, the connection is generated by connecting directly to TiKV using client-go.
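For reference, the access pattern is roughly the one from the client-go v2 README. The sketch below only illustrates how such connections get created (the PD address is a placeholder, and this is not our actual code):

```go
package main

import (
	"context"
	"log"

	"github.com/tikv/client-go/v2/txnkv"
)

func main() {
	// Connect through PD; client-go then opens gRPC connections
	// straight to each TiKV node. Those are the connections that
	// show up in netstat on the TiKV side.
	client, err := txnkv.NewClient([]string{"127.0.0.1:2379"}) // placeholder PD address
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// A minimal transaction in txn mode, the same mode JuiceFS
	// uses for its metadata.
	txn, err := client.Begin()
	if err != nil {
		log.Fatal(err)
	}
	if err := txn.Set([]byte("k"), []byte("v")); err != nil {
		log.Fatal(err)
	}
	if err := txn.Commit(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```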
I suggest logging in to the client machine (the foreign-address column in the netstat output) to see which process is connecting to TiKV. It's not only the TiDB server that connects to TiKV; tools such as TiDB Lightning can connect as well.
client-go has a standard pattern for handling abnormal connections, right? Could that have been missed, causing this problem?
For now, let's set aside how the connections were created and solve the current problem first.
Where they come from is very clear: client-go inside the JuiceFS client. The key issue is that the client machine has already gone offline, yet the connections still show as ESTABLISHED.
Restart the TiKV instances node by node… it's best to have enough nodes to keep the required number of replicas available…
As for working around the application's shortcomings from the database side, that seems quite difficult, but you can keep trying…
Even if the application is using the connections unreasonably, should the database side not protect itself? In theory the server side ought to handle this as well.
The open-source community may not have the capability to support this at the moment. You could open an issue to see whether there are plans to iterate on it.
If your need for this feature is particularly strong, consider seeking support from the vendor and upgrading your service level; you may be able to get a solution that way.
You can try capturing packets first. In theory gRPC also has application-layer keepalive (HTTP/2 PING frames), so connections to dead peers should not linger.
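To make that concrete, here is what application-layer keepalive looks like on a plain Go gRPC server. This is a generic sketch, not TiKV's code; if I remember correctly, TiKV exposes similar knobs as grpc-keepalive-time / grpc-keepalive-timeout under [server] in its configuration, and the values below are only illustrative:

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Server-side gRPC keepalive: ping idle clients with HTTP/2 PING
	// frames and drop the connection if no ACK comes back in time.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		Time:    10 * time.Second, // ping after 10s of inactivity
		Timeout: 3 * time.Second,  // close if the ping is not answered in 3s
	}))

	lis, err := net.Listen("tcp", ":20160") // placeholder port
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(srv.Serve(lis))
}
```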
Join the group for consultation.
Go to the source and see what tool or application is connecting.
Has this issue been resolved?