Abnormal Restart of tidb_server

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tidb_server异常重启

| username: 等一分钟

Recently, I found that tidb_server frequently restarts abnormally. The error message is as follows:
[root@izuf6fxgyhdpixhne3l54nz record]# more running_sql2023-04-16T04:11:14+08:00
The 10 SQLs with the most memory usage for OOM analysis

The 10 SQLs with the most time usage for OOM analysis

Neither section recorded any specific SQL.

| username: 等一分钟 | Original post link

The SQLs recorded in some of the OOM logs don't use much memory either.
[root@izuf6fxgyhdpixhne3l54mz record]# grep mem_max running_sql2023-04-17T12:46:59+08:00
mem_max: 6741 Bytes (6.58 KB)
mem_max: 3200 Bytes (3.12 KB)
mem_max: 3200 Bytes (3.12 KB)
mem_max: 1804 Bytes (1.76 KB)
mem_max: 1084 Bytes (1.06 KB)
mem_max: 986 Bytes (986 Bytes)
mem_max: 986 Bytes (986 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 3200 Bytes (3.12 KB)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)
mem_max: 0 Bytes (0 Bytes)

| username: 等一分钟 | Original post link

[The original post contained an image that could not be translated.]

| username: 等一分钟 | Original post link

The memory usage of the crashed host is also not high.

| username: xingzhenxiang | Original post link

TiDB server OOM is a common issue. You can monitor which SQL statements use a lot of memory, set a threshold, and kill the ones that exceed it to reduce the occurrence of OOM.

| username: 等一分钟 | Original post link

The issue is that from the OOM log above, no SQL with high memory usage was found, and some SQLs did not record any information.

| username: TiDBer_pkQ5q1l0 | Original post link

It may not necessarily be caused by SQL; internal TiDB bugs can also lead to OOM, such as an excessive number of slow log files, analyze version problems, and so on.
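If you want to rule those out, both are fairly easy to check. The commands below are only a sketch: the connection details and the slow-log path are placeholders for your own deployment, and tidb_analyze_version only exists on newer TiDB versions (v5.1 and later).

# Check which analyze version the cluster uses (placeholders for host/password).
/mysql5.7/bin/mysql -hXXX.XXX.XXX -p'yourpassword' -e "show variables like 'tidb_analyze_version';"
# Count the slow log files under the tidb-server log directory (path is an assumed example).
ls /path/to/tidb/log/tidb_slow_query* 2>/dev/null | wc -l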

| username: 等一分钟 | Original post link

goroutine 83243009 [IO wait, 98 minutes]:
internal/poll.runtime_pollWait(0x7f8bb2138830, 0x72, 0xffffffffffffffff)
/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc0ba059d18, 0x72, 0x4000, 0x4000, 0xffffffffffffffff)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0ba059d00, 0xc03f808000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
/usr/local/go/src/internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc0ba059d00, 0xc03f808000, 0x4000, 0x4000, 0xb, 0x0, 0x0)
/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc079f3a350, 0xc03f808000, 0x4000, 0x4000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/net.go:183 +0x91
bufio.(*Reader).Read(0xc0bf706c00, 0xc0830a826c, 0x4, 0x4, 0x0, 0xc0405ff4d8, 0xc0b1431100)
/usr/local/go/src/bufio/bufio.go:227 +0x222
github.com/pingcap/tidb/server.bufferedReadConn.Read(...)
/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/buffered_read_conn.go:30
io.ReadAtLeast(0x3fe95e0, 0xc16480ebe8, 0xc0830a826c, 0x4, 0x4, 0x4, 0x0, 0x72, 0x0)
/usr/local/go/src/io/io.go:328 +0x87
io.ReadFull(...)
/usr/local/go/src/io/io.go:347
github.com/pingcap/tidb/server.(*packetIO).readOnePacket(0xc0d307ed40, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/packetio.go:85 +0x85
github.com/pingcap/tidb/server.(*packetIO).readPacket(0xc0d307ed40, 0x65d57e85b663, 0x5d6a8a0, 0x0, 0x5d6a8a0, 0x0)
/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/packetio.go:116 +0x4f
github.com/pingcap/tidb/server.(*clientConn).readPacket(...)
/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:387
github.com/pingcap/tidb/server.(*clientConn).Run(0xc060c917a0, 0x4038470, 0xc0cc324030)
/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:963 +0x1c5
github.com/pingcap/tidb/server.(*Server).onConn(0xc0257e0410, 0xc060c917a0)
/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/server.go:501 +0xa53
created by github.com/pingcap/tidb/server.(*Server).startNetworkListener
/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/server.go:404 +0x8fc

| username: 等一分钟 | Original post link

Can you tell anything from this?

| username: TiDBer_pkQ5q1l0 | Original post link

It’s not clear. Try using pprof to generate a heap profile and see which functions are consuming the most memory.
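For reference, tidb-server exposes the standard Go pprof endpoints on its status port (10080 by default), so a heap profile can be captured roughly like this; the host below is a placeholder for your own deployment:

# Dump a heap profile from the tidb-server status port (default 10080).
curl -s http://<tidb-server-ip>:10080/debug/pprof/heap -o heap.profile
# Show the functions holding the most memory (needs a local Go toolchain or the pprof tool).
go tool pprof -top heap.profile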

| username: TiDBer_pkQ5q1l0 | Original post link

Handling OOM Issues Requires Collecting Diagnostic Information
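Since the host's overall memory usage looked low, it is also worth confirming whether the OS OOM killer actually terminated the process. A quick check on the host where tidb-server restarted might look like this:

# Look for OOM killer activity in the kernel log around the restart time.
dmesg -T | grep -i -E "out of memory|oom-killer|killed process"
# On systems that keep /var/log/messages, the same events show up there.
grep -i "oom" /var/log/messages | tail -n 20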

| username: 等一分钟 | Original post link

Okay, I’ll give it a try.

| username: 等一分钟 | Original post link

Is there documentation on how to operate this?

| username: TiDBer_pkQ5q1l0 | Original post link

[The original reply contained an image that could not be translated.]

| username: xingzhenxiang | Original post link

Here is my script for handling such SQL. When a statement uses more than roughly 10 GB of memory or has been running for more than 10 minutes, I log it and then kill it. I put it in crontab to run every minute.

#!/bin/bash
# Find statements that have used more than ~10 GB of memory or have been running for over 10 minutes.
for list in `/mysql5.7/bin/mysql -hXXX.XXX.XXX -p'yourpassword' -vvv -e "select id from INFORMATION_SCHEMA.processlist a where a.info is not null and (mem >= 11474836480 or time > 600);" | grep -Ev 'id|ID|iD|Id' | awk -F "|" '{print $2}'`
do
  echo $list
  # Record the offending statement before killing it.
  /mysql5.7/bin/mysql -hXXX.XXX.XXX -p'yourpassword' -vvv -e "select id, time, info, mem from INFORMATION_SCHEMA.processlist a where id=$list and a.info is not null;" > /sh/killtestlog/`date +%s`.log
  # TiDB uses the "kill tidb <id>" syntax to terminate its own connections.
  /mysql5.7/bin/mysql -hXXX.XXX.XXX -p'yourpassword' -vvv -e "kill tidb $list;"
done
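For reference, the crontab entry to run it every minute would look something like this; the script path is just an assumed example:

# Run the kill script every minute (assumed path; adjust to wherever you save it).
* * * * * /bin/bash /sh/kill_big_sql.sh >> /sh/killtestlog/cron.log 2>&1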
| username: 等一分钟 | Original post link

Thank you.

| username: 等一分钟 | Original post link

Why doesn't tidb_server release the memory it has already used?

| username: knull | Original post link

This is a characteristic of the Go runtime: the garbage collector (GC) first reclaims the application's memory, and the reclaimed memory is then managed by the runtime; it is only returned to the operating system periodically, not immediately. Most garbage-collected languages work this way, right?
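You can actually observe this on a running tidb-server: the Prometheus metrics on the status port include the Go runtime's heap statistics, where memory the GC has reclaimed but not yet handed back to the OS shows up as idle rather than released. A rough check, with the host as a placeholder:

# Compare heap memory in use, held idle by the Go runtime, and already returned to the OS.
curl -s http://<tidb-server-ip>:10080/metrics | grep -E "^go_memstats_heap_(inuse|idle|released)_bytes"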