This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: 6.1.2

| username: Soysauce520

[TiDB Usage Environment] Testing
[TiDB Version] v6.1.2
[Reproduction Path]
[Encountered Issue: Problem Phenomenon and Impact] TiDB OOMed; it is co-deployed on the same host as a PD component. Drainer restarted, and the following log appeared.
[Resource Configuration]
[Attachments: Screenshots/Logs/Monitoring]
The following appears in drainer.log
[ERROR] [server.go:242] ["send heartbeat failed"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\ngithub.com/pingcap/errors.AddStack\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/juju_adaptor.go:15\ngithub.com/pingcap/tidb-binlog/pkg/etcd.(*Client).Get\n\t/var/lib/docker/jenkins/workspace/build-common@5/go/src/github.com/pingcap/tidb-binlog/pkg/etcd/etcd.go:96\ngithub.com/pingcap/tidb-binlog/pkg/node.(*EtcdRegistry).checkNodeExists\n\t/var/lib/docker/jenkins/workspace/build-common@5/go/src/github.com/pingcap/tidb-binlog/pkg/node/registry.go:104\ngithub.com/pingcap/tidb-binlog/pkg/node.(*EtcdRegistry).UpdateNode\n\t/var/lib/docker/jenkins/workspace/build-common@5/go/src/github.com/pingcap/tidb-binlog/pkg/node/registry.go:91\ngithub.com/pingcap/tidb-binlog/drainer.(*Server).updateStatus\n\t/var/lib/docker/jenkins/workspace/build-common@5/go/src/github.com/pingcap/tidb-binlog/drainer/server.go:441\ngithub.com/pingcap/tidb-binlog/drainer.(*Server).heartbeat\n\t/var/lib/docker/jenkins/workspace/build-common@5/go/src/github.com/pingcap/tidb-binlog/drainer/server.go:240\ngithub.com/pingcap/tidb-binlog/drainer.(*Server).Start.func1\n\t/var/lib/docker/jenkins/workspace/build-common@5/go/src/github.com/pingcap/tidb-binlog/drainer/server.go:272\ngithub.com/pingcap/tidb-binlog/drainer.(*taskGroup).start.func1\n\t/var/lib/docker/jenkins/workspace/build-common@5/go/src/github.com/pingcap/tidb-binlog/drainer/util.go:81\nruntime.goexit\n\t/usr/local/go1.18.5/src/runtime/asm_arm64.s:1263"]

Dear experts, what caused the drainer to restart? I checked the network and found no issues; the ping results are all normal.

| username: 像风一样的男子 | Original post link

Check the SQL memory usage during the OOM period; it is highly likely caused by slow SQL.

| username: tidb菜鸟一只 | Original post link

You can grep "expensive_query" in tidb.log; that keyword marks SQL statements that ran past the time limit or exceeded the memory threshold, which should help you pin down what caused the TiDB OOM. You can also adjust the per-statement memory threshold (default 4 GB) through the tidb_mem_quota_query system variable so that a single SQL statement cannot take the TiDB service down.
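A minimal sketch of the two steps above. The log path and the sample log line are assumptions for illustration (real deployments keep tidb.log under the deploy directory, and the exact field layout varies by version); only the `expensive_query` keyword and the `tidb_mem_quota_query` variable name come from the advice above.

```shell
#!/bin/sh
# Synthetic stand-in for tidb.log so the example is self-contained;
# on a real node you would grep the actual tidb.log instead.
cat > /tmp/tidb.log <<'EOF'
[2022/11/01 10:00:00.000 +08:00] [WARN] [expensivequery.go:145] [expensive_query] [cost_time=120s] [sql="select * from big_table"]
[2022/11/01 10:00:01.000 +08:00] [INFO] [server.go:100] ["normal log line"]
EOF

# Step 1: find the statements that tripped the time/memory threshold.
grep "expensive_query" /tmp/tidb.log

# Step 2 (shown as a comment -- requires a live TiDB connection):
# mysql -h <tidb-host> -P 4000 -u root \
#   -e "SET GLOBAL tidb_mem_quota_query = 4 << 30;"  # 4 GB per statement
```

On a real cluster you would also cross-check the `cost_time` and SQL text in the matched lines against the OOM window from your monitoring.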

| username: Soysauce520 | Original post link

It is confirmed that TiDB OOMed, and the offending SQL has been identified. What I want to know is why the drainer restarted. I have edited the problem description above accordingly.

| username: Soysauce520 | Original post link

I found the SQL that caused the TiDB server to OOM and restart, and I want to analyze the reason for the drainer restart.

| username: 普罗米修斯 | Original post link

Do you have monitoring of the drainer process's memory usage? Did it restart during startup or during normal operation? Drainer uses a particularly large amount of memory during startup because it traverses the upstream historical data in memory; after startup completes, memory usage returns to normal.

| username: 像风一样的男子 | Original post link

Are there any other error logs for drainer?

| username: Soysauce520 | Original post link

There is monitoring, and memory usage is normal.

| username: Soysauce520 | Original post link

The stderr log from the restart shows: {"logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x52e038e000/","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}

| username: Daniel-W | Original post link

Check the status of PD.

| username: Soysauce520 | Original post link

PD status shows everything is normal, and there was no leader switch. What else should I check?

| username: Daniel-W | Original post link

Check whether there are any issues with the specific PD the drainer was connected to.

| username: Soysauce520 | Original post link

The PD the drainer was connected to happens to be on the same machine where TiDB OOMed, and its log shows it was stuck for 2 minutes. However, it is only a follower, so I don't understand why this would matter; I will dig into the follower's logs further.

| username: Fly-bird | Original post link

Is it an issue with PD?

| username: system | Original post link

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.