Drainer Crashed and Cannot Start

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: drainer挂掉无法启动

| username: TiDBer_f3MDYqWL

[TiDB Usage Environment] Production Environment
[TiDB Version] 2.1.2
[Encountered Problem: Phenomenon and Impact]
Recently, after enabling the binlog component for the cluster through Ansible, the drainer has been restarting frequently with the error logs below. Following the documentation, I have already configured the Kafka parameters:
message.max.bytes=1073741824
replica.fetch.max.bytes=1073741824
fetch.message.max.bytes=1073741824
However, after a while the drainer crashes again. Kafka shows no anomalies and there are no OOM errors. Is there a way to control the maximum message size on the sending (producer) side? (See the sketch after the log below.)

[Resource Configuration] 1 drainer instance with 8c16g, 3 pump instances each with 8c16g
[Attachments: Screenshots/Logs/Monitoring]
Logs:
2023/03/28 11:12:01 server.go:97: [info] clusterID of drainer server is 6575739456418311228
2023/03/28 11:12:01 checkpoint.go:49: [info] initialize kafka type checkpoint binlog commitTS = 440393664168984615 with config &{Db:0xc0001472c0 Schema: Table: ClusterID:6575739456418311228 InitialCommitTS:440338529740390402 CheckPointFile:/data1/deploy/data.drainer/savepoint}
2023/03/28 11:12:01 server.go:291: [info] register success, this drainer's node id is tidb-drainer-kafka:8249
2023/03/28 11:12:02 server.go:342: [info] start to server request on http://172.23.5.156:8249
2023/03/28 11:12:03 merge.go:208: [info] merger add source tidb-pump-01:8250
2023/03/28 11:12:03 merge.go:208: [info] merger add source tidb-pump-03:8250
2023/03/28 11:12:03 merge.go:208: [info] merger add source tidb-pump-02:8250
2023/03/28 11:12:03 pump.go:115: [info] [pump tidb-pump-02:8250] create pull binlogs client
2023/03/28 11:12:03 pump.go:115: [info] [pump tidb-pump-03:8250] create pull binlogs client
2023/03/28 11:12:03 pump.go:115: [info] [pump tidb-pump-01:8250] create pull binlogs client
2023/03/28 11:12:03 client.go:120: [sarama] Initializing new client
2023/03/28 11:12:03 config.go:361: [sarama] Producer.MaxMessageBytes must be smaller than MaxRequestSize; it will be ignored.
2023/03/28 11:12:03 config.go:382: [sarama] ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2023/03/28 11:12:03 client.go:167: [sarama] Successfully initialized new client
2023/03/28 11:12:03 config.go:361: [sarama] Producer.MaxMessageBytes must be smaller than MaxRequestSize; it will be ignored.
2023/03/28 11:12:03 config.go:382: [sarama] ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2023/03/28 11:12:03 client.go:699: [sarama] client/metadata fetching metadata for [tidb] from broker 172.23.5.156:9092
2023/03/28 11:12:03 broker.go:148: [sarama] Connected to broker at 172.23.5.156:9092 (unregistered)
2023/03/28 11:12:33 client.go:726: [sarama] client/metadata got error from broker while fetching metadata: read tcp 172.23.5.156:34540->172.23.5.156:9092: i/o timeout
2023/03/28 11:12:33 broker.go:191: [sarama] Closed connection to broker 172.23.5.156:9092
2023/03/28 11:12:33 client.go:732: [sarama] client/metadata no available broker to send metadata request to
2023/03/28 11:12:33 client.go:508: [sarama] client/brokers resurrecting 1 dead seed brokers
2023/03/28 11:12:33 client.go:690: [sarama] client/metadata retrying after 500ms... (10000 attempts remaining)
2023/03/28 11:12:33 config.go:361: [sarama] Producer.MaxMessageBytes must be smaller than MaxRequestSize; it will be ignored.
2023/03/28 11:12:33 config.go:382: [sarama] ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2023/03/28 11:12:33 client.go:699: [sarama] client/metadata fetching metadata for [tidb] from broker 172.23.5.156:9092
2023/03/28 11:12:33 broker.go:148: [sarama] Connected to broker at 172.23.5.156:9092 (unregistered)
2023/03/28 11:12:34 syncer.go:383: [fatal] /home/jenkins/workspace/build_tidb_binlog_master/go/src/github.com/pingcap/tidb-binlog/drainer/executor/kafka.go:193: fail to push msg to kafka after 30s, check if kafka is up and working
/home/jenkins/workspace/build_tidb_binlog_master/go/src/github.com/pingcap/tidb-binlog/drainer/executor/kafka.go:163:
/home/jenkins/workspace/build_tidb_binlog_master/go/src/github.com/pingcap/tidb-binlog/drainer/executor/kafka.go:134:
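
The repeated sarama warning in this log ("Producer.MaxMessageBytes must be smaller than MaxRequestSize; it will be ignored") indicates that the producer-side message limit is being discarded because it is not below sarama's package-level request-size cap. A minimal Go sketch of how those two sarama settings interact; the values are illustrative only, this is not the drainer's own code, and it makes no claim that the 2.1.x drainer exposes these knobs:

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()

	// Producer.MaxMessageBytes caps a single message on the producer side.
	// Example value: 512 MiB.
	cfg.Producer.MaxMessageBytes = 512 << 20

	// sarama.MaxRequestSize (package-level, default 100 MiB) caps an entire
	// produce request. If Producer.MaxMessageBytes is not smaller than this,
	// Validate() prints the warning seen in the drainer log and ignores the
	// setting, so the larger per-message limit never takes effect.
	sarama.MaxRequestSize = 1 << 30 // example: raise the request cap to 1 GiB

	if err := cfg.Validate(); err != nil {
		log.Fatal(err)
	}
	log.Printf("producer-side message limit in effect: %d bytes", cfg.Producer.MaxMessageBytes)
}

In other words, raising only the broker-side message.max.bytes does not change the producer-side cap; both sides have to allow the larger messages.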

| username: Lucien-卢西恩 | Original post link

It looks like the downstream Kafka parameter configuration has an issue, so the synchronization keeps failing. You can also run telnet from the Drainer node to check whether the downstream Kafka broker is reachable over the network (a small sketch of that check follows).
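
A quick sketch of the same reachability check in Go (equivalent to telnet; the broker address is the one from the log above):

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Same check as `telnet 172.23.5.156 9092` run from the drainer node:
	// can we open a TCP connection to the downstream Kafka broker?
	addr := "172.23.5.156:9092" // broker address taken from the drainer log
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		log.Fatalf("kafka broker %s is not reachable: %v", addr, err)
	}
	defer conn.Close()
	log.Printf("kafka broker %s is reachable", addr)
}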

| username: TiDBer_f3MDYqWL | Original post link

Drainer and Kafka are deployed on the same machine, and my telnet connection is successful.