TiFlash Restarts Abnormally on TiDB 6.1.0

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: TiDB 6.1.0 版本 TiFlash 异常重启

| username: Hacker_ojLJ8Ndr

We upgraded the production environment to 6.1.0 the day before and enabled dynamic pruning mode. During that night's batch run, every TiFlash node reported memory allocation errors and the servers restarted. After the restart, we disabled dynamic pruning mode. TiKV and TiFlash are co-deployed on the same hosts. Today there were no memory allocation errors, but the error below occurred on a single TiFlash node. That node has no NUMA binding configured, whereas the other co-deployed nodes do.

【TiDB Version】
6.1.0
【Issue Encountered】
During the nightly batch run, TiFlash repeatedly restarted with the following error:
tiflash_stderr.log:
Logging debug to /data01/deploy/log/tiflash.log
Logging errors to /data01/deploy/log/tiflash_error.log
deprecated configuration, log-file has been moved to log.file.filename
override log.file.filename with log-file, "/data01/deploy/log/tiflash_tikv.log"
libc++abi: terminate_handler unexpectedly threw an exception
Logging debug to /data01/deploy/log/tiflash.log
Logging errors to /data01/deploy/log/tiflash_error.log
deprecated configuration, log-file has been moved to log.file.filename
override log.file.filename with log-file, "/data01/deploy/log/tiflash_tikv.log"
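
To get a rough count of how often the process is restarting, the stderr excerpt above can be scanned mechanically. The following is a minimal Python sketch, assuming that each TiFlash start prints the `Logging debug to` banner exactly once (an assumption drawn from this excerpt, not a documented guarantee). The function name `summarize_stderr` is hypothetical, and the sample lines are copied from the excerpt.

```python
CRASH_MARKER = "libc++abi: terminate_handler unexpectedly threw an exception"
START_MARKER = "Logging debug to"  # assumed to appear once per process start


def summarize_stderr(lines):
    """Count process starts and crash messages in tiflash_stderr.log lines."""
    starts = 0
    crashes = 0
    for line in lines:
        if line.startswith(START_MARKER):
            starts += 1
        elif CRASH_MARKER in line:
            crashes += 1
    return {"starts": starts, "crashes": crashes}


# Sample lines taken from the stderr excerpt in this post.
sample = [
    "Logging debug to /data01/deploy/log/tiflash.log",
    "Logging errors to /data01/deploy/log/tiflash_error.log",
    "libc++abi: terminate_handler unexpectedly threw an exception",
    "Logging debug to /data01/deploy/log/tiflash.log",
    "Logging errors to /data01/deploy/log/tiflash_error.log",
]
print(summarize_stderr(sample))
```

A start count that keeps growing while the crash marker appears between banners is consistent with the restart loop described here, but it does not identify the cause; that still requires tiflash.log and tiflash_error.log.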

tiflash.log:

【Solution】
After forcibly scaling in the problematic node and scaling it out again, the error did not recur. NUMA binding was also added to the node.
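
For readers who want to reproduce this workaround, the sketch below assembles the TiUP commands and a scale-out topology with a `numa_node` entry. It is an illustration only: the cluster name, host, and port are hypothetical placeholders that do not come from this thread, and the exact steps used on the original cluster were not posted. The NUMA node value `"0"` follows the binding mentioned later in the thread.

```python
# All concrete values are hypothetical placeholders, not taken from the thread.
CLUSTER = "prod-cluster"
HOST = "10.0.1.5"
TCP_PORT = 9000      # TiUP identifies a TiFlash node as host:tcp_port
NUMA_NODE = "0"      # the thread states that TiFlash is bound to node 0


def scale_out_topology(host, numa_node):
    """Return a minimal scale-out topology that pins TiFlash to one NUMA node."""
    return (
        "tiflash_servers:\n"
        f"  - host: {host}\n"
        f'    numa_node: "{numa_node}"\n'
    )


def workaround_commands(cluster, host, tcp_port, topology_file="scale-out.yaml"):
    """Return the TiUP commands for a forced scale-in followed by a scale-out."""
    return [
        f"tiup cluster scale-in {cluster} --node {host}:{tcp_port} --force",
        f"tiup cluster scale-out {cluster} {topology_file}",
    ]


print(scale_out_topology(HOST, NUMA_NODE))
for command in workaround_commands(CLUSTER, HOST, TCP_PORT):
    print(command)
```

Note that `--force` removes the node without waiting for its data to be migrated, so it is best treated as a last resort, and `numa_node` only takes effect if `numactl` is installed on the target host.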

| username: yilong | Original post link

  1. What resources do the hosts running both TiKV and TiFlash have? How much CPU and memory?
  2. Please upload the complete tiflash.log file.

| username: Hacker_ojLJ8Ndr | Original post link

The Clinic collection failed to upload, so here are the CPU and memory details:

"cpu": {
  "vendor": "GenuineIntel",
  "model": "Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz",
  "speed": 2300,
  "cache": 22528,
  "cpus": 2,
  "cores": 32,
  "threads": 64,
  "governor": "powersave"
}
"memory": {
  "type": "DDR4",
  "speed": 3200,
  "size": 327680,
  "swap": 65535
}

TiKV parameters:

log.file.max-days: 180
raftstore.raft-base-tick-interval: 2s
storage.block-cache.capacity: 100GB

NUMA binding: TiKV is bound to NUMA node 1, and TiFlash is bound to NUMA node 0.
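
As a rough sanity check on these numbers, the sketch below works out how much memory each NUMA node would have and how much of the TiKV node the block cache takes up. It rests on several assumptions that the thread does not confirm: that `memory.size` is in MiB, that there is one NUMA node per socket with memory split evenly, that the binding restricts memory as well as CPU, and that `100GB` is read as 100 GiB.

```python
MIB_PER_GIB = 1024

# Values copied from the post above; units are assumptions (see lead-in).
TOTAL_MEMORY_MIB = 327680    # "memory.size", assumed to be MiB
SOCKETS = 2                  # "cpus", assumed to be one NUMA node each
TIKV_BLOCK_CACHE_GIB = 100   # storage.block-cache.capacity, treated as GiB


def per_node_memory_gib(total_mib, numa_nodes):
    """Memory per NUMA node, assuming an even split across nodes."""
    return total_mib / MIB_PER_GIB / numa_nodes


node_gib = per_node_memory_gib(TOTAL_MEMORY_MIB, SOCKETS)
tikv_headroom_gib = node_gib - TIKV_BLOCK_CACHE_GIB

print(f"memory per NUMA node: {node_gib:.0f} GiB")
print(f"TiKV node memory left after block cache: {tikv_headroom_gib:.0f} GiB")
```

Under these assumptions, a bound TiFlash instance has its own node's memory to itself, while on a host without NUMA binding TiKV and TiFlash allocate from the same pool and can compete for it during heavy batch queries.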

The TiFlash log is too large to upload in full, so I have attached a short excerpt: tiflash.txt (920.7 KB)

| username: Hacker_ojLJ8Ndr | Original post link

Clinic would not accept the logs from the 24th, so this upload is from the 23rd.

| username: yilong | Original post link

This looks like the same problem as in this thread, right? 升级6.1后,TiFlash服务异常 - TiDB 的问答社区 (TiFlash service abnormal after upgrading to 6.1 - TiDB Q&A community)

| username: Hacker_ojLJ8Ndr | Original post link

Yes.

| username: yilong | Original post link

Please follow up in that thread, thanks.

| username: Hacker_ojLJ8Ndr | Original post link

Okay~

| username: tidb狂热爱好者 | Original post link

This can't be resolved here; wait for an official response.
