TiFlash Automatic Shutdown

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: tiflash自动关闭

| username: 扬仔_tidb

[TiDB Usage Environment] Production Environment
[TiDB Version]
[Reproduction Path] What operations were performed when the issue occurred
[Encountered Issue: Problem Phenomenon and Impact]
The cluster was synchronized well a few days ago, and today the traffic was switched over. Just found that one TiFlash node is abnormally down. It won’t start up, and it automatically stops after starting.
The logs are as follows:
[2023/11/22 20:11:51.331 +08:00] [DEBUG] [Segment.cpp:1923] [“Finish segment getReadInfo”] [source=“table_id=1267 segment_id=26930 epoch=57”]

[thread_id=24]

[2023/11/22 20:11:51.331 +08:00] [DEBUG] [Segment.cpp:1930] [“Segment updated delta index”] [source=“table_id=1267 segment_id=26930 epoch=57”

] [thread_id=24]

[2023/11/22 20:11:51.336 +08:00] [DEBUG] [Segment.cpp:2098] ["Finish segment ensurePlace, read_ranges={[-9223372036854775808,9223372036854775

807)} placed_items=1 shared_delta_index=<placed_rows=0 placed_deletes=0 tree_entries=0 tree_inserts=0 tree_deletes=0> my_delta_index=<placed_

rows=63805 placed_deletes=0 tree_entries=63713 tree_inserts=63713 tree_deletes=0>"] [source=“table_id=1267 segment_id=12422 epoch=68”] [threa

d_id=30]

[2023/11/22 20:11:51.337 +08:00] [DEBUG] [Segment.cpp:1923] [“Finish segment getReadInfo”] [source=“table_id=1267 segment_id=12422 epoch=68”]

[thread_id=30]

[2023/11/22 20:11:51.337 +08:00] [DEBUG] [Segment.cpp:1930] [“Segment updated delta index”] [source=“table_id=1267 segment_id=12422 epoch=68”

] [thread_id=30]

[2023/11/22 20:11:51.339 +08:00] [DEBUG] [Segment.cpp:2098] ["Finish segment ensurePlace, read_ranges={[-9223372036854775808,9223372036854775

807)} placed_items=1 shared_delta_index=<placed_rows=0 placed_deletes=0 tree_entries=0 tree_inserts=0 tree_deletes=0> my_delta_index=<placed_

rows=55684 placed_deletes=0 tree_entries=55574 tree_inserts=55574 tree_deletes=0>"] [source=“table_id=1267 segment_id=26951 epoch=12”] [threa

d_id=38]

[2023/11/22 20:11:51.339 +08:00] [DEBUG] [Segment.cpp:1923] [“Finish segment getReadInfo”] [source=“table_id=1267 segment_id=26951 epoch=12”]

[thread_id=38]

[2023/11/22 20:11:51.339 +08:00] [DEBUG] [Segment.cpp:1930] [“Segment updated delta index”] [source=“table_id=1267 segment_id=26951 epoch=12”

] [thread_id=38]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=3]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=6]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=9]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=11]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=14]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=16]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=8]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=4]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=10]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=2]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=12]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=13]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=7]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=15]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=5]

[2023/11/22 20:11:52.299 +08:00] [INFO] [SegmentReader.cpp:94] [“Pop fail, stop=true”] [source=SegmentReader] [thread_id=17]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.299 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.300 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.300 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.300 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.300 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.300 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.300 +08:00] [DEBUG] [SegmentReader.cpp:45] [“Stop begin”] [source=SegmentReader] [thread_id=37]

[2023/11/22 20:11:52.300 +08:00] [DEBUG] [SegmentReader.cpp:47] [“Stop end”] [source=SegmentReader] [thread_id=37]

| username: 扬仔_tidb | Original post link

There are a large number of entries in tiflash_error.log:

[2023/11/22 20:21:50.686 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:22:08.426 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:22:26.172 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:22:43.925 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:23:01.676 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:23:19.423 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:23:37.171 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:23:54.673 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:24:12.450 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:24:29.919 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:24:47.442 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

[2023/11/22 20:25:04.924 +08:00] [WARN] [StorageConfigParser.cpp:241] [“The configuration "path" is deprecated. Check [storage] section for new style.”] [thread_id=1]

| username: Billmay表妹 | Original post link

How many TiFlash nodes?

| username: TiDB_C罗 | Original post link

There are no error-level logs. Check the system logs then.

| username: Billmay表妹 | Original post link

The configuration item “path” is deprecated. Please check the [storage] section and adopt the new style.

| username: 扬仔_tidb | Original post link

What does the new format look like? It’s not mentioned in the documentation.

| username: 扬仔_tidb | Original post link

Nov 22 20:57:12 tiflash-5 systemd: tiflash-9000.service holdoff time over, scheduling restart.

Nov 22 20:57:12 tiflash-5 systemd: Stopped tiflash service.

Nov 22 20:57:12 tiflash-5 systemd: Started tiflash service.

Nov 22 20:57:12 tiflash-5 bash: sync …

Nov 22 20:57:12 tiflash-5 bash: real#0110m0.061s

Nov 22 20:57:12 tiflash-5 bash: user#0110m0.000s

Nov 22 20:57:12 tiflash-5 bash: sys#0110m0.038s

Nov 22 20:57:12 tiflash-5 bash: ok

Nov 22 20:57:15 tiflash-5 systemd: tiflash-9000.service: main process exited, code=exited, status=1/FAILURE

Nov 22 20:57:15 tiflash-5 systemd: Unit tiflash-9000.service entered failed state.

Nov 22 20:57:15 tiflash-5 systemd: tiflash-9000.service failed.

| username: 扬仔_tidb | Original post link

There are 5 TiFlash nodes. The other 3 were deployed during the initial installation. The 4th and 5th nodes were added during a scale-out expansion. The 5th node inexplicably went down today. The system version is CentOS 7.4, 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux.

| username: ShawnYan | Original post link

What version of TiFlash are you using? Is only this node having issues? Is there anything special about table 1267?

| username: Kongdom | Original post link

I searched for several similar cases, all of which were caused by similar upgrade operations. I suggest posting the TiFlash configuration file to take a look; it does seem likely to be a configuration issue.

| username: 扬仔_tidb | Original post link

Version 6.5. This table was merged from 64 sub-tables by upstream Otter. It was synchronized in the same way in version 5.
If there’s a difference, I found that the pk_type of this table is different; several tables have NONCLUSTERED.

| username: WalterWj | Original post link

Does this log seem unlikely to cause TiFlash to go down? Are there any other logs? Did TiFlash run out of memory (OOM)?

| username: 扬仔_tidb | Original post link

  • host: omitted
    ssh_port: 22
    tcp_port: 9000
    http_port: 8123
    flash_service_port: 3930
    flash_proxy_port: 20170
    flash_proxy_status_port: 20292
    metrics_port: 8234
    deploy_dir: /opt/tidb-deploy/tiflash-9000
    data_dir: /data/tidb-data/tiflash-9000
    log_dir: /opt/tidb-deploy/tiflash-9000/log
    arch: amd64
    os: linux
| username: 扬仔_tidb | Original post link

There are many core logs,
gdb ./core.31493 shows the following

Failed to read a valid object file image from memory.

Core was generated by `bin/tiflash/tiflash server --config-file conf/tiflash.toml’.

Program terminated with signal 6, Aborted.

#0 0x00007f1c166b4387 in ?? ()

| username: 芮芮是产品 | Original post link

It seems that the memory of your machine is faulty. Try using an eraser to clean the contacts, remove the node, and then add it back.

| username: 扬仔_tidb | Original post link

The machine has been restarted many times. Eraser?
Did you use the scale-in command? After executing Pending Offline, it keeps showing this status, and the remote TiFlash hasn’t started. It shouldn’t be able to go down like this, right?

| username: 我是人间不清醒 | Original post link

Are the nodes virtual machines? If they are, start with the same configuration and remove the problematic TiFlash node after logging. If they are physical machines and cannot start, remove them, clean up, and rejoin the cluster. Can the four nodes support production the next day?

| username: Billmay表妹 | Original post link

What is the subversion, is it also 6.5.5? TiDB is 6.5.5, is tiflash also 6.5.5?

| username: 扬仔_tidb | Original post link

There are still 4 machines that should be able to handle it.
The machines are Tencent S6 model with Intel Ice Lake (2.7GHz/3.3GHz), 16 cores/32GB memory.
However, a few other machines also have individual core files, so I’m very worried that other machines might suddenly crash during the day.

| username: Billmay表妹 | Original post link

The system version is too low.