Why does the TiFlash process keep restarting after executing the create replica command in TiDB v6.5.0?

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: Why does the TiFlash process keep restarting after executing the create-replica command in TiDB v6.5.0?

| username: tracy0984

【TiDB Usage Environment】Test
【TiDB Version】
TiDB v6.5.0
【Reproduction Path】
ALTER DATABASE test SET TIFLASH REPLICA 1;
【Encountered Problem: Symptoms and Impact】
tiflash_error.log:
$ tail -f tiflash_error.log
[2023/01/04 17:15:29.929 +08:00] [WARN] [StorageConfigParser.cpp:241] ["The configuration \"path\" is deprecated. Check [storage] section for new style."] [thread_id=1]
[2023/01/04 17:18:39.194 +08:00] [ERROR] [BaseDaemon.cpp:377] [########################################] [source=BaseDaemon] [thread_id=27]
[2023/01/04 17:18:39.194 +08:00] [ERROR] [BaseDaemon.cpp:378] ["(from thread 26) Received signal Illegal instruction(4)."] [source=BaseDaemon] [thread_id=27]
[2023/01/04 17:18:39.194 +08:00] [ERROR] [BaseDaemon.cpp:473] ["Illegal operand."] [source=BaseDaemon] [thread_id=27]
[2023/01/04 17:18:45.690 +08:00] [ERROR] [BaseDaemon.cpp:570] ["\n       0x61214f1\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+101848305]\n                \tlibs/libdaemon/src/BaseDaemon.cpp:221\n  0x7fbaed382070\t<unknown symbol> [libc.so.6+266352]\n       0x682303b\tDB::countBytesInFilter(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul> const&) [tiflash+109195323]\n                \tdbms/src/Columns/ColumnsCommon.cpp:66\n       0x16d4d10\tDB::DM::DMVersionFilterBlockInputStream<1>::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+23940368]\n                \tdbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.cpp:373\n       0x6a93168\tDB::DM::BoundedSSTFilesToBlockInputStream::read() [tiflash+111751528]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:307\n       0x16d8a44\tDB::DM::SSTFilesToDTFilesOutputStream<std::__1::shared_ptr<DB::DM::BoundedSSTFilesToBlockInputStream> >::write() [tiflash+23956036]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToDTFilesOutputStream.cpp:200\n       0x6a89c8f\tDB::KVStore::preHandleSSTsToDTFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::DM::FileConvertJobType, DB::TMTContext&) [tiflash+111713423]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:360\n       0x6a89364\tDB::KVStore::preHandleSnapshotToFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::TMTContext&) [tiflash+111711076]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:275\n       0x6ae4666\tPreHandleSnapshot [tiflash+112084582]\n                \tdbms/src/Storages/Transaction/ProxyFFI.cpp:388\n  0x7fbaee7abe28\tengine_store_ffi::_$LT$impl$u20$engine_store_ffi..interfaces..root..DB..EngineStoreServerHelper$GT$::pre_handle_snapshot::h14b79dee0dc709dc [libtiflash_proxy.so+17415720]\n  0x7fbaee7a3899\tengine_store_ffi::observer::pre_handle_snapshot_impl::ha6024895a135c813 [libtiflash_proxy.so+17381529]\n  0x7fbaee797236\tyatp::task::future::RawTask$LT$F$GT$::poll::haa9f3b11a91f3b25 [libtiflash_proxy.so+17330742]\n  
0x7fbaf05a59b3\t_$LT$yatp..task..future..Runner$u20$as$u20$yatp..pool..runner..Runner$GT$::handle::h3ec77eaab16f0540 [libtiflash_proxy.so+48847283]\n  0x7fbaf059820c\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h2dafe5d145aa6fe4 [libtiflash_proxy.so+48792076]\n  0x7fbaf0598ccc\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h8730f42d18399d62 [libtiflash_proxy.so+48794828]\n  0x7fbaefd13375\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+39859061]\n                \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n  0x7fbaed3cc4ca\t<unknown symbol> [libc.so.6+570570]"] [source=BaseDaemon] [thread_id=27]
【Resource Configuration】
【Attachments: Screenshots/Logs/Monitoring】
Data replication process stalled
mysql> SELECT * FROM information_schema.tiflash_replica;
+--------------+----------------------+----------+---------------+-----------------+-----------+----------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_ID | REPLICA_COUNT | LOCATION_LABELS | AVAILABLE | PROGRESS |
+--------------+----------------------+----------+---------------+-----------------+-----------+----------+
| test | GLOBAL_VARIABLES | 107 | 1 | | 0 | 0 |
| test | cluster_config | 109 | 1 | | 0 | 0 |
| test | warehouse | 288 | 1 | | 0 | 0 |
| test | history | 318 | 1 | | 0 | 0 |
| test | new_order | 328 | 1 | | 0 | 0 |
| test | orders | 338 | 1 | | 0 | 0 |
| test | item | 368 | 1 | | 0 | 0 |
| test | t_0008 | 523 | 1 | | 0 | 0 |
| test | t_0010 | 528 | 1 | | 0 | 0 |
| test | t_0001 | 530 | 1 | | 0 | 0 |
| test | t1 | 687 | 1 | | 0 | 0 |
| test | t2 | 692 | 1 | | 0 | 0 |
| test | t3 | 696 | 1 | | 0 | 0 |
| test | t4 | 698 | 1 | | 0 | 0 |
| test | t_test | 703 | 1 | | 0 | 0 |
| test | cluster_config_612 | 716 | 1 | | 0 | 0 |
| test | GLOBAL_VARIABLES_612 | 718 | 1 | | 0 | 0 |
| test | cluster_config_650 | 782 | 1 | | 0 | 0 |
| test | GlOBAL_VARIABLES_650 | 784 | 1 | | 0 | 0 |
| test | stock | 358 | 1 | | 0 | 0 |
| test | customer | 308 | 1 | | 0 | 0 |
| test | order_line | 348 | 1 | | 0 | 0 |
| test | district | 298 | 1 | | 0 | 0.12 |
+--------------+----------------------+----------+---------------+-----------------+-----------+----------+
23 rows in set (0.21 sec)
The TiFlash process keeps restarting:
[tidb@olap-high-01 log]$ ps -ef|grep tiflash
tidb 474164 1 99 17:26 ? 00:00:12 bin/tiflash/tiflash server --config-file conf/tiflash.toml
tidb 475271 452527 0 17:27 pts/0 00:00:00 grep --color=auto tiflash
[tidb@olap-high-01 log]$ ps -ef|grep tiflash
tidb 474164 1 99 17:26 ? 00:00:13 bin/tiflash/tiflash server --config-file conf/tiflash.toml
tidb 475273 452527 0 17:27 pts/0 00:00:00 grep --color=auto tiflash
[tidb@olap-high-01 log]$ ps -ef|grep tiflash
tidb 474164 1 99 17:26 ? 00:00:13 bin/tiflash/tiflash server --config-file conf/tiflash.toml
tidb 475275 452527 0 17:27 pts/0 00:00:00 grep --color=auto tiflash
[tidb@olap-high-01 log]$ ps -ef|grep tiflash
tidb 475285 452527 0 17:27 pts/0 00:00:00 grep --color=auto tiflash
[tidb@olap-high-01 log]$ ps -ef|grep tiflash
tidb 475287 452527 0 17:27 pts/0 00:00:00 grep --color=auto tiflash
[tidb@olap-high-01 log]$ ps -ef|grep tiflash
tidb 475289 452527 0 17:27 pts/0 00:00:00 grep --color=auto tiflash
[tidb@olap-high-01 log]$ ps -ef|grep tiflash
tidb 475290 1 99 17:27 ? 00:00:08 bin/tiflash/tiflash server --config-file conf/tiflash.toml
tidb 476381 452527 0 17:27 pts/0 00:00:00 grep --color=auto tiflash
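The repeated `ps` checks above can be scripted. A minimal sketch, assuming the process is started as `bin/tiflash/tiflash server` as shown above (`pid_of` is a made-up helper, not a TiUP/TiDB tool):

```shell
# pid_of PATTERN: print the first PID whose command line matches
# PATTERN, or "none" if no such process exists.
pid_of() {
  pgrep -f "$1" | head -n1 | grep . || echo none
}

# Sample the TiFlash PID a few times; a changing or missing PID
# means the process is crash-looping and being restarted.
prev=$(pid_of 'bin/tiflash/tiflash server')
for i in 1 2 3; do
  sleep 1
  cur=$(pid_of 'bin/tiflash/tiflash server')
  [ "$cur" = "$prev" ] || echo "restart detected: ${prev} -> ${cur}"
  prev=$cur
done
echo "final pid: $prev"
```

In the session above, the PID going from 474164 to absent to 475290 within a minute is exactly this pattern.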

| username: 裤衩儿飞上天 | Original post link

  1. Did you perform DDL operations on tables under the test database simultaneously?
  2. How about trying it on a single table?
| username: tracy0984 | Original post link

  1. No DDL operations.
  2. Haven’t tried that yet. But TiFlash on the previous version, v6.1.2, didn’t have this issue before the upgrade.
    Can you make sense of that error? I can’t figure out why it’s being reported…
| username: 裤衩儿飞上天 | Original post link

Received an illegal instruction, and I can’t understand the rest either.
My current environment is also 6.1.2, I’ll try to upgrade and see if I can reproduce it.

| username: 裤衩儿飞上天 | Original post link

I haven’t had time to upgrade my environment yet. Can you take a look and see if this is related?

| username: tracy0984 | Original post link

After the upgrade, I found that TiFlash was not working well, so I scaled it down and added 2 new TiFlash nodes. The configurations for both the old and new TiFlash versions are basically default; only the port numbers were changed. It is the newly added TiFlash nodes that are reporting this error. Not much has changed… Since the nodes are newly added, could the issue be unrelated to what you mentioned?
$ cat scale-out.yaml
tiflash_servers:
  - host: 10...11
    tcp_port: 9900
    http_port: 8923
  - host: 10...12
    tcp_port: 9900
    http_port: 8923
| username: 裤衩儿飞上天 | Original post link

If your previous TiFlash nodes were completely scaled down and new nodes were added, it should not be the issue I mentioned above.

Let me summarize your process first:

  1. Upgraded from v6.1.2 to v6.5.0, and found issues with TiFlash after the upgrade (specific issues unknown).
  2. Scaled down the original TiFlash nodes (did you scale down all of them?).
  3. Added 2 new TiFlash nodes (host: 10...11 and host: 10...12).
  4. Executed ALTER DATABASE test SET TIFLASH REPLICA 1; and encountered the above error (did both newly added nodes report related errors?).

I am not sure if my description is complete, please supplement any related questions.

| username: tracy0984 | Original post link

The process you described is correct.

  1. The specific issue is that the TiFlash replica is unavailable, and recreating the replica doesn’t help. It was also found that TiFlash keeps restarting.
  2. All nodes were scaled down.
  3. Both nodes reported errors.

I also scaled down TiFlash and then scaled it up again. When no TiFlash replica was set for the table, TiFlash was fine. However, when I tested adding 1 replica to a single table, one TiFlash node started to restart continuously, while the other TiFlash node did not restart.
The replica of the table is in an unavailable state:

mysql> SELECT * FROM information_schema.tiflash_replica;
+--------------+------------+----------+---------------+-----------------+-----------+----------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_ID | REPLICA_COUNT | LOCATION_LABELS | AVAILABLE | PROGRESS |
+--------------+------------+----------+---------------+-----------------+-----------+----------+
| test         | t1         |      687 |             1 |                 |         0 |        0 |
+--------------+------------+----------+---------------+-----------------+-----------+----------+
1 row in set (0.04 sec)

The restarting TiFlash node reported errors:

[2023/01/06 09:51:19.898 +08:00] [ERROR] [BaseDaemon.cpp:377] [########################################] [source=BaseDaemon] [thread_id=22]
[2023/01/06 09:51:19.898 +08:00] [ERROR] [BaseDaemon.cpp:378] ["(from thread 21) Received signal Illegal instruction(4)."] [source=BaseDaemon] [thread_id=22]
[2023/01/06 09:51:19.899 +08:00] [ERROR] [BaseDaemon.cpp:473] ["Illegal operand."] [source=BaseDaemon] [thread_id=22]
[2023/01/06 09:51:23.441 +08:00] [ERROR] [BaseDaemon.cpp:570] ["\n       0x61214f1\tfaultSignalHandler(int, siginfo_t*, void*) [tiflash+101848305]\n                \tlibs/libdaemon/src/BaseDaemon.cpp:221\n  0x7fedf85d2070\t<unknown symbol> [libc.so.6+266352]\n       0x682303b\tDB::countBytesInFilter(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul> const&) [tiflash+109195323]\n                \tdbms/src/Columns/ColumnsCommon.cpp:66\n       0x16d4d10\tDB::DM::DMVersionFilterBlockInputStream<1>::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool) [tiflash+23940368]\n                \tdbms/src/Storages/DeltaMerge/DMVersionFilterBlockInputStream.cpp:373\n       0x6a93168\tDB::DM::BoundedSSTFilesToBlockInputStream::read() [tiflash+111751528]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp:307\n       0x16d8a44\tDB::DM::SSTFilesToDTFilesOutputStream<std::__1::shared_ptr<DB::DM::BoundedSSTFilesToBlockInputStream> >::write() [tiflash+23956036]\n                \tdbms/src/Storages/DeltaMerge/SSTFilesToDTFilesOutputStream.cpp:200\n       0x6a89c8f\tDB::KVStore::preHandleSSTsToDTFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::DM::FileConvertJobType, DB::TMTContext&) [tiflash+111713423]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:360\n       0x6a89364\tDB::KVStore::preHandleSnapshotToFiles(std::__1::shared_ptr<DB::Region>, DB::SSTViewVec, unsigned long, unsigned long, DB::TMTContext&) [tiflash+111711076]\n                \tdbms/src/Storages/Transaction/ApplySnapshot.cpp:275\n       0x6ae4666\tPreHandleSnapshot [tiflash+112084582]\n                \tdbms/src/Storages/Transaction/ProxyFFI.cpp:388\n  0x7fedf99fbe28\tengine_store_ffi::_$LT$impl$u20$engine_store_ffi..interfaces..root..DB..EngineStoreServerHelper$GT$::pre_handle_snapshot::h14b79dee0dc709dc [libtiflash_proxy.so+17415720]\n  
0x7fedf99f3899\tengine_store_ffi::observer::pre_handle_snapshot_impl::ha6024895a135c813 [libtiflash_proxy.so+17381529]\n  0x7fedf99e7236\tyatp::task::future::RawTask$LT$F$GT$::poll::haa9f3b11a91f3b25 [libtiflash_proxy.so+17330742]\n  0x7fedfb7f59b3\t_$LT$yatp..task..future..Runner$u20$as$u20$yatp..pool..runner..Runner$GT$::handle::h3ec77eaab16f0540 [libtiflash_proxy.so+48847283]\n  0x7fedfb7e820c\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h2dafe5d145aa6fe4 [libtiflash_proxy.so+48792076]\n  0x7fedfb7e8ccc\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h8730f42d18399d62 [libtiflash_proxy.so+48794828]\n  0x7fedfaf63375\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+39859061]\n                \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n  0x7fedf861c4ca\t<unknown symbol> [libc.so.6+570570]"] [source=BaseDaemon] [thread_id=22]

I checked the data directory of the TiFlash nodes and found that the node that did not restart did not have the folder for this table:

$ ll /db/tidb-data/tiflash-9900/data/
total 0
drwxr-xr-x 2 tidb tidb 6 Jan  6 09:33 system

The node that keeps restarting has:

$ ll /db/tidb-data/tiflash-9900/data/
total 0
drwxr-xr-x 2 tidb tidb  6 Jan  6 09:33 system
drwxr-xr-x 3 tidb tidb 20 Jan  6 09:39 t_687

Current cluster status:

| username: flow-PingCAP | Original post link

Please run this command on the machine where TiFlash is deployed to check if your CPU supports AVX2 instructions:

cat /proc/cpuinfo | grep avx2

Paste the output text here.
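The check above can also be made self-contained so it returns a clear yes/no instead of relying on reading grep output. A sketch to run on each TiFlash host (`has_flag` is a made-up helper name):

```shell
# has_flag FLAG FLAGS: print yes/no depending on whether FLAG appears
# as a whole word in the space-separated FLAGS string.
has_flag() {
  case " $2 " in
    *" $1 "*) echo yes ;;
    *)        echo no  ;;
  esac
}

# Read the CPU feature flags of the first core and test for AVX2.
flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2-)
echo "avx2: $(has_flag avx2 "$flags")"
```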

| username: 裤衩儿飞上天 | Original post link

Try setting a table with two replicas and see if both nodes report errors. I’m currently upgrading to see if I can reproduce the issue.

| username: dba-kit | Original post link

I also upgraded to 6.5.0 last week, and it went smoothly. During the upgrade process, TiFlash will automatically upgrade to this version. In my case, it took just over ten minutes to fully convert 500GB of TiFlash, without any manual intervention needed.

| username: 裤衩儿飞上天 | Original post link

After my upgrade, I did not encounter the issue you mentioned. Please follow this reply to check whether your CPU supports AVX2 instructions.

| username: tidb狂热爱好者 | Original post link

Delete these nodes. Also, for the TiFlash IPs, you only need to add one.

| username: tracy0984 | Original post link

It is not supported, right?

| username: 裤衩儿飞上天 | Original post link

It should not be supported.

I thought of this command yesterday, but I was in a hurry to go home after work and forgot to reply. Today, I got a bit off track. :crazy_face: :crazy_face: :crazy_face:

| username: tracy0984 | Original post link

Which nodes should be deleted? Adding 2 TiFlash nodes shouldn’t be a problem, right? I’m using version 6.1.2 with the same configuration and it works fine…

| username: tracy0984 | Original post link

If AVX2 is not supported, does that mean TiFlash cannot be used?

| username: 裤衩儿飞上天 | Original post link

The error message in your report is: Illegal instruction ["(from thread 26) Received signal Illegal instruction(4)."]

| username: flow-PingCAP | Original post link

Check what instructions your CPU supports:
cat /proc/cpuinfo | grep flags

| username: tracy0984 | Original post link

cat /proc/cpuinfo | grep flags

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault pti tpr_shadow vnmi flexpriority ept vpid tsc_adjust xsaveopt arat umip arch_capabilities
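The flags line above lists `avx` but not `avx2`, which is consistent with the Illegal instruction crash: recent TiFlash x86-64 binaries assume AVX2 support. A sketch that summarizes the relevant extensions from such a flags line (`list_exts` is a made-up helper, and the exact set of extensions TiFlash requires may vary by version and build):

```shell
# list_exts FLAGS: for each SIMD extension of interest, print whether
# it appears as a whole word in the space-separated FLAGS string.
list_exts() {
  for ext in sse4_2 avx avx2; do
    case " $1 " in
      *" $ext "*) echo "$ext: yes" ;;
      *)          echo "$ext: no"  ;;
    esac
  done
}

# Apply it to this machine's first-core flags line.
list_exts "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2-)"
```

Run against the flags pasted above, this would report `avx: yes` but `avx2: no`, pointing at the CPU (here a VM, note the `hypervisor` flag) as the cause.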