Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.
Original topic: TiKV节点使用nvme本地磁盘需要做raid吗 (Do TiKV nodes using local NVMe disks need RAID?)
[TiDB Usage Environment] Production Environment
[Encountered Issue: Problem Phenomenon and Impact]
For a TiKV cluster deployed in a production environment, when using multiple NVMe local disks, is it necessary to implement RAID protection?
Currently, there are two main RAID solutions for NVMe disks:
1. Hardware RAID card, with limited brand options
2. Using Intel CPU’s VROC module for software RAID
If RAID is not used and we rely only on TiKV’s multi-replica mechanism, can bad disk blocks be detected quickly?
A cluster with multiple replicas can only detect data inconsistency issues, but it cannot determine whether the inconsistency is caused by bad disk sectors. I think you should consider two options:
- Use RAID, which will incur performance loss and increased costs.
- Do not use RAID, but perform regular disk checks. If bad sectors are detected, migrate the TiKV node to another server.
Put simply, my understanding is that TiDB cannot give early warning of a disk failure. When a disk does fail, the TiKV node should log error messages, and the related monitoring will show it as well. Multiple replicas ensure that data is not lost unless all replicas fail. RAID adds an extra layer of protection at the physical disk level, which is a good thing to have.
When using multiple NVMe local disks, RAID protection is needed.
The vendor suggests not using it. If you are concerned, you can either (1) change to 5 replicas or (2) use software RAID at the operating system level.
There’s no need to consider hardware RAID for NVMe.
If you don’t use RAID, you can install a disk information query tool and regularly inspect the disks for errors, lifespan, etc.
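A rough sketch of such a periodic check, in Python. It assumes nvme-cli is installed and that `nvme smart-log <device> -o json` emits keys named `critical_warning`, `media_errors`, `percent_used`, `avail_spare` and `spare_thresh`. Key names and value types can differ between nvme-cli versions, so check them against your own output first. The function names and the 80% wear threshold are made up for illustration.

```python
import json
import subprocess


def assess_smart_log(log, max_percent_used=80):
    """Return a list of warnings for one NVMe SMART log dict.

    The key names are assumptions based on typical nvme-cli JSON output.
    """
    warnings = []

    # Some nvme-cli versions report this field as an object instead of an
    # integer; only the plain integer form is handled in this sketch.
    critical = log.get("critical_warning", 0)
    if isinstance(critical, int) and critical != 0:
        warnings.append("critical_warning flag is set")

    if log.get("media_errors", 0) > 0:
        warnings.append(f"media_errors = {log['media_errors']}")

    if log.get("percent_used", 0) >= max_percent_used:
        warnings.append(f"percent_used = {log['percent_used']}%")

    spare = log.get("avail_spare")
    thresh = log.get("spare_thresh")
    if spare is not None and thresh is not None and spare <= thresh:
        warnings.append(f"avail_spare {spare}% is at or below threshold {thresh}%")

    return warnings


def read_smart_log(device):
    """Run nvme-cli and parse its JSON output (usually needs root)."""
    result = subprocess.run(
        ["nvme", "smart-log", device, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Something like `assess_smart_log(read_smart_log("/dev/nvme0"))` could then be run from cron on every TiKV host, with any non-empty result sent to your alerting system.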
What I understand is that there are currently no good RAID cards for NVMe. In general, NVMe in production environments does not use RAID; instead, TiDB relies on its own three replicas and labeling for high availability and disaster recovery.
Personally, I feel that it’s better to use RAID. Sacrificing a bit of performance makes it easier to maintain in case of simple hard drive failures (you can directly replace the hard drive without having to take down the node). Not using RAID and relying solely on the database’s multiple replicas is also an option, but it would require more rigorous bad sector detection and maintenance would be more troublesome.
For an online TiDB cluster, NVMe drives do not need to be configured with RAID. High availability and disaster recovery can be achieved through TiDB’s own three-replica mechanism and labeling.
Based on our experience, not only NVMe but also other types of SSDs and even regular hard drives in both testing and production environments rely on the cluster’s three-replica mechanism to ensure data safety. To maximize space utilization, RAID is not configured.
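To make the space argument concrete, here is a back-of-the-envelope calculation. The disk count and size are made-up numbers, and the RAID efficiencies are the textbook ones (RAID 10 keeps half of the raw capacity, a 4-disk RAID 5 group keeps three quarters). It ignores filesystem and RocksDB overhead.

```python
def usable_tb(raw_tb, replicas=3, raid_efficiency=1.0):
    """Logical capacity left after RAID overhead and TiKV replication."""
    return raw_tb * raid_efficiency / replicas


# Hypothetical cluster: 12 NVMe disks of 4 TB each, 48 TB raw in total.
raw = 12 * 4.0

no_raid = usable_tb(raw)                        # 3 replicas only
raid10 = usable_tb(raw, raid_efficiency=0.5)    # RAID 10 under 3 replicas
raid5 = usable_tb(raw, raid_efficiency=3 / 4)   # 4-disk RAID 5 groups under 3 replicas

print(no_raid, raid10, raid5)  # prints: 16.0 8.0 12.0
```

With three replicas already in place, RAID 10 underneath leaves one sixth of the raw capacity for data, which is the cost being avoided here.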
Of course, whether or not to use RAID ultimately depends on your specific considerations.
There’s no need; high availability and disaster recovery can be ensured through multiple replicas.
The biggest problem with not using RAID is that replacing a disk requires coordination with the operating system and the software on it. If the software and the hardware are maintained by separate teams, not using RAID can be quite painful.
Based on our experience, using an Intel + software RAID solution results in a higher probability of disk failure. Since implementing this solution, the failure rate of NVMe disks with software RAID has been several times higher than with other solutions.
Nowadays, SSDs are already very fast, so we highly recommend a RAID + SSD solution.
It depends on the needs of the business and operations. From a business perspective: Raft keeps the cluster available as long as a majority of replicas are alive, but it cannot guarantee that the business sees no impact when a minority of replicas fail. That impact lasts on the order of tens of seconds, so it comes down to whether the business can accept it. From an operations perspective: the advantage of RAID is that it reduces operational pressure when a single disk fails. Personally, I recommend hardware RAID if cost is not a major concern, as it saves a lot of hassle.
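The majority rule mentioned here is simple arithmetic, sketched below for the 3-replica and 5-replica settings that came up in this thread. This is only the standard Raft quorum formula, not TiKV code, and the function names are illustrative.

```python
def quorum(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas can be lost while a majority is still alive."""
    return replicas - quorum(replicas)


for n in (3, 5):
    print(n, quorum(n), tolerated_failures(n))
# prints:
# 3 2 1
# 5 3 2
```

So 3 replicas survive the loss of one copy of a Region and 5 replicas survive the loss of two, which is why moving to 5 replicas was suggested as an alternative to RAID.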
If resources are sufficient, RAID is necessary. Although TiKV has 3 replicas, there will still be some impact if one fails. In addition, the scale-in and scale-out needed to replace the node can cause unnecessary trouble.
I asked the supplier, and currently, NVMe does not have a RAID card; it can only do software RAID.
The multiple replicas of TiKV can ensure that the system continues to run when a node fails. However, if the written data silently becomes corrupted, TiKV will not be aware of it.
General RAID cards can only manage SATA/SAS hard drives. NVMe drives are connected directly to the CPU/PCH and cannot be managed by such cards, which raises the question of how to support NVMe with hardware RAID. Both DELL and Inspur have provided solutions, and Intel has introduced VROC, a software RAID in which the CPU handles the RAID work directly. This allows NVMe drives to be put into a RAID array.
Ordinary software RAID cannot be used as a system disk, but Intel’s can be used as a system disk. However, it doesn’t offer much advantage on servers.