Tag Archives: NVMe

NVMe over Fabrics: Fibre Channel vs. RDMA

In the last few years, enterprises have been getting hungrier for infrastructure that provides high throughput with low latency and greater performance for hosted applications. Faster networking with high-speed Ethernet, Fibre Channel, and Infiniband offers end-to-end speed varying from 10 Gb/s to 128 Gb/s.

Enterprises are also starting to realize the performance and latency benefits offered by the NVMe protocol with storage arrays featuring high-speed NAND flash and next-generation SSDs.

But a latency bottleneck has arisen in the implementation of shared storage or storage area networking where data needs to be transferred between the host (initiator) and the NVMe-enabled storage array (target) over Ethernet, RDMA technologies (iWARP/RoCE), or Fibre Channel.

The NVMe bottleneck

Latency rises when SCSI commands transported over Fibre Channel must be interpreted and translated into NVMe commands.

NVMe over Fabrics (NVMe-oF) is a network protocol introduced by NVM Express to address this bottleneck. NVMe-oF offers an alternative to SCSI-based storage networking protocols such as iSCSI, allowing enterprises to experience the full benefits offered by NVMe-enabled storage arrays. It acts as a messaging layer between the host computer and target SSDs or a shared storage system over high-speed RDMA or Fibre Channel fabrics.

NVMe-oF supports five technologies: RDMA (RoCE, iWARP), Fibre Channel (FC-NVMe), Infiniband, Future Fabrics, and Intel Omni-Path architecture.

In addition, NVMe-oF allows separation of control traffic and data traffic, which further simplifies traffic management. Also, it takes advantage of the internal parallelism of storage devices and lowers I/O overhead. This enhances overall data access performance to reduce latency.

NVMe-oF offers a performance boost to enterprises that are deploying machine learning applications, big data, and Internet of Things (IoT) analytics, which demand real-time access to stored data without any distance dependencies. 

Performance evaluation of NVMe-oF over Fibre Channel and RDMA

Recent conferences have sparked debate about which transport channel delivers the best performance using the NVMe-oF protocol. Some vendors firmly believe that RDMA is a better option for higher throughput, and many vendors stick to Fibre Channel to gain performance advantages.

Both network fabric technologies have their own benefits and pitfalls.

NVMe over Fabrics using Fibre Channel

NVMe over Fibre Channel relies on two standards: NVMe-oF and FC-NVMe. NVMe-oF is the protocol defined by the NVM Express organization for transporting NVMe traffic over a network fabric, and FC-NVMe is the Fibre Channel-specific transport standard. Together they form the complete solution. A majority of enterprises already use Fibre Channel technology to move their critical data to and from storage arrays.

Fibre Channel was designed specifically for storage devices and systems, and it is the de facto standard for enterprise storage area networking (SAN) solutions. The main advantage of Fibre Channel technology is that it carries concurrent traffic for the existing traditional storage protocol (SCSI) and the new NVMe protocol using the same hardware resources in the storage infrastructure. This co-existence of SCSI and NVMe on Fibre Channel benefits most enterprises because they can enable NVMe operations with just a simple software upgrade.

In March 2018, NVM Express added a new feature called Asymmetric Namespace Access (ANA) to the NVMe-oF protocol. This allows multi-path I/O support among multiple hosts and namespaces.
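As a rough illustration of what ANA-aware multipathing means on the host side, the sketch below picks a path to a namespace by preferring paths reported as optimized and falling back to non-optimized ones. The state names follow the ANA states in the NVMe specification (only three of them are modeled); the `Path` class, the path labels, and `pick_path` are hypothetical and do not correspond to any particular driver.

```python
from dataclasses import dataclass
from typing import List, Optional

# ANA states a controller can report for a namespace (subset, simplified).
OPTIMIZED = "optimized"
NON_OPTIMIZED = "non-optimized"
INACCESSIBLE = "inaccessible"


@dataclass
class Path:
    name: str        # hypothetical label for a host-to-controller path
    ana_state: str   # state reported for the namespace on this path


def pick_path(paths: List[Path]) -> Optional[Path]:
    """Prefer an optimized path, then a non-optimized one, else None."""
    for wanted in (OPTIMIZED, NON_OPTIMIZED):
        for path in paths:
            if path.ana_state == wanted:
                return path
    return None  # no usable path: the I/O has to wait or fail


paths = [
    Path("fc-port-a", NON_OPTIMIZED),
    Path("fc-port-b", OPTIMIZED),
    Path("fc-port-c", INACCESSIBLE),
]
chosen = pick_path(paths)
print(chosen.name if chosen else "no path available")
```

A real multipath implementation would also react to state changes and balance load across several optimized paths; this sketch only shows the preference order.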

Gen 5 and Gen 6 are the newer versions of Fibre Channel. Gen 6 supports transfer speeds of up to 128 Gb/s, the highest in storage networking. Gen 6 also adds monitoring and diagnostics capabilities that give visibility into latency levels and IOPS. NVMe-oF integrates with both of these Fibre Channel versions.

As per a Demartek report, NVMe over Fibre Channel delivers 58% higher IOPS and 34% lower latency than SCSI-based Fibre Channel protocol. Large enterprises favor the use of FC-NVMe for processing critical workloads due to its simplicity, reliability, predictability, and performance.

However, this implementation requires more expertise at the storage networking level, which may add costs.

NVMe over Fabrics using RDMA

RDMA offers an alternative to Fibre Channel. According to WhatIs.com, “Remote Direct Memory Access (RDMA) is a technology that allows computers in a network to exchange data in main memory without involving the processor, cache, or operating system of either computer.”

In other words, RDMA allows applications to bypass the software stack for processing network traffic. Because an RDMA data transfer involves fewer resources, RDMA helps enterprises achieve higher throughput and better performance with lower latency. With RDMA, remote NVMe-enabled storage devices appear to the host almost as if they were local.

RDMA can be enabled in storage networking with protocols like RoCE (RDMA over Converged Ethernet), iWARP (internet wide area RDMA protocol), and Infiniband.

iWARP is, roughly, RDMA over TCP/IP. It uses TCP or Stream Control Transmission Protocol (SCTP) for data transmission.

RoCE enables RDMA over Ethernet. It is sometimes described as Infiniband over Ethernet. There are two versions, RoCE v1 and RoCE v2, which are incompatible with each other because they use different transport mechanisms.

Infiniband is largely supported by vendors offering high-performance computing solutions. It is the fastest RDMA storage networking technology, with data transfer speeds of around 100 Gb/s, compared to the up to 128 Gb/s offered by Gen 6 FC-NVMe. Like FC-NVMe, Infiniband is a lossless transmission protocol that provides quality of service (QoS) mechanisms along with credit-based flow control.

Some vendors consider RDMA to be highly compatible with NVMe use cases because both use the same queueing structure. The main argument for RDMA-based technologies is that command transfer does not require any encapsulation or translation of commands, since both sides use a similar queueing structure and data moves without CPU intervention. In this way RDMA saves CPU cycles, which lowers latency in data transmission from hosts to storage devices.
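To make the "queueing structure" concrete, here is a deliberately simplified sketch of a paired submission and completion queue, the pattern NVMe is built around. It is a toy model under stated assumptions: the `QueuePair` class and its method names are hypothetical, and real queues are fixed-size rings in memory with doorbells and phase bits, none of which is modeled here.

```python
from collections import deque
from typing import List, Tuple


class QueuePair:
    """Toy submission/completion queue pair (no doorbells, no phase bits)."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.submission: deque = deque()
        self.completion: deque = deque()

    def submit(self, command_id: int, opcode: str) -> bool:
        """Host side: enqueue a command unless the queue is full."""
        if len(self.submission) >= self.depth:
            return False
        self.submission.append((command_id, opcode))
        return True

    def process_one(self) -> None:
        """Target side: consume one command and post its completion."""
        if self.submission:
            command_id, _opcode = self.submission.popleft()
            self.completion.append((command_id, "success"))

    def reap(self) -> List[Tuple[int, str]]:
        """Host side: collect every completion posted so far."""
        done = list(self.completion)
        self.completion.clear()
        return done


qp = QueuePair(depth=2)
qp.submit(1, "read")
qp.submit(2, "write")
qp.process_one()
print(qp.reap())
```

Because the host and the target only ever touch their own end of each queue, many such pairs can run in parallel without shared locks, which is the property the vendors' argument relies on.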

Key differentiators

  • With Fibre Channel, enterprises can preserve their existing hardware investment along with taking full advantage of complete NVMe-enabled storage infrastructure. But NVMe-oF implementations based on Infiniband, RDMA (iWARP or RoCE), and Ethernet often require new hardware resources for enterprises.
  • Fibre Channel fabric has a “buffer-to-buffer credit” flow control feature, with which it assures quality of service (QoS) for enterprises by providing lossless network traffic. RDMA over Ethernet (iWARP and RoCE) requires additional protocol support to enable this feature.
  • As compared to other network fabric options, Fibre Channel requires less configuration to initiate network traffic.
  • Fibre Channel fabric can automatically discover and add host initiators and target storage devices along with their properties. RDMA over Ethernet (iWARP and RoCE) and Infiniband lack this capability.
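
The buffer-to-buffer credit idea from the list above can be sketched as a toy model: a sender may transmit a frame only while it holds a credit, and each credit comes back when the receiver frees a buffer. The class and method names are hypothetical, and the model ignores everything else a real Fibre Channel port does.

```python
class CreditedLink:
    """Toy model of credit-based flow control between two ports."""

    def __init__(self, receiver_buffers: int) -> None:
        # The sender starts with one credit per receive buffer.
        self.credits = receiver_buffers
        self.in_flight = 0

    def send_frame(self) -> bool:
        """Send one frame if a credit is available; never overrun."""
        if self.credits == 0:
            return False  # sender waits, so the receiver drops nothing
        self.credits -= 1
        self.in_flight += 1
        return True

    def receiver_ready(self) -> None:
        """Receiver freed a buffer and hands the credit back."""
        if self.in_flight > 0:
            self.in_flight -= 1
            self.credits += 1


link = CreditedLink(receiver_buffers=2)
results = [link.send_frame() for _ in range(3)]
print(results)            # the third send is refused until a credit returns
link.receiver_ready()
print(link.send_frame())  # allowed again once the credit is back
```

The point of the design is that congestion turns into waiting at the sender rather than dropped frames at the receiver, which is what "lossless" means in this context.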


According to a 2016 NVMe ecosystem market sizing report published by G2M Research, the NVMe market will be worth more than $57 billion by 2020, and more than 50% of enterprise servers will be NVMe-enabled by 2020.

NVMe over Fabrics takes the NVMe boost to a network, providing efficient, reliable and highly agile storage networks to be used for advanced use cases like artificial intelligence/machine learning, IoT, real-time analytics, and mission-critical applications.

But enterprises have to evaluate their investment capabilities based on different kinds of NVMe-oF implementations. RDMA offers advantages which are suited for advanced use cases (considering real-time access to storage), but enterprises can also leverage FC-NVMe by transitioning to the Gen 6 version which offers the highest data transfer speed with low latency.

In upcoming years, NVMe integration will be crucial for enterprises that are transitioning their IT infrastructure ecosystem for digital transformation.


The NVMe Transition

The buzzword of the moment in the storage industry is NVMe, otherwise known as Non-Volatile Memory Express. NVMe is a new storage protocol that vastly improves the performance of NAND flash and storage class memory devices. How is it being implemented and are all NVMe-enabled devices equal? And what should IT infrastructure pros consider before making the NVMe transition?


NVMe was developed as a successor to existing SAS and SATA protocols. Both SAS and SATA were designed for the age of hard drives where mechanical head movement masked any storage protocol inefficiencies. Today with NAND flash, and in the future with storage class memory, the bottlenecks of SAS/SATA are more apparent because NAND flash is such a high-performance persistent media. NVMe addresses these performance problems and also implements greater parallel operations. The result is around a 10x improvement in IOPS for NVMe solid-state drives compared to SAS/SATA SSDs.

Adoption models

Storage vendors are starting to roll out products that replace their existing architectures with ones based on NVMe. At the back-end of traditional storage arrays, drives have been connected using SAS. In recent weeks, both Dell EMC and NetApp have announced updates to their product portfolios that replace SAS with NVMe.

Dell EMC released PowerMax, the NVMe successor to VMAX. NetApp introduced AFF A800, which includes NVMe shelves and drives. In both cases, the vendors claim latency improves to around the 200-300µs level, with up to 300GB per second of throughput. Remember that both of these platforms scale out, so these estimates are for systems at their greatest level of scale.

Pure Storage recently announced an update to its FlashArray//X platform with the release of the //X90 model. This offers native NVMe through the use of DirectFlash modules. In fact, the FlashArray family has been NVMe-enabled for some time, which means the transition for customers can be achieved without a forklift upgrade, whereas PowerMax and AFF A800 are new hardware platforms.

NVMe is already included in systems from other vendors such as Tegile, which brought its NVMe-enabled platforms to market in August 2017. Vexata has also implemented both NVMe NAND and Optane in a hardware product specifically designed for NVMe media. The Optane version of the VX-100 platform can deliver latency figures as low as 40µs with 80GB/s of bandwidth in just two controllers, Vexata claims.

End-to-end NVMe

A new term we’re starting to see emerge is end-to-end NVMe. This means that from host to drive, each step of the architecture is delivered with the NVMe protocol. The first step was to enable back-end connectivity through NVMe; the next step is to enable NVMe from host to array.

Existing storage arrays have used either Fibre Channel or iSCSI for host connections. Fibre Channel actually carries the SCSI protocol and, of course, iSCSI is SCSI over TCP/IP, typically on Ethernet. A new protocol, NVMeoF, or NVMe over Fabrics, allows the NVMe protocol to be used on either Fibre Channel or Ethernet networks.

Implementing NVMeoF for Ethernet requires new adaptor cards, whereas NVMeoF for Fibre Channel will work with the latest Gen5 16Gb/s and Gen6 32Gb/s hardware. However, it’s early days for both of these protocols, so don’t expect them to have the maturity of existing storage networking.

Controller bottlenecks

One side effect of having faster storage media is the ability to max out the capability of the storage controller. A single Intel Xeon processor can fully exploit perhaps only four to five NVMe drives, which means storage arrays may not fully exploit the capabilities of the NVMe drive itself.

Vendors have used two techniques to get around this problem. The first is to implement scale-out architectures, with multiple nodes deploying compute and storage; WekaIO and Excelero use this approach. Both vendors offer software-based solutions that deliver scale-out architectures specifically designed for NVMe. WekaIO Matrix is a scale-out file system, whereas Excelero NVMesh is a scale-out block storage solution. In both instances, the software can be implemented in a traditional storage array design or used in a hyperconverged model.

The second approach is to disaggregate the functions of the controller and allow the host to talk directly to the NVMe drives. This is how products from E8 Storage and Apeiron Data work. E8 storage appliances package up to 24 drives in a single shelf, which is directly connected to host servers over 100Gb/s Ethernet or Infiniband. The result is up to 10 million read IOPS and 40GB/s of bandwidth at latency levels close to those of the SSD media itself.

Apeiron’s ADS1000 uses custom FPGA hardware and hardened layer 2 Ethernet to connect hosts directly to NVMe drives using a protocol the vendor calls NVMe over Ethernet. The product offers near line-speed connectivity with only a few microseconds of latency on top of the media itself. This allows a single drive enclosure to deliver around 18 million IOPS with around 72GB/s of sustained throughput.


So what’s the best route to using NVMe technology in your data center? Moving to traditional arrays with an NVMe back-end would provide an easy transition for customers that already use technology from the likes of Dell or NetApp. However, these arrays may not fully benefit from the performance NVMe can offer because of bottlenecks at the controller and delays introduced with existing storage networking.

The disaggregated alternatives offer higher performance at much lower latency, but won’t simply slot into existing environments. Hosts potentially need dedicated adaptor cards, faster network switches, and host drivers.

As with any transition, IT organizations should be reviewing requirements to see where NVMe benefits their needs. If ultra-low latency is important, then this could justify implementing a new storage architecture.

Remember that NVMe will — in the short-term at least — be sold at a premium, so it also makes sense to ensure the benefits of the transition to NVMe justify the cost.


What NVMe over Fabrics Means for Data Storage

NVMe-oF will speed adoption of Non-Volatile Memory Express in the data center.

The last few years have seen Non-Volatile Memory Express (NVMe) completely revolutionize the storage industry. Its wide adoption has driven down flash memory prices. With lower prices and better performance, more enterprises and hyper-scale data centers are migrating to NVMe. The introduction of NVMe over Fabrics (NVMe-oF) promises to accelerate this trend.

The original base specification of NVMe is designed as a protocol for storage on flash memory that uses existing, unmodified PCIe as a local transport. This layered approach is very important. NVMe does not create a new electrical or frame layer; instead it takes advantage of what PCIe already offers. PCIe has a well-known history as a high speed interoperable bus technology. However, while it has those qualities, it’s not well suited for building a large storage fabric or covering distances longer than a few meters. With that limitation, NVMe would be limited to being used as a direct attached storage (DAS) technology, essentially connecting SSDs to the processor inside a server, or perhaps to connect all-flash arrays (AFA) within a rack. NVMe-oF allows things to be taken much further.

Connecting storage nodes over a fabric is important as it allows multiple paths to a given storage resource. It also enables concurrent operations to distributed storage, and a means to manage potential congestion. Further, it allows thousands of drives to be connected in a single pool of storage, since it is no longer limited by the reach of PCIe, but can also take advantage of a fabric technology like RoCE or Fibre Channel.

NVMe-oF describes a means of binding regular NVMe protocol over a chosen fabric technology, a simple abstraction enabling native NVMe commands to be transported over a fabric with minimal processing to map the fabric transport to PCIe and back.  Product demonstrations have shown that the latency penalty for accessing an NVMe SSD over a fabric as opposed to a direct PCIe link can be as low as 10 microseconds.

The layered approach means that a binding specification can be created for any fabric technology, although some fabrics may be better suited for certain applications. Today there are bindings for RDMA (RoCE, iWARP, Infiniband) and Fibre Channel. Work on a binding specification for TCP/IP has also begun.

Different products will use this layered capability in different ways. A simple NVMe-oF target, consisting of an array of NVMe SSDs, may expose all of its drives individually to the NVMe-oF host across the fabric, allowing the host to access and manage each drive individually. Other solutions may take a more integrated approach, using the drives within the array to create one big pool of storage and offering that pool to the NVMe-oF initiator. With this approach, management of drives can be done locally within the array, without requiring the attention of the NVMe-oF initiator or any higher-layer software application. This also allows the NVMe-oF target to implement and offer NVMe protocol features that may not be supported by the drives within the array.

A good example of this is a secure erase feature. A lower-cost drive may not support the feature, but if that drive is put into an NVMe-oF AFA target, the AFA can implement secure erase itself and advertise it to the initiator. The NVMe-oF target then handles the operations on the lower-cost drive so that the feature is properly supported from the perspective of the initiator. This gives implementers a great deal of flexibility to meet customer needs by varying hardware vs. software feature implementation, drive cost, and performance.

The recent plugfest at UNH-IOL focused on testing simple RoCE and Fibre Channel fabrics. In these tests, a single initiator and target pair were connected over a simple two switch fabric. UNH-IOL performed NVMe protocol conformance testing, generating storage traffic  to ensure data could be transferred error-free. Additionally, testing involved inducing network disruptions to ensure the fabric could recover properly and transactions could resume.

In the data center, storage is used to support many different types of applications with an unending variety of workloads. NVMe-oF has been designed to enable flexibility in deployment, offering choices for drive cost and features support, local or remote management, and fabric connectivity. This flexibility will enable wide adoption. No doubt, we’ll continue to see expansion of the NVMe ecosystem.


NVMe and NVMs: What To Expect

NVMe is a relatively new protocol for accessing data stored on solid-state drives. Unlike spinning disks, SSDs store data on some form of non-volatile memory (NVM). This NVM can be either flash (NAND) or a next-generation NVM such as 3D XPoint (3DXP). Note that NVMe, the protocol, is different from NVM, the storage medium.

NVMe (the “e” stands for express) is designed to be leaner and faster than its predecessors, SAS and SATA. It shaves off about 20us from the latency added by the I/O stack. This improvement is negligible compared to the internal latency of a spinning disk (5000us), but it is noticeable compared to the internal latency of a flash SSD (100us), and it would be dramatic compared to the internal latency of a future SSD with 3DXP (less than 10us). So, while flash SSDs are available with SAS/SATA or NVMe interfaces, 3DXP SSDs will be available with NVMe only.
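
The arithmetic behind that comparison can be sketched as follows. The figures are the ones quoted above; the assumption (not stated in the text) is that total latency on the older protocols is simply the media latency plus the roughly 20us that NVMe removes, and that 10us stands in for "less than 10us".

```python
# Figures quoted in the text, in microseconds.
PROTOCOL_SAVING_US = 20
MEDIA_LATENCY_US = {
    "spinning disk": 5000,
    "flash SSD": 100,
    "3DXP SSD": 10,   # the text says "less than 10us"; 10 is an upper bound
}


def saving_share(media_us: float, saving_us: float = PROTOCOL_SAVING_US) -> float:
    """Fraction of the SAS/SATA-era latency that NVMe removes."""
    legacy_total = media_us + saving_us   # assumed: media plus extra stack cost
    return saving_us / legacy_total


for media, latency in MEDIA_LATENCY_US.items():
    print(f"{media}: {saving_share(latency):.1%} of total latency removed")
```

Under these assumptions the saving is well under 1% for a disk, about a sixth for flash, and about two thirds for 3DXP, which is why the protocol matters more as the medium gets faster.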


Besides improving latency, NVMe improves the bandwidth to each SSD. It connects the CPU to the SSDs directly over PCIe, which means there is no need for an intervening HBA, and a greater number of PCIe lanes can be employed.  A SAS lane runs at 12Gb/s, which shrinks to about 1GB/s after overheads. A SATA lane supports half of that. A PCIe lane runs at 1GB/s, and a typical NVMe SSD can be connected to four such lanes, supporting up to 4GB/s. Indeed, NVMe enthusiasts are quick to compare a SATA SSD running at 0.5GB/s and an NVMe SSD running at 3GB/s. That’s 6x higher throughput!
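
A minimal sketch of the lane arithmetic above, using only the per-lane figures quoted in the text (the function and constant names are illustrative):

```python
# Per-lane figures as quoted in the text, in GB/s after overheads.
SAS_LANE_GB_S = 1.0
SATA_LANE_GB_S = SAS_LANE_GB_S / 2   # "half of that"
PCIE_LANE_GB_S = 1.0


def link_bandwidth(lane_gb_s: float, lanes: int) -> float:
    """Peak bandwidth of a link built from identical lanes."""
    return lane_gb_s * lanes


nvme_peak = link_bandwidth(PCIE_LANE_GB_S, lanes=4)
sata_peak = link_bandwidth(SATA_LANE_GB_S, lanes=1)
print(f"NVMe x4 peak: {nvme_peak} GB/s, SATA peak: {sata_peak} GB/s")

# The comparison in the text uses achieved drive speeds, not link peaks.
speedup = 3.0 / 0.5
print(f"Quoted drive-level speedup: {speedup:.0f}x")
```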

But a storage system contains multiple SSDs, typically more than 10. With so many SSDs, drive-level throughput is rarely the bottleneck or determinant of system-level throughput.

System-level performance

In general, the performance of a storage system is bound by one of the following resources:

  • The front-end network connecting applications to storage.
  • The CPUs running the storage software.
  • The I/O interconnect between the CPUs and storage drives or modules. For a system using SAS/SATA drives, this includes PCIe lanes, a SAS HBA, SAS lanes, and perhaps a SAS expander. The total bandwidth of such an interconnect is generally 4-12GB/s. For a system using NVMe, the interconnect includes PCIe lanes and perhaps a PCIe switch. The total bandwidth of this interconnect is generally 8-24GB/s.
  • The storage drives, including the storage medium and the medium controllers.


Which of these four becomes the performance bottleneck depends on the system architecture and the workload such as reads vs. writes and random vs. sequential.
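
One way to picture this is that system throughput is capped by whichever of the four resources runs out first. The sketch below takes the minimum over a set of per-resource limits; the numbers are hypothetical and chosen only to illustrate a CPU-bound flash system, not measured from any product.

```python
from typing import Dict, Tuple


def system_throughput(limits_gb_s: Dict[str, float]) -> Tuple[str, float]:
    """Return the resource that caps throughput and the cap itself."""
    bottleneck = min(limits_gb_s, key=limits_gb_s.get)
    return bottleneck, limits_gb_s[bottleneck]


# Hypothetical limits for one workload on one system, in GB/s.
example = {
    "front-end network": 12.0,
    "CPUs": 5.0,
    "I/O interconnect": 16.0,   # inside the 8-24GB/s NVMe range quoted above
    "drives": 12 * 3.0,         # twelve SSDs at 3GB/s each
}
print(system_throughput(example))
```

Changing the workload changes the limits, so the same system can be CPU bound for small random I/O and interconnect bound for large sequential I/O.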

Traditional storage systems using disk drives are generally drive bound. However, modern systems using flash drives behave quite differently, because flash drives are much faster than disk drives. For most workloads, flash-based systems are CPU bound. Most of the CPU is consumed in providing data services such as high availability, data reduction, and data protection.

Less commonly, a flash-based system might be drive bound. This could happen if the system has a small number of SSDs, if it does not distribute the load across the drives, or if it uses older drives that cannot fully utilize the SAS/SATA interface. Even less commonly, the system might be bound by the interconnect or the front-end network. This could happen for selected workloads, e.g., bursts of sequential I/O with large I/O sizes. Or it could happen if the storage system is designed to provide raw performance at the expense of sophisticated data services.

When a system is CPU bound, the use of NVMe instead of SAS/SATA still might improve performance because the NVMe driver is more CPU efficient than the SCSI driver. But this gain is modest—less than 20%—because most of the CPU is consumed by data services, not protocol drivers.
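
The reasoning here is essentially Amdahl's law: if only the protocol driver's share of CPU time shrinks, the overall gain is limited by everything else. The sketch below works one example. The 20% driver share and 3x driver speedup are assumed values for illustration; the text gives neither figure.

```python
def cpu_bound_gain(driver_share: float, driver_speedup: float) -> float:
    """Throughput gain when only the driver's share of CPU time shrinks."""
    if not 0.0 <= driver_share <= 1.0:
        raise ValueError("driver_share must be between 0 and 1")
    if driver_speedup <= 0:
        raise ValueError("driver_speedup must be positive")
    # CPU cost per I/O after the change, relative to a cost of 1.0 before.
    new_cost = (1.0 - driver_share) + driver_share / driver_speedup
    return 1.0 / new_cost - 1.0


# Assumed: the SCSI driver uses 20% of CPU per I/O and NVMe is 3x cheaper.
print(f"{cpu_bound_gain(0.20, 3.0):.1%}")
```

With these assumed inputs the gain comes out at roughly 15%, in line with the "less than 20%" estimate, and it cannot exceed 25% even if the driver cost dropped to zero.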

Your mileage might vary, and you should ask any storage vendor offering NVMe about what performance gain you should expect on your workload, not their benchmarks.

Fortunately, NVMe can be incorporated into a storage system with straightforward changes in the interconnect layout, without a major change to the storage architecture at large. There is one hitch: NVMe SSDs with dual ports are expensive. But their price is likely to drop to near that of SATA SSDs. So, over time, all flash-based systems will adopt NVMe. Some might adopt it sooner than the others, but it is not a fundamental differentiator.

Overall, using NVMe SSDs in a storage system is like using “high-performance tires” on a car. In most cases, they provide a modest gain in performance and do not require a change to the engine. Nice to have, but not a fundamental differentiator.

Perhaps of greater interest is a recent extension of NVMe known as NVMe over Fabrics (NVMf). NVMf executes I/O across hosts using RDMA-capable networks such as RoCE. While NVMe over PCIe shaves off about 20us relative to SAS, NVMf can shave off about 100us from the roundtrip latency between two hosts relative to protocols such as iSCSI. It also saves the CPU usage spent on TCP/IP processing. This can be particularly beneficial in scale-out systems for transferring data between hosts. It does require RDMA-capable NICs and DCB-capable switches, so it will take some time for mass adoption.

3D XPoint SSDs

While NVMe is nice to have for flash SSDs, it’s critical for 3DXP SSDs.

This is not surprising given that Intel, which led the release of NVMe in 2011, is also the co-creator of 3DXP. The internal latency for 3DXP SSDs is less than 10us, which is far quicker than 100us for flash SSDs. This means that workloads with low queue depth — with few IOs outstanding at any time — will run much faster on 3DXP SSDs than they would on flash SSDs. If one were to use a 3DXP SSD with SAS instead of NVMe, it would more than triple the latency and take a big bite out of the lure of 3DXP.

With access latency of 10us, 3DXP is a more fundamental change than NVMe alone. It introduces a new and differentiated layer in the storage media pyramid—between flash and NVRAM (based on DRAM).

Relative to a flash SSD, a 3DXP SSD will be 10x faster at low queue depth, 10x more endurant in number of writes, but also 10x more expensive per gigabyte. Given the 10x difference in price and performance, it would be beneficial to combine flash and 3DXP SSDs such that flash is used for storing data and 3DXP is used for storing metadata or caching data. This will make hybrid flash+3DXP systems more attractive than pure flash systems.
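
To see why a hybrid can be attractive, consider the blended cost per gigabyte when only a small slice of capacity is 3DXP. The 10x price ratio comes from the text; the 5% slice is an assumed value for illustration only.

```python
def blended_cost_per_gb(xpoint_fraction: float,
                        flash_cost: float = 1.0,
                        xpoint_cost: float = 10.0) -> float:
    """Capacity-weighted cost per GB, in units of the flash price."""
    if not 0.0 <= xpoint_fraction <= 1.0:
        raise ValueError("xpoint_fraction must be between 0 and 1")
    return (1.0 - xpoint_fraction) * flash_cost + xpoint_fraction * xpoint_cost


# Assumed: 5% of capacity is 3DXP, used for metadata and caching.
print(f"{blended_cost_per_gb(0.05):.2f}x the cost of pure flash")
```

Under that assumption the system costs about 1.45x pure flash per gigabyte, far below the 10x of a pure 3DXP system, while the latency-sensitive metadata and cache accesses land on the faster medium.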


Relative to an NVRAM DIMM, a 3DXP SSD is more than 10x slower, far less endurant, and 10x cheaper. Therefore, use cases such as write caching that are the most sensitive to latency and endurance but do not need as large capacity will continue to function optimally by using NVRAM.

Eventually, the full potential of 3DXP also will be realized not as SSDs using NVMe, but as NVDIMMs on the memory channel. This is because the true latency of 3DXP memory is claimed to be less than 1us, and wrapping it up into an SSD appears to increase the latency to 10us.

And so the tick-tock progression of storage protocols and storage media will continue. A new storage protocol is a modest tick. A new storage medium is a tock!

Umesh Maheshwari, Nimble Storage founder and CTO, is responsible for defining the company’s product architecture and developing core technologies. Before founding Nimble, he served as an early architect at Data Domain where he developed parts of their deduplicating file system and WAN-efficient replication. Previously, Umesh was at Zambeel, a maker of scalable file servers, where he developed a clusterized metadata service and automatic network configuration. Prior to Zambeel, he was at InterTrust. Umesh holds a PhD in computer science from MIT. He also holds a BTech in computer science from IIT Delhi, where he received the President’s Gold Medal as the top graduating student.
