
Facebook Debuts Data Center Fabric Aggregator


At the Open Compute Project Summit in San Jose on Tuesday, Facebook engineers showcased their latest disaggregated networking design, taking the wraps off new data center hardware. Microsoft, meanwhile, announced an effort to disaggregate solid-state drives to make them more flexible for the cloud.

The Fabric Aggregator, built on Facebook’s Wedge 100 top-of-rack switch and Facebook Open Switching System (FBOSS) software, is designed as a distributed network system to accommodate the social media giant’s rapid growth. The company is planning to build its twelfth data center and is expanding one in Nebraska from two buildings to six.

“We had tremendous growth of east-west traffic,” Sree Sankar, technical product manager at Facebook, said, referring to the traffic flowing between buildings in a Facebook data center region. “We needed a change in the aggregation tier. We were already using the largest chassis switch.”

The company needed a system that would provide power efficiency and have a flexible design, she said. Engineers used Wedge 100 and FBOSS as building blocks and developed a cabling assembly unit to emulate the backplane. The design provides operational efficiency, 60% better power efficiency, and higher port density. Sankar said Facebook was able to deploy it quickly across its data center regions over the past nine months, and engineers can easily scale Fabric Aggregator up or down according to data center demands.

“It redefines network capacity in our data centers,” she said.

Facebook engineers wrote a detailed description of Fabric Aggregator in a blog post. They submitted the specifications for all the backplane options to the OCP, continuing the company’s tradition of sharing designs. Facebook’s networking contributions to OCP include its Wedge switch and Edge Fabric traffic control system. The company has been a major proponent of network disaggregation, saying traditional proprietary network gear doesn’t provide the flexibility and agility it needs.

Seven years ago, Facebook spearheaded the creation of the Open Compute Project with a focus on open data center components such as racks and servers. The OCP now counts more than 4,000 engineers involved in its various projects and more than 370 specification and design packages, OCP CEO Rocky Bullock said in kicking off this week’s OCP Summit, which drew some 3,000 attendees.  

Microsoft unveils Project Denali

While Facebook built on its disaggregated networking approach, Microsoft announced Project Denali, an effort to create new standards for flash storage to optimize it for the cloud through disaggregation.

Kushagra Vaid, general manager of Azure Infrastructure at Microsoft, said cloud providers are top consumers of flash storage, accounting for billions of dollars in annual spending. SSDs, however, with their “monolithic architecture,” aren’t designed to be cloud-friendly, he said.

Any SSD innovation requires that the entire device be tested, and new functionality isn’t provided in a consistent manner, he said. “At cloud scale, we want to drive every bit of efficiency,” Vaid said. Microsoft engineers wanted to figure out a way to provide the same kind of flexibility and agility with SSDs as disaggregation brought to networking.

“Why can’t we do the same thing with SSDs?” he said.

Project Denali “standardizes the SSD firmware interfaces by disaggregating the functionality for software-defined data layout and media management,” Vaid wrote in a blog post.

“Project Denali is a standardization and evolution of Open Channel that defines the roles of SSD vs. that of the host in a standard interface. Media management, error correction, mapping of bad blocks and other functionality specific to the flash generation stays on the device while the host receives random writes, transmits streams of sequential writes, maintains the address map, and performs garbage collection. Denali allows for support of FPGAs or microcontrollers on the host side,” he wrote.
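To make that split more concrete, here is a rough sketch in Python, with entirely hypothetical class names rather than Microsoft’s actual interface, of how Denali-style disaggregation divides the work: the device keeps flash-specific chores such as bad-block handling, while the host owns the address map, sequential write streams, and garbage collection.

```python
# Illustrative sketch only: hypothetical classes showing how a Denali-style
# split might divide duties between host software and the SSD.
# This is not Microsoft's actual interface.

class DenaliDevice:
    """Device side: media management, ECC, bad-block mapping (flash-specific)."""
    def __init__(self, num_blocks=1024):
        self.blocks = {}            # physical block -> bytes
        self.bad_blocks = set()     # in a real device, remapped transparently

    def write_sequential(self, block, data):
        if block in self.bad_blocks:
            raise IOError("media error handled on the device, not the host")
        self.blocks[block] = data

    def read(self, block):
        return self.blocks.get(block)


class DenaliHost:
    """Host side: address map, random-write buffering, garbage collection."""
    def __init__(self, device):
        self.device = device
        self.address_map = {}       # logical address -> physical block
        self.next_block = 0

    def write(self, logical_addr, data):
        # The host turns random writes into sequential streams and tracks the mapping.
        self.device.write_sequential(self.next_block, data)
        self.address_map[logical_addr] = self.next_block
        self.next_block += 1

    def read(self, logical_addr):
        return self.device.read(self.address_map[logical_addr])
```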

Vaid said this disaggregation provides a lot of benefits. “The point of creating a standard is to give choice and provide flexibility… You can start to think at a bigger scale because of this disaggregation, and have each layer focus on what it does best.”

Microsoft is working with several partners including CNEX Labs and Intel on Project Denali, which it plans to contribute to the OCP.

Hear more from Facebook and the Open Compute Project when they present live at the Network Transformation Summit at Interop ITX, April 30 and May 1 in Las Vegas. Register now!




Data Protection in the Public Cloud: 6 Steps


While cloud security remains a top concern in the enterprise, public clouds are likely to be more secure than your private computing setup. This might seem counterintuitive, but cloud service providers have economies of scale that allow them to spend much more on security tools than any large enterprise, while the cost of that security is spread across millions of users, down to fractions of a cent each.

That doesn’t mean enterprises can hand over all responsibility for data security to their cloud provider. There are still many basic security steps companies need to take, starting with authentication. While this applies to all users, it’s particularly critical for sysadmins: a password compromised on an admin’s mobile device could be the equivalent of handing over the corporate master keys. For admins, multi-factor authentication is critical for secure operations. Adding smartphone-based biometrics as a second or third factor is the latest wave, and there are a lot of creative strategies available.
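As a concrete illustration, here is a minimal sketch of a time-based one-time-password check for an admin login. It assumes the third-party pyotp library and a made-up enrollment flow; a production MFA setup involves much more, but the principle is the same: the password alone is never enough.

```python
# Minimal second-factor check for an admin login, using the third-party
# pyotp library (pip install pyotp). The enrollment and storage details
# here are assumptions for illustration, not a complete MFA system.
import pyotp

# At enrollment time: generate a per-admin secret and share it with the
# admin's authenticator app (for example, via a QR code).
admin_secret = pyotp.random_base32()
print("Provisioning URI:",
      pyotp.TOTP(admin_secret).provisioning_uri(
          name="sysadmin@example.com", issuer_name="CloudOps"))

def verify_admin_login(password_ok: bool, submitted_code: str) -> bool:
    """Require both the password and a valid time-based one-time code."""
    totp = pyotp.TOTP(admin_secret)
    return password_ok and totp.verify(submitted_code)

# Usage: verify_admin_login(True, "123456") is False unless the code matches.
```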

Beyond guarding access to cloud data, what about securing the data itself? We’ve heard of major data exposures occurring when a set of instances is deleted but the corresponding data isn’t. After a while, these files get loose and can lead to some interesting reading. This is pure carelessness on the part of the data owner.

There are two answers to this issue. For larger cloud setups, I recommend a cloud data manager that tracks all data and spots orphan files. That should stop the wandering buckets, but what about the case when a hacker gets in, by whatever means, and can reach useful, current data? The answer, simply, is good encryption.
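Before getting to encryption, here is the kind of orphan sweep a cloud data manager automates, sketched with AWS’s boto3 SDK. The untagged-bucket heuristic and the credentials setup are assumptions for illustration; real tools track far more than this.

```python
# A sketch of an orphan-data sweep using boto3 against AWS
# (pip install boto3). Credentials and region come from your environment;
# the "no tags means suspicious" rule is just an illustrative heuristic.
import boto3
from botocore.exceptions import ClientError

def find_unattached_volumes():
    """EBS volumes in the 'available' state have no instance attached: classic orphans."""
    ec2 = boto3.client("ec2")
    resp = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}])
    return [v["VolumeId"] for v in resp["Volumes"]]

def find_untagged_buckets():
    """S3 buckets with no tags at all are worth a closer look."""
    s3 = boto3.client("s3")
    orphans = []
    for bucket in s3.list_buckets()["Buckets"]:
        try:
            s3.get_bucket_tagging(Bucket=bucket["Name"])
        except ClientError:
            orphans.append(bucket["Name"])
    return orphans

if __name__ == "__main__":
    print("Unattached volumes:", find_unattached_volumes())
    print("Untagged buckets:", find_untagged_buckets())
```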

Encryption is a bit more involved than using PKZIP on a directory. AES-256 encryption or better is essential. Key management is crucial: having one admin with the key is a disaster waiting to happen, while writing it down on a sticky note goes to the opposite extreme. One option offered by cloud providers is drive-based encryption, but this fails on two counts. First, drive-based encryption usually has only a few keys to select from and, guess what, hackers can readily access a list of them on the internet. Second, the data has to be decrypted by the network storage device to which the drive is attached, then re-encrypted (or not) as it’s sent to the requesting server. There are lots of security holes in that process.

End-to-end encryption, where data is encrypted on the server with a key that stays there, is far better. It stops downstream vulnerabilities from being an issue while also adding protection against packet sniffing.
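Here is what that looks like in practice: a minimal sketch of sealing objects with AES-256-GCM before they ever leave your side, using the third-party cryptography package. Key handling is deliberately simplified; in a real deployment the key would live in a KMS or HSM, not in the code.

```python
# Minimal sketch of encrypting data before it reaches cloud storage, using
# the third-party "cryptography" package (pip install cryptography).
# Key storage and rotation are out of scope here and belong in a KMS/HSM.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # AES-256 key, kept on your side

def encrypt_for_upload(plaintext: bytes, associated_data: bytes = b"") -> bytes:
    aesgcm = AESGCM(key)
    nonce = os.urandom(12)                  # 96-bit nonce, unique per object
    return nonce + aesgcm.encrypt(nonce, plaintext, associated_data)

def decrypt_after_download(blob: bytes, associated_data: bytes = b"") -> bytes:
    aesgcm = AESGCM(key)
    nonce, ciphertext = blob[:12], blob[12:]
    return aesgcm.decrypt(nonce, ciphertext, associated_data)

# The cloud provider, and any device in between, only ever sees ciphertext.
sealed = encrypt_for_upload(b"quarterly-results.csv contents")
assert decrypt_after_download(sealed) == b"quarterly-results.csv contents"
```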

Data sprawl is easy to create in the cloud, and it opens up another security risk, especially if much of cloud management is decentralized to departmental computing or even individual users. Cloud data management tools address this far better than written policies. It’s also worth adding global deduplication to the storage management mix, which reduces the exposure footprint considerably.
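For those unfamiliar with how deduplication shrinks that footprint, the toy sketch below shows the content-hashing idea at its simplest: identical chunks are stored once and referenced many times. Production systems use variable-size chunking and dedupe globally across the whole estate.

```python
# Toy illustration of content-addressed (hash-based) deduplication.
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024        # 4 MB fixed chunks, for simplicity
chunk_store = {}                    # sha256 digest -> chunk bytes

def store_file(data: bytes) -> list:
    """Return a recipe (list of digests) that reconstructs the file."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)   # duplicate chunks cost nothing extra
        recipe.append(digest)
    return recipe

def restore_file(recipe: list) -> bytes:
    return b"".join(chunk_store[d] for d in recipe)
```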

Finally, the whole question of how to back up data is in flux. Traditional backup and disaster recovery have moved from in-house tape and disk to the cloud as the preferred storage medium. The question now is whether a formal backup process is still the right strategy, as opposed to snapshot or continuous backup systems. The snapshot approach is growing, due to the value of small recovery windows and limited data-loss exposure, but there may be risks in not having separate backup copies, perhaps stored in different clouds.

On the next pages, I take a closer look at ways companies can protect their data when using the public cloud.

(Image: phloxii/Shutterstock)




6 Ways to Transform Legacy Data Storage Infrastructure


So you have a bunch of EMC RAID arrays and a couple of Dell iSCSI SAN boxes, topped with a NetApp filer or two. What do you say to the CEO who reads my articles and knows enough to ask about solid-state drives, all-flash appliances, hyperconverged infrastructure, and all the other new innovations in storage? “Er, er, we should start over” doesn’t go over too well! Thankfully, there are some clever — and generally inexpensive — ways to answer the question, keep your job, and even get a pat on the back.

SSD and flash are game-changers, so they need to be incorporated into your storage infrastructure. SSDs beat enterprise-class hard drives on a total-cost basis because they speed up your workload and reduce the number of storage appliances and servers needed. It’s even better if your servers support NVMe, since the interface is becoming ubiquitous and will replace both SAS and, a bit later, SATA, simply because it’s much faster and has lower overhead.

As for RAID arrays, we have to face the harsh reality that a RAID controller can only keep up with a few SSDs. The answer is either an all-flash array, with the RAID arrays kept for cool or cold secondary storage, or a move to a new architecture based on hyperconverged appliances or compact storage boxes tailored for SSDs.

All-flash arrays become a fast storage tier, today usually Tier 1 storage in a system. They are designed to bolt onto an existing SAN and require minimal change in configuration files to function. Typically, all-flash boxes have smaller capacities than the RAID arrays, since they have enough I/O cycles to do near-real-time compression coupled with the ability to down-tier (compress) data to the old RAID arrays.

With an all-flash array, which isn’t outrageously expensive, you can boast to the CEO about 10-fold boosts in I/O speed, much lower latency, and, as a bonus, a combination of flash and secondary storage that usually delivers 5X effective capacity due to compression. Just tell the CEO how many RAID arrays and drives you didn’t buy. That’s worth a hero badge!

The idea of a flash front end works for desktops, too. Use a small flash drive for the OS (the C: drive) and store colder data on those 3.5” HDDs. Your desktop will boot really quickly, especially with Windows 10, and program loads will be a snap.

Within servers, the challenge is to make the CPU, rather than the rest of the system, the bottleneck. Adding SSDs as primary drives makes sense, with HDDs in older arrays doing duty as bulk secondary storage, just as with all-flash solutions. This idea has evolved into the hyperconverged infrastructure (HCI) concept, where the drives in each node are shared with other servers in lieu of dedicated storage boxes. While HCI is a major philosophical change, the effort to get there isn’t that huge.

For the savvy storage admin, RAID arrays and iSCSI storage can both be turned into powerful object storage systems. Both support a JBOD (just a bunch of drives) mode, and if the JBODs are attached across a set of server nodes running “free” Ceph or Scality Ring software, the result is a decent object-storage solution, especially if compression and global deduplication are supported.
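Because the RADOS Gateway in Ceph (and Scality RING) exposes an S3-compatible API, the resulting object store can be driven with everyday tooling. The sketch below uses boto3 pointed at a placeholder gateway endpoint; the address, credentials, and bucket name are illustrative only.

```python
# Once Ceph (via its RADOS Gateway) or Scality RING is running, the object
# store speaks an S3-compatible API, so standard S3 tooling works against it.
# Endpoint, credentials, and bucket name below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.internal.example.com:7480",  # your RADOS Gateway
    aws_access_key_id="CEPH_ACCESS_KEY",
    aws_secret_access_key="CEPH_SECRET_KEY",
)

s3.create_bucket(Bucket="cold-archive")
s3.upload_file("/data/exports/2018-03.tar.gz", "cold-archive", "2018-03.tar.gz")

for obj in s3.list_objects_v2(Bucket="cold-archive").get("Contents", []):
    print(obj["Key"], obj["Size"])
```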

Likely by now, you are using public clouds for backup. Consider “perpetual” storage using a snapshot tool or continuous backup software to reduce your RPO and RTO. Use multi-zone operations in the public cloud to converge DR onto the perpetual storage setup as part of a cloud-based DR process. Going to the cloud for backup should save a lot of capital expense.
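The snapshot-style approach is easy to script in a public cloud. The sketch below uses boto3 to snapshot every EBS volume attached to an instance; the instance ID is a placeholder, and a real setup would run on a schedule and prune old snapshots to control cost.

```python
# Sketch of the snapshot approach in AWS using boto3. The instance ID is a
# placeholder; scheduling and snapshot retention are left out for brevity.
import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2")

def snapshot_instance_volumes(instance_id: str):
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}])
    for vol in volumes["Volumes"]:
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
        snap = ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description=f"perpetual-backup {instance_id} {stamp}")
        print("Created", snap["SnapshotId"], "for", vol["VolumeId"])

# snapshot_instance_volumes("i-0123456789abcdef0")
```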

On the software front, the world of IT is migrating to services-centric software-defined storage (SDS), which allows scaling and chaining of data services via a virtualized microservices approach. Even older SANs and server drives can be pulled into the methodology, with software making all the legacy boxes in a data center operate as a single pool of storage. This simplifies storage management and makes data center storage more flexible.

Encryption ought to be added to any networked storage or backup. If this prevents even one hacker from reading your files in the next five years, you’ll look good! If you are running into a space crunch and the budget is tight, separate out your cold data, apply one of the “Zip” programs and choose the encrypted file option. This saves a lot of space and gives you encryption!
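If you want to script that cold-data step, one option is the third-party pyzipper package, which adds AES encryption to Python’s standard zipfile interface. The paths and passphrase below are placeholders; keep the real passphrase in a vault, not in the script.

```python
# One way to script "zip it and encrypt it" for cold data, using the
# third-party pyzipper package (pip install pyzipper). Paths and the
# passphrase are placeholders for illustration.
import os
import pyzipper

def archive_cold_data(src_dir: str, archive_path: str, passphrase: bytes):
    with pyzipper.AESZipFile(archive_path, "w",
                             compression=pyzipper.ZIP_DEFLATED,
                             encryption=pyzipper.WZ_AES) as zf:
        zf.setpassword(passphrase)
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, arcname=os.path.relpath(full, src_dir))

# archive_cold_data("/data/cold", "/backups/cold-2018Q1.zip", b"use-a-vault-here")
```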

Let’s take a closer look at what you can do to transform your existing storage infrastructure and extend its life.

(Image: Production Perig/Shutterstock)




What NVMe over Fabrics Means for Data Storage


NVMe-oF will speed adoption of Non-Volatile Memory Express in the data center.

The last few years have seen Non-Volatile Memory Express (NVMe) completely revolutionize the storage industry. Its wide adoption has driven down flash memory prices. With lower prices and better performance, more enterprises and hyper-scale data centers are migrating to NVMe. The introduction of NVMe over Fabrics (NVMe-oF) promises to accelerate this trend.

The original base specification of NVMe is designed as a protocol for storage on flash memory that uses existing, unmodified PCIe as a local transport. This layered approach is very important. NVMe does not create a new electrical or frame layer; instead it takes advantage of what PCIe already offers. PCIe has a well-known history as a high-speed, interoperable bus technology. However, it’s not well suited for building a large storage fabric or covering distances longer than a few meters. With that limitation, NVMe on its own would be confined to direct-attached storage (DAS), essentially connecting SSDs to the processor inside a server, or perhaps connecting all-flash arrays (AFAs) within a rack. NVMe-oF allows things to be taken much further.

Connecting storage nodes over a fabric is important because it allows multiple paths to a given storage resource. It also enables concurrent operations against distributed storage and provides a means to manage potential congestion. Further, it allows thousands of drives to be connected in a single pool of storage, since the pool is no longer limited by the reach of PCIe and can instead take advantage of a fabric technology like RoCE or Fibre Channel.

NVMe-oF describes a means of binding regular NVMe protocol over a chosen fabric technology, a simple abstraction enabling native NVMe commands to be transported over a fabric with minimal processing to map the fabric transport to PCIe and back.  Product demonstrations have shown that the latency penalty for accessing an NVMe SSD over a fabric as opposed to a direct PCIe link can be as low as 10 microseconds.

The layered approach means that a binding specification can be created for any fabric technology, although some fabrics may be better suited for certain applications. Today there are bindings for RDMA (RoCE, iWARP, Infiniband) and Fibre Channel. Work on a binding specification for TCP/IP has also begun.
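From a Linux host’s point of view, attaching a remote NVMe namespace is straightforward. The sketch below wraps the standard nvme-cli utility and assumes an RDMA (RoCE) binding; the target address and NQN are placeholders.

```python
# Sketch of attaching a remote NVMe-oF namespace from a Linux host by
# wrapping the standard nvme-cli tool. Transport, address, and NQN are
# placeholders; an RDMA (RoCE) fabric binding is assumed.
import subprocess

def connect_nvmeof_target(transport="rdma",
                          target_addr="192.168.10.20",
                          svc_port="4420",
                          subsystem_nqn="nqn.2018-03.com.example:array01"):
    """Attach the remote subsystem; its namespaces appear as local /dev/nvmeXnY."""
    subprocess.run(
        ["nvme", "connect",
         "-t", transport,          # rdma or fc today; tcp once that binding lands
         "-a", target_addr,        # target's fabric address
         "-s", svc_port,           # NVMe-oF service ID (4420 is the default)
         "-n", subsystem_nqn],     # NVMe Qualified Name of the subsystem
        check=True)

def list_nvme_devices():
    subprocess.run(["nvme", "list"], check=True)
```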

Different products will use this layered capability in different ways. A simple NVMe-oF target, consisting of an array of NVMe SSDs, may expose all of its drives individually to the NVMe-oF host across the fabric, allowing the host to access and manage each drive individually. Other solutions may take a more integrated approach, using the drives within the array to create one big pool of storage that is offered to the NVMe-oF initiator. With this approach, management of drives can be done locally within the array, without requiring the attention of the NVMe-oF initiator or any higher-layer software application. This also allows the NVMe-oF target to implement and offer NVMe protocol features that may not be supported by the drives within the array.

A good example of this is a secure erase feature. A lower-cost drive may not support the feature, but if that drive is put into an NVMe-oF AFA target, the AFA can implement secure erase itself and present it to the initiator. The NVMe-oF target then handles the operations to the lower-cost drive so that, from the initiator’s perspective, the feature is properly supported. This gives implementers a great deal of flexibility to meet customer needs by varying hardware vs. software feature implementation, drive cost, and performance.

The recent plugfest at UNH-IOL focused on testing simple RoCE and Fibre Channel fabrics. In these tests, a single initiator and target pair were connected over a simple two-switch fabric. UNH-IOL performed NVMe protocol conformance testing, generating storage traffic to ensure data could be transferred error-free. Additionally, testing involved inducing network disruptions to ensure the fabric could recover properly and transactions could resume.

In the data center, storage is used to support many different types of applications with an unending variety of workloads. NVMe-oF has been designed to enable flexibility in deployment, offering choices for drive cost and features support, local or remote management, and fabric connectivity. This flexibility will enable wide adoption. No doubt, we’ll continue to see expansion of the NVMe ecosystem.




How Spectre and Meltdown Impact Data Center Storage


IT news over the last few weeks has been dominated by stories of vulnerabilities found in Intel x86 chips and almost all other modern processors. The two exposures, Spectre and Meltdown, are a result of the speculative execution that CPUs use to anticipate the flow of code execution and keep internal instruction pipelines filled as optimally as possible. It’s been reported that the Spectre/Meltdown fixes can have an impact on I/O, which means storage products could be affected. So, what are the impacts, and what should data center operators and storage pros do?

Speculative execution

Speculative execution is a performance-improvement technique used by modern processors in which instructions are executed before the processor knows whether they will be needed. Imagine some code that branches on the result of a logic comparison. Without speculative execution, the processor would have to wait for that comparison to complete before continuing to read ahead, resulting in a drop in performance. Speculative execution allows both (or all) branches of the logic to be followed; results from the paths not ultimately taken are simply discarded, and the processor is kept busy.

Both Spectre and Meltdown pose the risk of unauthorized access to data via this speculative execution process. A more detailed breakdown of the problem is available in the two papers covering the vulnerabilities (here and here). Vendors have released OS and BIOS workarounds for the exposures. Meltdown fixes have noticeably impacted performance on systems with high I/O activity, due to the extra work needed to isolate user and system memory during context switches (syscalls). Reports range from 5% to 50% additional CPU overhead, depending on the specific platform and workload.

Storage repercussions

How could this impact storage appliances and software? Over the last few years, almost all storage appliances and arrays have migrated to the Intel x86 architecture. Many are now built on Linux or Unix kernels, which means they are directly affected by the processor vulnerabilities and, once patched, face increased system load and higher latency.

Software-defined storage products are also potentially impacted, as they run on generic operating systems like Linux and Windows. The same applies to virtual storage appliances running in VMs, to hyperconverged infrastructure, and of course to public cloud storage instances and I/O-intensive cloud applications. Quantifying the impact is difficult because it depends on the number of system calls the storage software has to make; some products may be more affected than others.

Vendor response

Storage vendors have had mixed responses to the CPU vulnerabilities. For appliances or arrays that are deemed to be “closed systems” and not able to run user code, their stance is that these systems are unaffected and won’t be patched.

Where appliances can run external code, such as Pure Storage’s FlashArray, which can execute user code via a feature called Purity Run, there will be a need to patch. Similarly, end users running SDS solutions on generic operating systems will need to patch. HCI and hypervisor vendors have already started to make announcements about patching, although the results have been varied. VMware, for instance, released a set of patches only to recommend against installing them due to customer issues. Intel’s advisory earlier this week warning of problems with its patches has added to the confusion.

Some vendors, such as Dell EMC, haven’t made public statements about the impact of the vulnerabilities across all of their products. For example, Dell legacy storage product information is openly available, while information about Dell EMC products sits behind support firewalls. If you’re a user of those platforms you will have access; however, for wider market context it would have been helpful to see a consolidated response in order to assess the risk.

Reliability

So far, the patches released don’t seem to be very stable. Some have been withdrawn; others have crashed machines or made them unbootable. Vendors are in a difficult position, because details of the vulnerabilities weren’t widely circulated in the community before the news became public, and some storage vendors only found out about the issue when it broke in the press. As a result, some patches may have been rushed to market without full testing of their impact once applied.

To patch or not?

What should end users do? First, it’s worth evaluating the risk and impact of either applying or not applying patches. Computers that are regularly exposed to the internet, such as desktops and public cloud instances (including virtual storage appliances running in a cloud instance), are likely to be most at risk, whereas storage appliances behind a corporate firewall on a dedicated storage management network are at lowest risk. Weigh this risk against the impact of applying the patches and what could go wrong. Applying patches to a storage platform supporting hundreds or thousands of users, for example, is a process that needs thinking through.

Action plan

Start by talking to your storage vendors. Ask them why they believe their platforms are exposed or not. Ask what testing of patching has been performed, from both a stability and performance perspective. If you have a lab environment, do some before/after testing with standard workloads. If you don’t have a lab, ask your vendor for support.
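One simple before/after check on Linux hosts is to read the mitigation status the kernel itself reports. The sketch below does just that; recent patched kernels expose these files under sysfs, while older or unpatched kernels won’t have the directory at all.

```python
# Quick check of what a Linux host reports about Spectre/Meltdown mitigations,
# useful before and after patching. Recent kernels expose this under sysfs.
import glob
import os

def report_cpu_vulnerabilities():
    path = "/sys/devices/system/cpu/vulnerabilities"
    if not os.path.isdir(path):
        print("No vulnerability reporting in this kernel (likely unpatched).")
        return
    for entry in sorted(glob.glob(os.path.join(path, "*"))):
        with open(entry) as f:
            print(f"{os.path.basename(entry):<12} {f.read().strip()}")

if __name__ == "__main__":
    report_cpu_vulnerabilities()
```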

As there are no known exploits in the wild for Spectre/Meltdown, a wise approach is probably to wait a little before applying patches. Let the version 1 fixes be tested in the wild by other folks first. Invariably issues are found that then get corrected by another point release. Waiting a little also gives time for vendors to develop more efficient patches, rather than ones that simply act as a workaround. In any event, your approach will depend on your particular set of circumstances.


