Tag Archives: Performance

Assess USB Performance While Exploring Storage Caching | Linux.com

The team here at the Dragon Propulsion Laboratory has kept busy building multiple Linux clusters of late [1]. Some of the designs rely on spinning disks or SSDs, whereas others use low-cost USB storage or even SD cards as boot media. In the process, I was quickly reminded of the limits of external storage media: not all flash is created equal, and in some crucial ways external drives, SD cards, and USB keys can be fundamentally different.

Turtles All the Way Down

Mass storage performance lags that of working memory in the Von Neumann architecture [2], with the need to persist data leading to the rise of caches at multiple levels in the memory hierarchy. An access speed gap of three orders of magnitude between levels makes this design decision essentially inevitable where performance is at all a concern. (See Brendan Gregg’s table of computer speed in human time [3].) The operating system itself provides the most visible manifestation of this design in Linux: Any RAM not allocated to a running program is used by the kernel to cache the reads from and buffer the writes to the storage subsystem [4], leading to the often-repeated quip that there is really no such thing as “free memory” in a Linux system.

An easy way to observe the operating system (OS) buffering a write operation is to write the right amount of data to a disk in a system with lots of RAM, as shown in Figure 1, in which a rather improbable half a gigabyte worth of zeros is written to a generic, low-cost USB key in half a second, but the operation then incurs a 30-second delay when the system is forced to sync [5] to disk.
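The same effect can be sketched in a few lines of Python. This is a minimal illustration, not the command used in Figure 1: the function name time_buffered_write, the chunk size, and the target path are assumptions made for the example, and os.fsync flushes only the one file rather than the whole system as sync does. The write phase returns once the data sits in the kernel's buffers, and the fsync phase waits for the device.

```python
import os
import tempfile
import time


def time_buffered_write(path, size_bytes, chunk_bytes=1024 * 1024):
    """Write zeros to path, then flush them; return (write_s, sync_s).

    Hypothetical helper for illustration only.
    """
    chunk = bytes(chunk_bytes)  # a block of zero bytes
    remaining = size_bytes
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter()
        while remaining > 0:
            # os.write returns the number of bytes actually accepted
            written = os.write(fd, chunk[:min(chunk_bytes, remaining)])
            remaining -= written
        buffered = time.perf_counter()
        # Block until the kernel has pushed this file's data to the device
        os.fsync(fd)
        synced = time.perf_counter()
    finally:
        os.close(fd)
    return buffered - start, synced - buffered


if __name__ == "__main__":
    # Assumption: a small file in a temporary directory; point the path at a
    # mounted USB key and raise the size to see a gap like the one described.
    with tempfile.TemporaryDirectory() as tmp:
        write_s, sync_s = time_buffered_write(
            os.path.join(tmp, "zeros.bin"), 16 * 1024 * 1024
        )
        print(f"write: {write_s:.3f} s, fsync: {sync_s:.3f} s")
```

On fast internal storage the two timings may be close; the gap grows with slower media and with larger writes, up to the point where the kernel starts throttling writers.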

Read more at ADMIN magazine

GCC 8/9 vs. LLVM Clang 7/8 Compiler Performance On AArch64

With Clang 8.0 due out by month’s end and GCC 9 due for release not long after that point, this week we’ve been running a number of GCC and Clang compiler benchmarks on Phoronix. At the start of the month we ran the large Linux x86_64 GCC vs. Clang compiler benchmarks on twelve different Intel/AMD systems, while last week we looked at POWER9 compiler performance on the Raptor Talos II. In this article we are checking out these open-source compilers’ performance on 64-bit ARM (AArch64) using an Ampere eMAG 32-core server.

To round out our look at GCC performance across various architectures, the ARMv8 compiler benchmarking used the Ampere eMAG 3.0GHz 32-core server, the most interesting 64-bit ARM server we have available locally for testing. Fedora Server 29 AArch64 loaded up nicely on this eMAG server and provided a fresh toolchain and other software packages. The server had 128GB of RAM and a Samsung 850 256GB SSD for storage, and the updated Fedora 29 installation was running the fresh Linux 4.20 kernel.

Like with the other compiler comparisons this month, GCC 8.2.0, GCC 9.0.1 (snapshot 20190203, ahead of the stable GCC 9.1.0 release), LLVM Clang 7.0.1, and LLVM Clang 8.0-RC2 were used for benchmarking. All four of these compilers were built on this Ampere eMAG server and built in their release/optimized (non-debug) modes.

During the various benchmarks driven via the Phoronix Test Suite, the CFLAGS/CXXFLAGS were set to “-O3 -march=armv8-a+crypto+crc+aes+sha2” (the -march=native functionality seems to have regressed for GCC 9 on AArch64, so these options were specified manually for testing). Now let’s see how the GCC vs. Clang AArch64 Linux performance is looking across dozens of benchmarks.
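As a rough sketch of how such flags might be applied to a build, the snippet below prepares an environment with CFLAGS and CXXFLAGS set to the string quoted above. The helper name benchmark_env is hypothetical, and how any given benchmark harness consumes these variables is an assumption; the article does not describe that mechanism.

```python
import os

# Flag string quoted in the article; everything else here is illustrative.
AARCH64_FLAGS = "-O3 -march=armv8-a+crypto+crc+aes+sha2"


def benchmark_env(flags=AARCH64_FLAGS, base=None):
    """Return a copy of the environment with CFLAGS and CXXFLAGS set.

    Hypothetical helper: base defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["CFLAGS"] = flags
    env["CXXFLAGS"] = flags
    return env


if __name__ == "__main__":
    env = benchmark_env()
    print(env["CFLAGS"])
```

The returned dictionary can be passed as the env argument of subprocess.run when launching a build, which leaves the calling process's own environment untouched.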

More Benchmarks Of The Improved Linux Performance With Glibc 2.29

Yesterday I posted some initial benchmarks looking at the performance improvements with Glibc 2.29, the newest feature release of the GNU C Library. Here are more benchmarks on eight different systems using Glibc 2.29 on Clear Linux.

With Clear Linux being the first distribution with Glibc 2.29 readily available, here are more performance tests of this rolling-release distribution before/after the Glibc 2.29 upgrade on an assortment of eight different Intel systems of varying generations.

All of the benchmarks, of course, were carried out via the Phoronix Test Suite. This round-up of data is complementary to yesterday’s article.

The FLAC and LAME MP3 encoding performance, which also improved in yesterday’s tests with the Core i9 7980XE, appears to benefit from AVX-512 optimizations, based upon this data set… The Xeon Silver 4108 used in the testing does support AVX-512, and these single-threaded audio encoding tests seem to do a lot better in this case with Glibc 2.29 over Glibc 2.28.

Across-the-board improvements were found with the benchmark for R, the statistical computing language.

Some operations with the Glibc micro-benchmarks like the square root function were faster on Glibc 2.29.

Another real-world test translating to performance improvements across the board was SciKit-Learn.

If you didn’t yet read yesterday’s article, be sure to see those Glibc 2.29 benchmarks for additional tests.

Netflix Continues Experiencing Great Performance In Using FreeBSD For Their CDN

It’s been a love affair going on for years, but should you not already know, Netflix has long been leveraging FreeBSD as part of its in-house content delivery network (CDN) for serving its millions of users with on-demand video. This weekend at FOSDEM, Jonathan Looney of the company talked about their usage of FreeBSD.

Netflix remains one of the big FreeBSD users and continues leveraging that BSD operating system for its network performance on its “Open Connect” CDN. What is even more unusual about the company’s FreeBSD setup is that it closely tracks the CURRENT/head version of FreeBSD rather than sticking to the stable releases.

With FreeBSD on commodity server hardware they are able to achieve 90 Gb/s serving on TLS-encrypted connections without even reaching full CPU utilization. They rely upon the very latest FreeBSD code in order to stay up-to-date with bleeding-edge features and capabilities. Netflix also tries to upstream their FreeBSD changes where deemed suitable.

For those who missed the presentation at FOSDEM 2019 in Brussels and want to learn more about Netflix’s usage of FreeBSD, the PDF slide deck is available.

AMDGPU-PRO 18.50 vs. ROCm 2.0 OpenCL Performance

When recently publishing the PlaidML deep learning benchmarks and lczero chess neural network OpenCL tests, some Phoronix readers mentioned they were seeing vastly different results when using the PAL OpenCL driver in AMDGPU-PRO (Radeon Software) compared to using the ROCm compute stack. To see how those two separate AMD OpenCL drivers compare, here are some benchmark results with a Vega GPU while testing ROCm 2.0 and AMDGPU-PRO 18.50.

The Radeon Software AMDGPU-PRO 18.50 PAL OpenCL driver was benchmarked first, followed by various tests while using the ROCm 2.0 OpenCL compute driver. ROCm 2.0 is ultimately more full-featured than the former OpenCL driver code, but there is quite a large difference in performance depending upon the workload, both for better and worse.

Tests were done with a Radeon RX Vega 64 graphics card on an AMD Ryzen Threadripper 2990WX box running Ubuntu 18.04 LTS with the stock Linux 4.15 kernel. The OpenCL GPU benchmarking was carried out using the open-source Phoronix Test Suite benchmarking software.