Tag Archives: Performance

Linux Networking Performance To Improve Thanks To Retpoline Overhead Reduction


Networking has been one of the areas where Linux performance dropped this year after Spectre came to light, but the upcoming Linux 4.21 cycle will partially address that.

Linux networking performance took a hit at the start of the year when Retpolines (“return trampolines”) were introduced to address Spectre Variant Two.

Developer Paolo Abeni has been working to offset the Retpoline overhead with a new patch series now destined for Linux 4.21. From the patch series, “We can partially address that when the function pointer refers to a builtin symbol resorting to a list of tests vs well-known builtin function and direct calls. Experimental results show that replacing a single indirect call via retpoline with several branches and a direct call gives performance gains even when multiple branches are added – 5 or more…This may lead to some uglification around the indirect calls…Overall this gives [greater than] 10% performance improvement for UDP GRO benchmark and smaller but measurable for TCP syn flood.”
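
The idea described in that quote can be sketched in plain user-space C. This is only an illustration under stated assumptions, not code from the patch series: the handler functions and the `CALL_KNOWN_2` macro name are hypothetical. The function pointer is compared against a short list of well-known targets, and on a match the target is called directly, so the (retpoline-protected) indirect call is only the fallback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handlers standing in for built-in kernel symbols. */
typedef int (*rx_handler_t)(int len);

int udp_rx(int len) { return len + 8; }   /* pretend UDP header cost */
int tcp_rx(int len) { return len + 20; }  /* pretend TCP header cost */
int raw_rx(int len) { return len; }       /* not in the "well-known" list */

/*
 * Hypothetical wrapper: test the pointer against two known targets and
 * call the match directly; otherwise fall back to the indirect call.
 * The direct calls are ordinary branches the compiler can emit without
 * a retpoline thunk.
 */
#define CALL_KNOWN_2(fn, f1, f2, arg) \
    ((fn) == (f1) ? f1(arg) : (fn) == (f2) ? f2(arg) : (fn)(arg))

int deliver(rx_handler_t fn, int len)
{
    assert(fn != NULL);
    return CALL_KNOWN_2(fn, udp_rx, tcp_rx, len);
}
```

Calling `deliver(udp_rx, 100)` or `deliver(tcp_rx, 100)` takes one of the direct-call branches, while `deliver(raw_rx, 100)` still goes through the function pointer. The trade-off is the one the quote mentions: a few extra compares and some uglier call sites in exchange for avoiding the indirect call on the common path.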

Linux networking subsystem maintainer David Miller has already expressed his plans to pull this into net-next, making it material for the Linux 4.21 cycle.

Another Linux 4.20 Performance Regression Has Now Been Addressed (THP)


The bumpy Linux 4.19~4.20 road continues but at least another performance regression is now crossed off.

Google’s David Rientjes has landed a patch in mainline Linux 4.20 Git as of yesterday that restores node-local hugepage allocations. Changes to Transparent Huge-Pages (THP), a feature itself designed to improve performance and make huge-pages easier to utilize, had introduced a performance regression back during the 4.20 merge window.

In terms of the 4.20 performance regression, “On Haswell, [one of the problematic commits] was shown to have a 13.9% access regression after this commit for binaries that remap their text segment to be backed by transparent hugepages… If remote memory is also low or fragmented, not setting __GFP_THISNODE was also measured on Haswell to have a 40% regression in allocation latency.”

More details on the changes to Transparent Huge-Pages are available via this kernel commit.
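
For readers unfamiliar with how a process ends up backed by transparent hugepages in the first place, here is a minimal user-space sketch. It does not reproduce the kernel patch, and `__GFP_THISNODE` is a kernel-internal allocation flag that user code cannot set. The helper names are hypothetical and the 2 MiB huge-page size is an assumption that holds for typical x86-64 systems.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Assumed huge-page size (2 MiB, typical for x86-64). */
#define HUGE_2M ((size_t)2 * 1024 * 1024)

/* Round a length up to a whole number of assumed huge pages. */
size_t thp_round_up(size_t len)
{
    return (len + HUGE_2M - 1) / HUGE_2M * HUGE_2M;
}

/*
 * Map anonymous memory and hint that the kernel may back it with
 * transparent hugepages. Returns NULL if the mapping itself fails.
 */
void *thp_alloc(size_t len)
{
    size_t size = thp_round_up(len);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    /* Advisory only: the kernel may ignore it, so failure is not fatal. */
    (void)madvise(p, size, MADV_HUGEPAGE);
#endif
    return p;
}
```

Whether the hugepages for such a region come from the local NUMA node or a remote one is decided inside the kernel, which is the behavior the quoted regression figures are about.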

DragonFlyBSD 5.4 & FreeBSD 12.0 Performance Benchmarks, Comparison Against Linux

Coincidentally the DragonFlyBSD 5.4 release and FreeBSD 12.0 lined up to be within a few days of each other, so for an interesting round of benchmarking here is a look at DragonFlyBSD 5.4 vs. 5.2.2 and FreeBSD 12.0 vs. 11.2 on the same hardware as well as comparing those BSD operating system benchmark results against Ubuntu 18.04.1 LTS, Clear Linux, and CentOS 7 for some Linux baseline figures.

DragonFlyBSD 5.4 introduced NUMA optimizations, an upgrade from GCC 5 to GCC 8 as the base compiler, HAMMER2 file-system improvements, and many other enhancements built up over the past half-year.

FreeBSD 12.0 meanwhile upgrades its default LLVM Clang compiler, improves support for Threadripper/Ryzen 2 processors, deprecates many of its 10/100 network drivers, gives ext2fs full read/write support for EXT4, and adds a lot of new hardware support among other improvements. FreeBSD 12.0 should be officially announced within the next few days; this testing used 12.0-RC3, which is effectively the final build aside from any last-minute fixes.

Testing of these BSDs and Linux distributions was done on the same system, which consisted of an Intel Core i9 7980XE (18 cores / 36 threads at stock speeds), ASUS PRIME X299-A motherboard, 4 x 4GB DDR4-3200 memory, 240GB Corsair Force MP510 NVMe SSD, and GeForce GTX TITAN X graphics card. The operating systems were kept “out of the box” as much as possible to represent the default, vendor-supplied experience users will see. Highlights of the operating systems tested:

DragonFlyBSD 5.2.2 – The previous stable release of DragonFly, which shipped with the GCC 5.4.1 compiler and was installed with HAMMER2.

DragonFlyBSD 5.4.0 – The newly-minted DragonFlyBSD update that switches over to GCC 8.1 and many other updates in the process, including more mature HAMMER2 support.

FreeBSD 11.2 – The stock 11.2-RELEASE setup with ZFS and using the default Clang 6 compiler.

FreeBSD 12.0 – The RC3 release was tested with its default Clang 6.0.1 compiler and ZFS file-system.

FreeBSD 12.0 + GCC8 – While the FreeBSD camp remains steadfast with using LLVM/Clang over GCC, for those wondering how the performance changes when switching over to GCC, a secondary run was used with GCC 8.2 installed.

CentOS 7.6 – The current community RHEL7 release with its Linux 3.10 based kernel, GCC 4.8.5 compiler, and XFS file-system.

Clear Linux 26670 – Intel’s open-source Linux distribution that often sets the gold standard for Linux performance thanks to its many optimizations, ranging from patched packages to compiler tuning and plenty of other tweaks, that yield incredible performance potential without much work or time from its users. Clear Linux 26670 relies upon Linux 4.19 and GCC 8.2.1 with the EXT4 file-system.

Ubuntu 18.04.1 – The current Ubuntu LTS release with Linux 4.15, GCC 7.3, and EXT4 file-system.

Coming up later this month will be a larger Linux vs. BSD server benchmark comparison on dual-socket Intel Xeon and AMD EPYC hardware, which will include a more diverse range of distributions. The purpose of this Core i9 comparison is just to get an idea of the DragonFlyBSD/FreeBSD performance changes in their new releases, with a few Linux distributions for reference.

All of these BSD and Linux distribution benchmarks were carried out in a fully-automated and reproducible manner using the open-source Phoronix Test Suite benchmarking software.

The Radeon RX Vega Performance With AMDGPU DRM-Next 4.21 vs. NVIDIA Linux Gaming

Given the AMDGPU changes building up in DRM-Next to premiere in Linux 4.21, which come on top of the AMDGPU performance boost with Linux 4.20, here are some benchmarks of Linux 4.19 vs. 4.20 Git vs. DRM-Next (Linux 4.21 material) with the Radeon RX Vega 64 compared to the relevant NVIDIA GeForce competition.

The Radeon RX Vega 64 tests were done with Linux 4.19.5, Linux 4.20 Git as of Saturday afternoon, and DRM-Next-4.21-WIP from Alex Deucher’s Git tree as of Saturday for the latest Linux 4.21 material. The user-space drivers were Mesa 19.0-devel built against LLVM 8.0 SVN via the Padoka PPA. The RX Vega 64 was judged against the GeForce GTX 1070, GTX 1070 Ti, GTX 1080, and GTX 1080 Ti graphics cards as the closest competition to Vega. A fresh large graphics card comparison through the RTX 2080 series will be out in the next day or two. The Radeon RX 590 Linux review is also still to come once that graphics card is working appropriately with the driver stack.

The NVIDIA driver in use was 415.18 and all tests were done from the same Ubuntu 18.04 LTS box. All of these OpenGL and Vulkan Linux benchmarks were carried out in a fully-automated and reproducible manner using the open-source Phoronix Test Suite benchmarking software.

Reworked STIBP Code Lands In Linux 4.20 To Fix The Performance


The big Linux 4.20 performance slowdown is now corrected by tonight’s Linux 4.20 Git code while still providing reasonable security for cross-hyperthread Spectre V2 mitigation.

Spectre/Meltdown kernel patch wrangler Thomas Gleixner sent in his patch series this afternoon with a subject line of “Cure the STIBP fallout” and started the message with, “The performance destruction department finally got it’s act together and came up with a cure for the STIPB regression.” That cure is the reworked code around “Single Thread Indirect Branch Predictors.”

STIBP had been enabled for all processes at the start of the Linux 4.20 kernel merge window, which was a wreck for performance across many workloads, a problem Phoronix was first to shine the light on. By default the kernel now applies STIBP only to processes that opt into that functionality via the prctl interface, and additionally to processes sandboxed by means of SECCOMP.

I’ve tested these patches and they indeed return Linux 4.20 to performing appropriately. For more details on the background to these patches, the new tunables, and the performance change, see my recent article: Benchmarking The Work-In-Progress Spectre/STIBP Code On The Way For Linux 4.20.
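
As a rough sketch of what opting in via prctl can look like from user space: the constants below are the ones the kernel exposes in its speculation-control prctl interface as I understand it (`PR_SET_SPECULATION_CTRL` with `PR_SPEC_INDIRECT_BRANCH`), with fallback definitions in case the installed headers predate Linux 4.20. The helper function names are hypothetical. Note the inverted naming: `PR_SPEC_DISABLE` disables the speculation feature, which means the mitigation is turned on for the calling task.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values follow the kernel's uapi prctl.h. */
#ifndef PR_SET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL 52
#define PR_SET_SPECULATION_CTRL 53
#endif
#ifndef PR_SPEC_INDIRECT_BRANCH
#define PR_SPEC_INDIRECT_BRANCH 1
#endif
#ifndef PR_SPEC_DISABLE
#define PR_SPEC_PRCTL         (1UL << 0)
#define PR_SPEC_ENABLE        (1UL << 1)
#define PR_SPEC_DISABLE       (1UL << 2)
#define PR_SPEC_FORCE_DISABLE (1UL << 3)
#endif

/*
 * Ask the kernel to restrict indirect branch speculation for this task.
 * Returns 0 on success or a negative errno value, for example when the
 * running kernel or CPU does not support per-task control.
 */
int request_branch_protection(void)
{
    if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
              PR_SPEC_DISABLE, 0, 0) != 0)
        return -errno;
    return 0;
}

/* Query the current per-task state; negative on error. */
long query_branch_protection(void)
{
    return prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
}

/* Decode a queried state: nonzero when the mitigation is active. */
int branch_protection_active(long state)
{
    if (state < 0)
        return 0;
    return ((unsigned long)state &
            (PR_SPEC_DISABLE | PR_SPEC_FORCE_DISABLE)) != 0;
}
```

A security-sensitive program would call `request_branch_protection()` early in startup and treat a negative return as "not available here" rather than as a fatal error, since support depends on the kernel version, boot options, and CPU.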

Linus Torvalds quickly honored the pull request and the code is now in Git. The code is in place in time for tomorrow’s Linux 4.20-rc5 kernel to offer much better performance.

STIBP had been back-ported to the Linux stable branches only to be reverted due to the performance fallout. We’ll see how quickly this revised STIBP implementation now gets brought back to the stable series for cross-hyperthread Spectre V2 protection for processes needing it.