Tag Archives: Performance

Linux 5.3 Could Finally See FSGSBASE – Performance Improvements Back To Ivybridge



The FSGSBASE instructions have been present on Intel processors going back to Ivy Bridge, and while Linux kernel patches for this feature have been around for years, it looks like the support is finally set to be merged for the Linux 5.3 kernel cycle. Making us eager for this support is the prospect of better performance, especially for context-switching-heavy workloads that have already been suffering as a result of recent CPU mitigations.

The FSGSBASE instructions allow for reading/writing the FS/GS base registers from any privilege level. The short story is that there should be performance benefits from FSGSBASE in context switching thanks to skipping an MSR write for GSBASE. User-space programs like Java are also expected to benefit by being able to avoid system calls when modifying the FS/GS base.
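For reference, compilers already expose intrinsics that map straight onto these instructions. Below is a minimal sketch, assuming a CPU with FSGSBASE, a kernel that has enabled CR4.FSGSBASE, and GCC or Clang invoked with -mfsgsbase; without the feature enabled, the same operation would have to go through the arch_prctl() system call instead.

/* Minimal user-space sketch of the FSGSBASE instructions via compiler
 * intrinsics. Requires building with -mfsgsbase and a kernel that has
 * enabled CR4.FSGSBASE; otherwise RDFSBASE/WRFSBASE fault with SIGILL. */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* RDFSBASE: read the FS base directly, no system call involved. */
    uint64_t fs = _readfsbase_u64();
    printf("current FS base: 0x%llx\n", (unsigned long long)fs);

    /* WRFSBASE: write it straight back, again without entering the kernel. */
    _writefsbase_u64(fs);
    return 0;
}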

Among the reasons the code has been delayed in previous years is that user-space can do stupid stuff with the new instructions: “The major disadvantage is that user code can use the new instructions. Now userspace is going to do totally stupid shite like writing some nonzero value to GS and then doing WRGSBASE or like linking some idiotic library that uses WRGSBASE into a perfectly innocent program like dosemu2 and resulting in utterly nonsensical descriptor state.”

Considering all the performance hits we’ve seen over the past year and a half from the likes of Meltdown and Zombieload, hearing of better context switching performance from an instruction set present since Ivy Bridge is certainly promising.

The FSGSBASE patches have been revised over the years on the kernel mailing list, and they have now landed in the WIP.x86/cpu branch maintained by Thomas Gleixner. Given this milestone, it’s looking quite likely we’ll see this x86 CPU improvement land with the upcoming Linux 5.3 merge window, barring any last-minute objections. That next cycle is kicking off in early July.

Those wishing to learn more about the technical details can see the new documentation.


Mesa 19.2’s Virgl Sees Huge Performance Win Around Buffer Copy Transfers



For those using Virgl to enjoy Gallium3D-based OpenGL acceleration within guest virtual machines on Linux, the Mesa 19.2 release paired with the latest virglrenderer library should provide a very significant speed-up.

The virglrenderer code picked up support for copy transfers last month so the guest can avoid waiting when it needs to write to a busy resource. Alexandros Frantzis of Collabora, who landed the virglrenderer work, has now seen his Mesa-side Virgl code merged to Mesa 19.2 Git.

Being able to avoid those waits by writing into a staging buffer range that is guaranteed never to be busy provides a big performance advantage. Alexandros found one Steam Play Proton game (Twilight Struggle) that had been running at about 7 FPS now running at 25 FPS with this optimization.

As another example, the OpenGL glmark2 basic test was running at 38 FPS but now runs at 331 FPS with this buffer copy transfer work.
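To illustrate the general idea behind such copy transfers, here is a rough OpenGL sketch; this is not the actual virglrenderer code and the function and buffer names are hypothetical. Instead of mapping a buffer the GPU may still be reading (and stalling until it goes idle), the data is written into a freshly allocated staging buffer and copied over on the GPU side.

/* Rough sketch of the staging-copy idea (illustrative only, not virglrenderer
 * code): write into a never-busy staging buffer and let the GPU copy the data
 * into the real buffer, so the CPU never waits on the busy resource. */
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>
#include <string.h>

void update_busy_buffer(GLuint vbo, const void *data, GLsizeiptr size)
{
    GLuint staging;
    glGenBuffers(1, &staging);
    glBindBuffer(GL_COPY_READ_BUFFER, staging);
    glBufferData(GL_COPY_READ_BUFFER, size, NULL, GL_STREAM_DRAW);

    /* The staging buffer was just created, so mapping it never stalls. */
    void *ptr = glMapBufferRange(GL_COPY_READ_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    memcpy(ptr, data, size);
    glUnmapBuffer(GL_COPY_READ_BUFFER);

    /* GPU-side copy into the real (possibly busy) buffer. */
    glBindBuffer(GL_COPY_WRITE_BUFFER, vbo);
    glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, size);

    /* Safe to delete; the GL implementation keeps the staging buffer alive
     * until the pending copy has completed. */
    glDeleteBuffers(1, &staging);
}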

The work makes for a damn fine addition in Mesa 19.2 for anyone leveraging Virgl.


GNOME Inching Closer To Better Wayland Multi-Monitor Performance



One of my biggest personal issues with using the GNOME Shell on Wayland has been the sluggish multi-monitor performance when driving dual 4K displays on my main workstation. Fortunately, GNOME is moving closer to resolving the fundamental issue, and that could possibly happen within the current GNOME 3.34 cycle.

GNOME had been terribly slow in my own experience with various multi-head setups when running on Wayland, while under X11 the performance has been great. There have been some improvements over time that have made the experience more fluid, but those have come in step with general GNOME Shell / Mutter Wayland performance enhancements and other work. Fortunately, prolific GNOME developer Daniel van Vugt at Canonical has been revisiting some of his open merge requests.

On top of seeing through the work last week to avoid frame skipping and lower X11 output lag, he was also able to merge one of his prerequisite patches for ultimately improving the multi-monitor Wayland experience.

Merged was the reference counting of front buffers, a merge request that had been open for a year and that is important for future work even though it isn’t useful on its own today.
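As a rough sketch of why that matters: with reference counting, the compositor can drop its reference to a front buffer while the display hardware still holds one, and the buffer is only released once both sides are done. The types and function names below are purely illustrative, not Mutter’s actual code.

/* Illustrative front-buffer reference counting (not Mutter's API). */
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    atomic_int refcount;   /* one reference per user: compositor, scanout, ... */
    /* ... GPU buffer handle, stride, size ... */
} front_buffer;

static front_buffer *front_buffer_ref(front_buffer *fb)
{
    atomic_fetch_add(&fb->refcount, 1);
    return fb;
}

static void front_buffer_unref(front_buffer *fb)
{
    /* Only the last holder, e.g. scanout after the next page flip, frees it. */
    if (atomic_fetch_sub(&fb->refcount, 1) == 1)
        free(fb);
}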

That reference counting is a stepping stone towards resolving Mutter’s big issue #3 from last year. That ticket concerns the poor multi-monitor Wayland performance and the finding that GNOME Shell spends half of its time within meta_monitor_manager_kms_wait_for_flip when running two monitors. “With two monitors, gnome-shell spends around half of its (real) time blocked in meta_monitor_manager_kms_wait_for_flip…The problem is that with two separate displays they’re never going to be in phase with each other. Even if they’re the same frequency you could spend most of your time waiting for the condition of zero flips pending. Or 50% on average…I expect this bug explains multiple previous bug reports people have made about multi-monitor performance in Wayland sessions. Particularly when a busy or dragged window overlaps multiple monitors.”

Now let’s hope that issue gets resolved by GNOME 3.34 in September.


Core i9 7980XE GCC 9 AVX Compiler Tuning Performance Benchmarks



Continuing with our benchmarks this month of the newly released GCC 9 compiler, here are some additional numbers for the AVX-512-enabled Intel Core i9 7980XE processor on Ubuntu Linux when testing tuning for various AVX vector widths.

These latest Intel Core i9 benchmarks of the freshly released GCC 9 compiler show the performance when running various C/C++ benchmarks after being built with CFLAGS/CXXFLAGS of “-O3 -march=skylake”, “-O3 -march=skylake-avx512 -mprefer-vector-width=128”, “-O3 -march=skylake-avx512 -mprefer-vector-width=256”, and “-O3 -march=skylake-avx512 -mprefer-vector-width=512”.
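As a quick illustration of what these flags actually steer, take a trivially vectorizable loop (the function below is just a made-up example): building it with the different -mprefer-vector-width values has GCC emitting SSE (xmm), AVX/AVX2 (ymm), or AVX-512 (zmm) code for the same source.

/* saxpy.c - a loop the auto-vectorizer targets with different vector widths.
 *
 *   gcc -O3 -march=skylake                                   -c saxpy.c
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=128  -c saxpy.c
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256  -c saxpy.c
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512  -c saxpy.c
 *
 * Disassembling the result (objdump -d saxpy.o) shows xmm, ymm, or zmm
 * registers being used depending on the preferred vector width. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}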

A wide assortment of compiler benchmarks were run via the Phoronix Test Suite.

Looking at dozens of benchmarks overall, using AVX-512 most often didn’t lead to the best results, due largely to the generally lower clock speeds when running AVX-512 code; the AVX/AVX2 targets still performed well, although it was a close call overall.

FFTW shows one of the larger impacts between runs with varying AVX widths.

In the case of Botan, the conventional Skylake target performed the best.

See more via this OpenBenchmarking.org result file for all the individual tests in full.


DragonFlyBSD Is Seeing Better Performance Following A Big VM Rework



DragonFlyBSD lead developer Matthew Dillon has been reworking the virtual memory (VM) infrastructure within their kernel and it’s leading to measurable performance improvements.

This mailing list post outlines the restructuring of the kernel’s VM pmap code, which can conserve memory, helps with processes sharing lots of memory, and enhances concurrent page fault performance. The performance bits are what we’re after, and they appear to be quite compelling, at least with Dillon’s testing so far on both big (Threadripper) and small (Raven Ridge) AMD test systems:

These changes significantly improve page fault performance, particularly under heavy concurrent loads.

* kernel overhead during the ‘synth everything’ bulk build is now under 15% system time. It used to be over 20%. (system time / (system time + user time)). Tested on the threadripper (32-core/64-thread).

* The heavy use of shared mmap()s across processes no longer multiplies the pv_entry use, saving a lot of memory. This can be particularly important for postgres.

* Concurrent page faults now have essentially no SMP lock contention and only four cache-line bounces for atomic ops per fault (something that we may now also be able to deal with with the new work as a basis).

* Zero-fill fault rate appears to max-out the CPU chip’s internal data busses, though there is still room for improvement. I top out at 6.4M zfod/sec (around 25 GBytes/sec worth of zero-fill faults) on the threadripper and I can’t seem to get it to go higher. Note that obviously there is a little more dynamic ram overhead than that from the executing kernel code, but still…

* Heavy concurrent exec rate on the TR (all 64 threads) for a shared dynamic binary increases from around 6000/sec to 45000/sec. This is actually important, because bulk builds…

* Heavy concurrent exec rate on the TR for independent static binaries now caps out at around 450000 execs per second. Which is an insanely high number.

* Single-threaded page fault rate is still a bit wonky but hit 500K-700K faults/sec (2-3 GBytes/sec).

Small system comparison using a Ryzen 2400G (4-core/8-thread), release vs master (this includes other work that has gone into master since the last release, too):

* Single threaded exec rate (shared dynamic binary) – 3180/sec to 3650/sec

* Single threaded exec rate (independent static binary) – 10307/sec to 12443/sec

* Concurrent exec rate (shared dynamic binary x 8) – 15160/sec to 19600/sec

* Concurrent exec rate (independent static binary x 8) – 60800/sec to 78900/sec

* Single threaded zero-fill fault rate – 550K zfod/sec -> 604K zfod/sec

* Concurrent zero-fill fault rate (8 threads) – 1.2M zfod/sec -> 1.7M zfod/sec

* make -j 16 buildkernel test (tmpfs /usr/src, tmpfs /usr/obj):

4.4% improvement in overall time on the first run (6.2% improvement on subsequent runs). system% 15.6% down to 11.2% of total cpu seconds. This is a kernel overhead reduction of 31%. Note that the increased time on release is probably due to inefficient buffer cache recycling.
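For a sense of what the zero-fill fault numbers above are measuring, a naive microbenchmark (this is only a sketch, not Dillon’s actual test) maps anonymous memory and touches one byte per page, so every touch takes a zero-fill-on-demand fault:

/* Naive zero-fill-on-demand (zfod) fault-rate sketch: touch one byte in each
 * page of a fresh anonymous mapping and divide by the elapsed time. */
#include <sys/mman.h>
#include <stdio.h>
#include <time.h>

#define NPAGES  (1UL << 18)     /* 256K pages = 1 GB with 4 KB pages */
#define PAGESZ  4096UL

int main(void)
{
    struct timespec t0, t1;

    volatile char *mem = mmap(NULL, NPAGES * PAGESZ, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANON, -1, 0);
    if (mem == MAP_FAILED)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < NPAGES; i++)
        mem[i * PAGESZ] = 1;    /* each write faults in a zero-filled page */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f zfod faults/sec\n", NPAGES / secs);
    return 0;
}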

DragonFlyBSD appears on track for a great 2019, with other recent accomplishments including prompt handling of the MDS/Zombieload mess, DRM code updates, HAMMER2 improvements, flipping on compiler-based Retpoline support, and FUSE work, among other coding activities.