If your machine has a discrete accelerated GPU, then you'll probably need to set environment variables that force Mesa onto the software renderer in order to test LLVMpipe.
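A minimal sketch using Mesa's standard software-rendering variables; after setting these, `glxinfo | grep renderer` should report llvmpipe:

```shell
# Force Mesa to ignore hardware acceleration and use a software renderer
export LIBGL_ALWAYS_SOFTWARE=1
# Explicitly select the LLVMpipe Gallium driver
export GALLIUM_DRIVER=llvmpipe
echo "renderer forced to $GALLIUM_DRIVER"
```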
LLVMpipe is limited to 16 threads. The only easily findable justification for this limit is a commit from 2013, which raised it from 8 after a user reported on a mailing list that 16 was faster for them. Given that POWER9 systems often have far more hardware threads than this, the limit may be suboptimal for POWER9.
luke-jr anecdotally reports that bumping the limit to 128 threads noticeably improved performance in 3D games, e.g. Jedi Academy became playable. However, it also had the side effect that most GUI applications spawned 128 LLVMpipe threads, which was annoying in top. Luke worked around this issue by setting the environment variable LP_NUM_THREADS=2 globally, and overriding it for 3D applications that needed more threads. Luke's patch is:
```diff
diff -ur mesa-17.3.9.orig/src/gallium/drivers/llvmpipe/lp_limits.h mesa-17.3.9/src/gallium/drivers/llvmpipe/lp_limits.h
--- mesa-17.3.9.orig/src/gallium/drivers/llvmpipe/lp_limits.h	2018-04-18 04:44:00.000000000 -0400
+++ mesa-17.3.9/src/gallium/drivers/llvmpipe/lp_limits.h	2018-05-02 05:20:57.586000000 -0400
@@ -61,7 +61,7 @@
 #define LP_MAX_WIDTH (1 << (LP_MAX_TEXTURE_LEVELS - 1))
 
-#define LP_MAX_THREADS 16
+#define LP_MAX_THREADS 128
 
 /**
Only in mesa-17.3.9/src/gallium/drivers/llvmpipe: lp_limits.h~
```
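The global-default-plus-override scheme itself needs no rebuild: LLVMpipe reads the LP_NUM_THREADS environment variable at startup, and the patch only raises the compiled-in ceiling the variable is clamped to. A sketch of the workaround:

```shell
# Global default: keep ordinary GUI applications at 2 LLVMpipe threads
# (in practice, set in /etc/environment or a shell profile)
export LP_NUM_THREADS=2

# Per-command override: the prefixed assignment applies only to that process
LP_NUM_THREADS=32 sh -c 'echo "the game sees LP_NUM_THREADS=$LP_NUM_THREADS"'
echo "the rest of the session still sees LP_NUM_THREADS=$LP_NUM_THREADS"
```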
JeremyRand benchmarked OpenArena with LLVMpipe as per these instructions, and found that for 1920x1200 resolution on a 2x 4-core Talos II running Debian Bullseye, the following Mesa patch improved performance:
```diff
diff -ur stock/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h
--- stock/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h	2021-03-24 14:10:48.744070300 -0500
+++ 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h	2022-07-02 04:08:46.880000000 -0500
@@ -66,7 +66,9 @@
 #define LP_MAX_SAMPLES 4
 
-#define LP_MAX_THREADS 16
+// Bumped by Jeremy
+//#define LP_MAX_THREADS 16
+#define LP_MAX_THREADS 32
 
 /**
```
Before the 32-thread patch:
```
3398 frames 487.6 seconds 7.0 fps 83.0/143.5/7186.0/25.1 ms
3398 frames 472.7 seconds 7.2 fps 86.0/139.1/2242.0/22.9 ms
3398 frames 466.5 seconds 7.3 fps 86.0/137.3/943.0/21.1 ms
3398 frames 467.5 seconds 7.3 fps 83.0/137.6/846.0/22.4 ms
3398 frames 474.8 seconds 7.2 fps 86.0/139.7/779.0/22.8 ms
```
After the 32-thread patch:
```
3398 frames 417.7 seconds 8.1 fps 77.0/122.9/1748.0/18.9 ms
3398 frames 417.9 seconds 8.1 fps 76.0/123.0/997.0/19.6 ms
3398 frames 419.9 seconds 8.1 fps 76.0/123.6/806.0/19.8 ms
3398 frames 422.7 seconds 8.0 fps 75.0/124.4/758.0/21.1 ms
3398 frames 419.1 seconds 8.1 fps 75.0/123.3/730.0/20.4 ms
```
It would be desirable to get similar benchmark numbers from users with more than 2x 4-cores, to see what improvement is yielded by higher thread counts than 32.
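For anyone contributing such numbers, it helps to report the machine's hardware thread count alongside the fps figures; for example:

```shell
# Report the number of hardware threads visible to the OS
nproc --all
# On POWER systems, lscpu additionally shows the SMT configuration
# (guarded in case lscpu is not installed)
lscpu | grep -E 'Thread|Core|Socket' || true
```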
As of 2022 June 10, grepping the LLVMpipe source for altivec yields only 2 files (one of them lp_setup_tri.c; both are POWER8 LE code paths), while grepping for PIPE_ARCH_SSE yields 11 files. This suggests that many POWER vector optimizations are missing from LLVMpipe. POWER9 vector optimizations (the LLVM power9-vector feature) and POWER8 BE optimizations appear to be completely absent.
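The counts can be reproduced against any Mesa checkout; a sketch, where MESA_SRC is an assumed path to an unpacked source tree (counts will differ between Mesa versions):

```shell
# Path to an unpacked Mesa source tree (assumption; adjust as needed)
MESA_SRC=${MESA_SRC:-mesa-20.3.5}
DRV="$MESA_SRC/src/gallium/drivers/llvmpipe"

# Count files with POWER Altivec usage vs. files with x86 SSE usage
altivec_files=$(grep -rli altivec "$DRV" 2>/dev/null | wc -l)
sse_files=$(grep -rl PIPE_ARCH_SSE "$DRV" 2>/dev/null | wc -l)
echo "altivec: $altivec_files file(s); PIPE_ARCH_SSE: $sse_files file(s)"
```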
JeremyRand enabled build-time SSE intrinsics translation in the calc_fixed_position function in lp_setup_tri.c (which is the only function that uses SSE in Debian Bullseye but is missing an Altivec implementation in the main branch; the other SSE-utilizing functions were added to main after Debian Bullseye) via this patch:
```diff
diff -ur 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_setup_tri.c 32-threads-simd-wip2/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_setup_tri.c
--- 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_setup_tri.c	2021-03-24 14:10:48.746070400 -0500
+++ 32-threads-simd-wip2/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_setup_tri.c	2022-07-02 05:22:44.160000000 -0500
@@ -49,9 +49,14 @@
 #elif defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN
 #include <altivec.h>
 #include "util/u_pwr8.h"
+// Emulate SSE
+#define NO_WARN_X86_INTRINSICS
+#define __m128i __x86__m128i
+#include <emmintrin.h>
+#undef __m128i
 #endif
 
-#if !defined(PIPE_ARCH_SSE)
+#if !(defined(PIPE_ARCH_SSE) || (defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN))
 
 static inline int
 subpixel_snap(float a)
@@ -1032,7 +1037,10 @@
     * otherwise nearest/away-from-zero).
     * Both should be acceptable, I think.
     */
-#if defined(PIPE_ARCH_SSE)
+#if defined(PIPE_ARCH_SSE) || (defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN)
+   #if defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN
+   #define __m128i __x86__m128i
+   #endif
    __m128 v0r, v1r;
    __m128 vxy0xy2, vxy1xy0;
    __m128i vxy0xy2i, vxy1xy0i;
@@ -1061,6 +1069,9 @@
    y0120 = _mm_unpackhi_epi32(x0x2y0y2, x1x0y1y0);
    _mm_store_si128((__m128i *)&position->x, x0120);
    _mm_store_si128((__m128i *)&position->y, y0120);
+   #if defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN
+   #undef __m128i
+   #endif
 #else
    position->x = subpixel_snap(v0 - pixel_offset);
```
When combined with the 32-thread patch from the above section, the following benchmarks in OpenArena were obtained:
```
3398 frames 418.5 seconds 8.1 fps 76.0/123.2/865.0/19.7 ms
3398 frames 414.7 seconds 8.2 fps 74.0/122.1/805.0/19.7 ms
3398 frames 418.8 seconds 8.1 fps 74.0/123.2/701.0/20.0 ms
3398 frames 424.0 seconds 8.0 fps 77.0/124.8/524.0/21.7 ms
3398 frames 414.1 seconds 8.2 fps 74.0/121.9/621.0/19.2 ms
```
This seems like an improvement, though quite a small one.