Difference between revisions of "Porting/LLVMpipe"

From RCS Wiki

Latest revision as of 16:53, 9 July 2022

Testing Notes

If your machine has a discrete accelerated GPU, then you'll probably need to set these environment variables in order to test LLVMpipe:

export LIBGL_ALWAYS_SOFTWARE=true
export GALLIUM_DRIVER=llvmpipe

Potentially useful benchmarking tools:

* glmark2 (https://github.com/glmark2/glmark2): seems not very useful for noticing small LLVMpipe optimizations.
* OpenArena (https://forum.khadas.com/t/vim3-gaming-with-panfrost/11636/4): seems to work okay for testing LLVMpipe optimizations. To run the benchmark: openarena +timedemo 1 +cg_drawfps 1 +demo demo088-test1.dm_71
* Xonotic (see Porting/Xonotic#Benchmarking): not tried yet.
* List of desktop graphics tests run by Phoronix: https://github.com/phoronix-test-suite/phoronix-test-suite/blob/master/ob-cache/test-suites/pts/desktop-graphics-1.3.0/suite-definition.xml

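If you would rather not export the LLVMpipe variables for the whole session, a small wrapper can scope them to a single command. This is only a sketch: with_llvmpipe is a hypothetical helper name, not something shipped by Mesa.

```shell
# Run one command with LLVMpipe forced, without exporting the variables
# for the rest of the shell session.
# with_llvmpipe is a hypothetical helper name, not part of Mesa.
with_llvmpipe() {
    env LIBGL_ALWAYS_SOFTWARE=true GALLIUM_DRIVER=llvmpipe "$@"
}
```

For example, with_llvmpipe openarena +timedemo 1 +cg_drawfps 1 +demo demo088-test1.dm_71 runs the OpenArena benchmark under LLVMpipe while leaving other programs on the hardware driver. Running with_llvmpipe glxinfo -B first should show llvmpipe in the renderer string if the override took effect.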
Thread Count

LLVMpipe is limited to 16 threads. The only easily findable justification for this limit is in a commit from 2013, where it was increased from 8 because a user reported on a mailing list that 16 was faster for them. Given that POWER9 systems often have much higher thread counts than this, this limit may be suboptimal for POWER9.

luke-jr reports that raising the limit to 128 threads noticeably improved performance in 3D games. For example, Jedi Academy (Detail set to High, Texture to Very High, Texture Filter BILINEAR, Detailed Shaders ON, Video Sync OFF) went from ~15 fps with 16 threads to ~25 fps with 64 threads on a 2x 8-core Talos II. As a side effect, most GUI applications also spawned more LLVMpipe threads, which was annoying in gdb and top. Luke worked around this by setting the environment variable LP_NUM_THREADS=2 globally and overriding it for 3D applications that needed more threads. Luke's patch is:

diff -ur mesa-17.3.9.orig/src/gallium/drivers/llvmpipe/lp_limits.h mesa-17.3.9/src/gallium/drivers/llvmpipe/lp_limits.h
--- mesa-17.3.9.orig/src/gallium/drivers/llvmpipe/lp_limits.h	2018-04-18 04:44:00.000000000 -0400
+++ mesa-17.3.9/src/gallium/drivers/llvmpipe/lp_limits.h	2018-05-02 05:20:57.586000000 -0400
@@ -61,7 +61,7 @@
 #define LP_MAX_WIDTH  (1 << (LP_MAX_TEXTURE_LEVELS - 1))
 
 
-#define LP_MAX_THREADS 16
+#define LP_MAX_THREADS 128
 
 
 /**
Only in mesa-17.3.9/src/gallium/drivers/llvmpipe: lp_limits.h~
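The workaround described above (a low global LP_NUM_THREADS, raised only for 3D applications) could be set up roughly as follows. This is a sketch under assumptions: with_lp_threads is a hypothetical helper name, and where the global export lives (for example a shell profile) depends on your setup.

```shell
# Keep ordinary GUI programs at a small LLVMpipe thread pool.
export LP_NUM_THREADS=2

# Run one command with a larger pool. with_lp_threads is a hypothetical
# helper name; the first argument is the thread count.
with_lp_threads() {
    count=$1
    shift
    env LP_NUM_THREADS="$count" "$@"
}
```

A game would then be started as, for example, with_lp_threads 64 openarena. As far as we understand, Mesa clamps the requested value to the compiled-in LP_MAX_THREADS, so asking for more than 16 threads only has an effect on a Mesa build with the limit raised.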

JeremyRand benchmarked OpenArena with LLVMpipe, and found that for 1920x1200 resolution on a 2x 4-core Talos II running Debian Bullseye, the following Mesa patch improved performance:

diff -ur stock/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h
--- stock/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h  2021-03-24 14:10:48.744070300 -0500
+++ 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h     2022-07-02 04:08:46.880000000 -0500
@@ -66,7 +66,9 @@
 
 #define LP_MAX_SAMPLES 4
 
-#define LP_MAX_THREADS 16
+// Bumped by Jeremy
+//#define LP_MAX_THREADS 16
+#define LP_MAX_THREADS 32
 
 
 /**

Before the 32-thread patch:

3398 frames 487.6 seconds 7.0 fps 83.0/143.5/7186.0/25.1 ms
3398 frames 472.7 seconds 7.2 fps 86.0/139.1/2242.0/22.9 ms
3398 frames 466.5 seconds 7.3 fps 86.0/137.3/943.0/21.1 ms
3398 frames 467.5 seconds 7.3 fps 83.0/137.6/846.0/22.4 ms
3398 frames 474.8 seconds 7.2 fps 86.0/139.7/779.0/22.8 ms

After the 32-thread patch:

3398 frames 417.7 seconds 8.1 fps 77.0/122.9/1748.0/18.9 ms
3398 frames 417.9 seconds 8.1 fps 76.0/123.0/997.0/19.6 ms
3398 frames 419.9 seconds 8.1 fps 76.0/123.6/806.0/19.8 ms
3398 frames 422.7 seconds 8.0 fps 75.0/124.4/758.0/21.1 ms
3398 frames 419.1 seconds 8.1 fps 75.0/123.3/730.0/20.4 ms
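
Since each configuration was run several times, it helps to reduce the runs to a mean before comparing. The following sketch averages timedemo result lines read from standard input; timedemo_mean is a hypothetical helper name, and the parsing assumes the "N frames S seconds F fps ..." layout shown above.

```shell
# Summarise OpenArena timedemo result lines read from standard input.
# Works on bare lines and on lines with a leading label such as "16 threads:".
# timedemo_mean is a hypothetical helper name.
timedemo_mean() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                # The number precedes its unit: "487.6 seconds", "7.0 fps".
                if ($i == "seconds") { secs += $(i - 1); runs++ }
                if ($i == "fps")     { fps  += $(i - 1) }
            }
        }
        END {
            if (runs > 0)
                printf "%d runs, mean %.2f s, mean %.2f fps\n", runs, secs / runs, fps / runs
        }
    '
}
```

Fed the five runs from before the patch, this prints "5 runs, mean 473.82 s, mean 7.20 fps"; for the five runs after the patch it prints "5 runs, mean 419.46 s, mean 8.08 fps", which is about 11% less run time.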

Nashimus tested #define LP_MAX_THREADS 144 on a 2x 18-core Talos II, with the following results:

16 threads:                 3398 frames 312.5 seconds 10.9 fps 55.0/92.0/8375.0/21.2 ms
64 threads:                 3398 frames 221.5 seconds 15.3 fps 33.0/65.2/8390.0/21.4 ms
144 threads:                3398 frames 208.5 seconds 16.3 fps 30.0/61.4/8364.0/21.8 ms

It would be desirable to compare 32 threads to 64 threads on the same setup so that Jeremy's results and Nashimus's results can be more directly compared.
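
To see how the gain tails off as the thread count rises, the labelled results above can be turned into speedups relative to the first line. This is a sketch only; timedemo_speedup is a hypothetical helper name, and it assumes each line starts with the thread count.

```shell
# Print each run's speedup relative to the first line read.
# Input lines look like "64 threads: 3398 frames 221.5 seconds ...".
# timedemo_speedup is a hypothetical helper name.
timedemo_speedup() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                if ($i == "seconds") {
                    # The first run seen becomes the baseline.
                    if (base == "") base = $(i - 1)
                    printf "%s threads: %.2fx\n", $1, base / $(i - 1)
                }
            }
        }
    '
}
```

On Nashimus's three results this gives 1.00x, 1.41x and 1.50x for 16, 64 and 144 threads: most of the gain arrives by 64 threads, and going from 64 to 144 threads shortens the run by only about 6% more.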

Vector Optimizations

As of 2022 June 10, grepping main branch src/gallium/drivers/llvmpipe/ for altivec yields only 2 files (lp_rast_tri.c and lp_setup_tri.c, both of which are POWER8 LE), while grepping for PIPE_ARCH_SSE yields 11 files. This seems to suggest that a lot of POWER vector optimizations are missing from LLVMpipe. POWER9 vector optimizations (the LLVM power9-vector feature), and POWER8 BE optimizations, appear to be completely absent.

JeremyRand enabled build-time SSE intrinsics translation for the calc_fixed_position function in lp_setup_tri.c. It is the only function that uses SSE in Debian Bullseye's Mesa and still lacks an Altivec implementation in the main branch; the other SSE-utilizing functions were added to main after Debian Bullseye. The patch is:

diff -ur 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_setup_tri.c 32-threads-simd-wip2/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_setup_tri.c
--- 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_setup_tri.c  2021-03-24 14:10:48.746070400 -0500
+++ 32-threads-simd-wip2/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_setup_tri.c        2022-07-02 05:22:44.160000000 -0500
@@ -49,9 +49,14 @@
 #elif defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN
 #include <altivec.h>
 #include "util/u_pwr8.h"
+// Emulate SSE
+#define NO_WARN_X86_INTRINSICS
+#define __m128i __x86__m128i
+#include <emmintrin.h>
+#undef __m128i
 #endif
 
-#if !defined(PIPE_ARCH_SSE)
+#if !(defined(PIPE_ARCH_SSE) || (defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN))
 
 static inline int
 subpixel_snap(float a)
@@ -1032,7 +1037,10 @@
     * otherwise nearest/away-from-zero).
     * Both should be acceptable, I think.
     */
-#if defined(PIPE_ARCH_SSE)
+#if defined(PIPE_ARCH_SSE) || (defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN)
+   #if defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN
+      #define __m128i __x86__m128i
+   #endif
    __m128 v0r, v1r;
    __m128 vxy0xy2, vxy1xy0;
    __m128i vxy0xy2i, vxy1xy0i;
@@ -1061,6 +1069,9 @@
    y0120 = _mm_unpackhi_epi32(x0x2y0y2, x1x0y1y0);
    _mm_store_si128((__m128i *)&position->x[0], x0120);
    _mm_store_si128((__m128i *)&position->y[0], y0120);
+   #if defined(_ARCH_PWR8) && UTIL_ARCH_LITTLE_ENDIAN
+      #undef __m128i
+   #endif
 
 #else
    position->x[0] = subpixel_snap(v0[0][0] - pixel_offset);
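
The new branches in this patch are only compiled when the compiler defines _ARCH_PWR8 and Mesa's UTIL_ARCH_LITTLE_ENDIAN is true, so it is worth checking the compiler's predefined macros before benchmarking. The sketch below inspects a macro dump as produced by "cc -dM -E"; macro_defined is a hypothetical helper name.

```shell
# Succeed if the macro named by $1 is defined in a compiler macro dump
# read from standard input (the output of "cc -dM -E").
# macro_defined is a hypothetical helper name.
macro_defined() {
    awk -v name="$1" '
        $1 == "#define" && $2 == name { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

For example, cc -dM -E -x c /dev/null | macro_defined _ARCH_PWR8 && echo "POWER8 path enabled" reports whether the default compiler flags select the patched code path. The result depends on the -mcpu setting in effect, so pass the same flags that the Mesa build uses.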

When combined with the 32-thread patch from the above section, Jeremy obtained the following benchmarks in OpenArena:

32 threads without SIMD patch:

3398 frames 417.7 seconds 8.1 fps 77.0/122.9/1748.0/18.9 ms
3398 frames 417.9 seconds 8.1 fps 76.0/123.0/997.0/19.6 ms
3398 frames 419.9 seconds 8.1 fps 76.0/123.6/806.0/19.8 ms
3398 frames 422.7 seconds 8.0 fps 75.0/124.4/758.0/21.1 ms
3398 frames 419.1 seconds 8.1 fps 75.0/123.3/730.0/20.4 ms

32 threads with SIMD patch:

3398 frames 418.5 seconds 8.1 fps 76.0/123.2/865.0/19.7 ms
3398 frames 414.7 seconds 8.2 fps 74.0/122.1/805.0/19.7 ms
3398 frames 418.8 seconds 8.1 fps 74.0/123.2/701.0/20.0 ms
3398 frames 424.0 seconds 8.0 fps 77.0/124.8/524.0/21.7 ms
3398 frames 414.1 seconds 8.2 fps 74.0/121.9/621.0/19.2 ms

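This seems like an improvement, though a very small one: the mean run time drops from about 419.5 seconds to about 418.0 seconds (roughly 0.3%), which is smaller than the spread between individual runs in either set.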

Nashimus tested the above lp_setup_tri.c patch in OpenArena with different thread counts:

16 threads:                 3398 frames 312.5 seconds 10.9 fps 55.0/92.0/8375.0/21.2 ms
16 threads and SIMD patch:  3398 frames 309.1 seconds 11.0 fps 54.0/91.0/8421.0/20.9 ms
64 threads:                 3398 frames 221.5 seconds 15.3 fps 33.0/65.2/8390.0/21.4 ms
64 threads and SIMD patch:  3398 frames 222.3 seconds 15.3 fps 35.0/65.4/8464.0/21.4 ms
144 threads:                3398 frames 208.5 seconds 16.3 fps 30.0/61.4/8364.0/21.8 ms
144 threads and SIMD patch: 3398 frames 208.7 seconds 16.3 fps 30.0/61.4/8327.0/21.8 ms

It would be desirable to create and test similar patches for SSE-based non-Altivec functions added between Debian Bullseye and current main.

Nashimus tested the patch in File:LLVMpipe-SIMD-bookworm.patch in OpenArena with different thread counts, starting from Mesa commit b91971c2 on Ubuntu 22.04:

   32 threads (14.92 avg fps):
       3398 frames 224.7 seconds 15.1 fps 36.0/66.1/2483.0/14.4 ms
       3398 frames 229.7 seconds 14.8 fps 38.0/67.6/1533.0/15.1 ms
       3398 frames 227.6 seconds 14.9 fps 39.0/67.0/1522.0/13.9 ms
       3398 frames 227.9 seconds 14.9 fps 39.0/67.1/1533.0/14.1 ms
       3398 frames 227.7 seconds 14.9 fps 39.0/67.0/1551.0/14.2 ms
   
   32 threads and SIMD patch (14.84 avg fps):
       3398 frames 230.6 seconds 14.7 fps 40.0/67.9/2080.0/18.1 ms
       3398 frames 230.1 seconds 14.8 fps 39.0/67.7/1711.0/14.7 ms
       3398 frames 227.0 seconds 15.0 fps 38.0/66.8/1485.0/14.7 ms
       3398 frames 230.2 seconds 14.8 fps 36.0/67.7/1482.0/14.8 ms
       3398 frames 228.7 seconds 14.9 fps 40.0/67.3/1522.0/14.6 ms
   
   144 threads (17.68 avg fps):
       3398 frames 193.9 seconds 17.5 fps 31.0/57.1/2141.0/16.1 ms
       3398 frames 191.3 seconds 17.8 fps 30.0/56.3/1743.0/14.4 ms
       3398 frames 191.5 seconds 17.7 fps 30.0/56.4/1526.0/14.6 ms
       3398 frames 192.3 seconds 17.7 fps 30.0/56.6/1528.0/14.0 ms
       3398 frames 191.8 seconds 17.7 fps 29.0/56.4/1545.0/13.7 ms
   
   144 threads and SIMD patch (17.42 avg fps):
       3398 frames 206.8 seconds 16.4 fps 29.0/60.9/8731.0/21.7 ms
       3398 frames 195.5 seconds 17.4 fps 30.0/57.5/1710.0/14.9 ms
       3398 frames 190.6 seconds 17.8 fps 30.0/56.1/1538.0/14.6 ms
       3398 frames 191.6 seconds 17.7 fps 31.0/56.4/1587.0/14.5 ms
       3398 frames 191.4 seconds 17.8 fps 31.0/56.3/1510.0/13.6 ms
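
The per-group averages quoted above can be recomputed from the raw lines, which is handy when more runs are added. This is a sketch that assumes the layout used here (a header line ending in a colon, followed by its runs); timedemo_group_means is a hypothetical helper name.

```shell
# Recompute the mean fps of each labelled group of timedemo runs.
# A group starts with a header line ending in ":"; its runs follow it.
# timedemo_group_means is a hypothetical helper name.
timedemo_group_means() {
    awk '
        function emit() {
            if (runs > 0) printf "%s %.2f\n", label, fps / runs
            runs = 0; fps = 0
        }
        /:[ \t]*$/ {
            emit()
            label = $0
            sub(/^[ \t]+/, "", label)      # drop indentation
            sub(/ *\(.*$/, ":", label)     # drop the "(... avg fps):" suffix
            next
        }
        {
            for (i = 2; i <= NF; i++)
                if ($i == "fps") { fps += $(i - 1); runs++ }
        }
        END { emit() }
    '
}
```

On the data above this reproduces 14.92, 14.84, 17.68 and 17.42 fps. Note that the first "144 threads and SIMD patch" run (206.8 seconds, with an 8731 ms worst frame) is far slower than the other four; leaving it out, the remaining runs average about 17.7 fps, the same as the unpatched 144-thread result.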