Difference between revisions of "Porting/LLVMpipe"
JeremyRand (talk | contribs) (→Thread Count: Add chatlog) |
JeremyRand (talk | contribs) (→Other linear rasterizer functions: apitrace suppresses lp_linear) |
||
(18 intermediate revisions by the same user not shown) | |||
Line 8: | Line 8: | ||
export GALLIUM_DRIVER=llvmpipe | export GALLIUM_DRIVER=llvmpipe | ||
− | Potentially useful benchmarking tools | + | === Potentially useful benchmarking tools === |
* [https://github.com/glmark2/glmark2 glmark2] (seems to be not very useful at noticing small LLVMpipe optimizations) | * [https://github.com/glmark2/glmark2 glmark2] (seems to be not very useful at noticing small LLVMpipe optimizations) | ||
* [https://forum.khadas.com/t/vim3-gaming-with-panfrost/11636/4 OpenArena] (seems to work okay for testing LLVMpipe optimizations). | * [https://forum.khadas.com/t/vim3-gaming-with-panfrost/11636/4 OpenArena] (seems to work okay for testing LLVMpipe optimizations). | ||
− | ** Download [https://github.com/JeremyRand/llvmpipe-multithreaded-openarena-benchmark this benchmarking script], run | + | ** Download [https://github.com/JeremyRand/llvmpipe-multithreaded-openarena-benchmark this benchmarking script]. To compare FPS with different thread counts, run <code>./benchmark.sh</code>; to measure which functions are potential bottlenecks, run <code>./perf.sh</code>. |
* [[Porting/Xonotic#Benchmarking|Xonotic]] (haven't tried it yet) | * [[Porting/Xonotic#Benchmarking|Xonotic]] (haven't tried it yet) | ||
* [https://github.com/phoronix-test-suite/phoronix-test-suite/blob/master/ob-cache/test-suites/pts/desktop-graphics-1.3.0/suite-definition.xml list of desktop graphics tests run by Phoronix] | * [https://github.com/phoronix-test-suite/phoronix-test-suite/blob/master/ob-cache/test-suites/pts/desktop-graphics-1.3.0/suite-definition.xml list of desktop graphics tests run by Phoronix] | ||
* [[Games Compatibility|list of games that run on POWER9]] | * [[Games Compatibility|list of games that run on POWER9]] | ||
+ | * [https://docs.mesa3d.org/ci/index.html#application-traces-replay Application traces replay] | ||
+ | ** [https://docs.mesa3d.org/ci/local-traces.html Running traces on a local machine] | ||
+ | ** [https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/llvmpipe/ci/traces-llvmpipe.yml CI traces for LLVMpipe] | ||
+ | |||
+ | === Building patched Mesa from source === | ||
+ | |||
+ | On Debian (might also work for derivatives such as Devuan and Ubuntu): | ||
+ | |||
+ | * <code>apt source mesa</code> | ||
+ | * <code>cd</code> to the source directory that was created by <code>apt source</code>. | ||
+ | * Install <code>build-essential</code>, and all packages listed in <code>Build-Depends</code> and <code>Build-Depends-indep</code> fields of <code>debian/control</code>. | ||
+ | * Apply whatever patches you like to the source. | ||
+ | * Add an entry to the top of <code>debian/changelog</code> with a new version number (incrementing the last number is an okay approach), so that <code>apt install</code> will know it's a new version. | ||
+ | * <code>dpkg-buildpackage -us -uc</code> | ||
+ | * <code>sudo apt install ../*.deb</code> | ||
== Thread Count == | == Thread Count == | ||
[https://docs.mesa3d.org/drivers/llvmpipe.html LLVMpipe] is [https://gitlab.freedesktop.org/mesa/mesa/-/blob/19682028eb0a2143c18ab2a26f3b23b7f74b2335/src/gallium/drivers/llvmpipe/lp_limits.h#L69 limited to 16 threads]. The only easily findable justification for this limit is in [https://gitlab.freedesktop.org/mesa/mesa/-/commit/38a751cbe85b7e31925931dc4994e7def5e5af96 a commit from 2013], where it was increased from 8 because a user reported on a mailing list that 16 was faster for them. Given that POWER9 systems often have much higher thread counts than this, this limit may be suboptimal for POWER9. | [https://docs.mesa3d.org/drivers/llvmpipe.html LLVMpipe] is [https://gitlab.freedesktop.org/mesa/mesa/-/blob/19682028eb0a2143c18ab2a26f3b23b7f74b2335/src/gallium/drivers/llvmpipe/lp_limits.h#L69 limited to 16 threads]. The only easily findable justification for this limit is in [https://gitlab.freedesktop.org/mesa/mesa/-/commit/38a751cbe85b7e31925931dc4994e7def5e5af96 a commit from 2013], where it was increased from 8 because a user reported on a mailing list that 16 was faster for them. Given that POWER9 systems often have much higher thread counts than this, this limit may be suboptimal for POWER9. | ||
Line 38: | Line 51: | ||
/** | /** | ||
Only in mesa-17.3.9/src/gallium/drivers/llvmpipe: lp_limits.h~ | Only in mesa-17.3.9/src/gallium/drivers/llvmpipe: lp_limits.h~ | ||
+ | |||
+ | === To 32 Threads (Merged) === | ||
[[User:JeremyRand|JeremyRand]] benchmarked OpenArena with LLVMpipe, and found that for 1920x1200 resolution on a 2x 4-core Talos II running Debian Bullseye, the following Mesa patch improved performance: | [[User:JeremyRand|JeremyRand]] benchmarked OpenArena with LLVMpipe, and found that for 1920x1200 resolution on a 2x 4-core Talos II running Debian Bullseye, the following Mesa patch improved performance: | ||
Line 84: | Line 99: | ||
It would be desirable to compare 32 threads to 64 threads on the same setup so that Jeremy's results and Nashimus's results can be more directly compared. | It would be desirable to compare 32 threads to 64 threads on the same setup so that Jeremy's results and Nashimus's results can be more directly compared. | ||
− | + | Nashimus then tested <code>#define LP_MAX_THREADS 144</code> with a more recent Mesa, with the following results: | |
+ | |||
+ | # Nashimus - Fedora 37 | ||
+ | |||
+ | Benchmarks should take around 4 hours to run, be patient! | ||
+ | |||
+ | CPU(s): 144 | ||
+ | OpenGL version string: 4.5 (Compatibility Profile) Mesa 22.2.1 | ||
+ | MODE: 3, 640 x 480 fullscreen hz:N/A | ||
+ | |||
+ | Frames TotalTime averageFPS minimum/average/maximum/std deviation | ||
+ | |||
+ | 16 Threads: | ||
+ | 3398 frames 185.7 seconds 18.3 fps 37.0/54.7/634.0/9.8 ms | ||
+ | 3398 frames 180.8 seconds 18.8 fps 36.0/53.2/653.0/9.7 ms | ||
+ | 3398 frames 197.1 seconds 17.2 fps 38.0/58.0/683.0/10.6 ms | ||
+ | 3398 frames 176.2 seconds 19.3 fps 35.0/51.9/673.0/9.5 ms | ||
+ | 3398 frames 186.5 seconds 18.2 fps 37.0/54.9/694.0/9.8 ms | ||
+ | |||
+ | 32 Threads: | ||
+ | 3398 frames 184.6 seconds 18.4 fps 32.0/54.3/683.0/9.8 ms | ||
+ | 3398 frames 199.4 seconds 17.0 fps 40.0/58.7/613.0/10.7 ms | ||
+ | 3398 frames 197.6 seconds 17.2 fps 38.0/58.1/641.0/10.6 ms | ||
+ | 3398 frames 186.2 seconds 18.3 fps 36.0/54.8/599.0/9.9 ms | ||
+ | 3398 frames 196.3 seconds 17.3 fps 39.0/57.8/604.0/10.3 ms | ||
+ | |||
+ | 48 Threads: | ||
+ | 3398 frames 197.6 seconds 17.2 fps 39.0/58.1/584.0/10.8 ms | ||
+ | 3398 frames 194.5 seconds 17.5 fps 34.0/57.2/571.0/10.4 ms | ||
+ | 3398 frames 194.6 seconds 17.5 fps 38.0/57.3/573.0/10.1 ms | ||
+ | 3398 frames 196.7 seconds 17.3 fps 38.0/57.9/597.0/10.7 ms | ||
+ | 3398 frames 176.3 seconds 19.3 fps 35.0/51.9/570.0/9.1 ms | ||
+ | |||
+ | 64 Threads: | ||
+ | 3398 frames 197.5 seconds 17.2 fps 38.0/58.1/575.0/10.6 ms | ||
+ | 3398 frames 175.2 seconds 19.4 fps 33.0/51.6/578.0/9.7 ms | ||
+ | 3398 frames 194.5 seconds 17.5 fps 37.0/57.3/608.0/10.6 ms | ||
+ | 3398 frames 173.4 seconds 19.6 fps 34.0/51.0/581.0/9.1 ms | ||
+ | 3398 frames 185.8 seconds 18.3 fps 35.0/54.7/587.0/9.5 ms | ||
+ | |||
+ | 72 Threads: | ||
+ | 3398 frames 195.2 seconds 17.4 fps 38.0/57.5/586.0/10.8 ms | ||
+ | 3398 frames 195.3 seconds 17.4 fps 39.0/57.5/587.0/10.6 ms | ||
+ | 3398 frames 182.8 seconds 18.6 fps 36.0/53.8/580.0/9.8 ms | ||
+ | 3398 frames 188.2 seconds 18.1 fps 37.0/55.4/580.0/10.0 ms | ||
+ | 3398 frames 197.9 seconds 17.2 fps 39.0/58.2/566.0/10.7 ms | ||
+ | |||
+ | 80 Threads: | ||
+ | 3398 frames 199.1 seconds 17.1 fps 40.0/58.6/568.0/10.7 ms | ||
+ | 3398 frames 179.4 seconds 18.9 fps 36.0/52.8/586.0/9.5 ms | ||
+ | 3398 frames 197.2 seconds 17.2 fps 39.0/58.0/580.0/10.5 ms | ||
+ | 3398 frames 179.9 seconds 18.9 fps 36.0/52.9/572.0/9.4 ms | ||
+ | 3398 frames 195.3 seconds 17.4 fps 38.0/57.5/585.0/10.6 ms | ||
+ | |||
+ | 88 Threads: | ||
+ | 3398 frames 190.8 seconds 17.8 fps 36.0/56.2/568.0/10.0 ms | ||
+ | 3398 frames 189.0 seconds 18.0 fps 35.0/55.6/609.0/10.2 ms | ||
+ | 3398 frames 197.3 seconds 17.2 fps 39.0/58.1/561.0/10.8 ms | ||
+ | 3398 frames 197.3 seconds 17.2 fps 40.0/58.1/577.0/10.5 ms | ||
+ | 3398 frames 175.4 seconds 19.4 fps 34.0/51.6/579.0/9.2 ms | ||
+ | |||
+ | 96 Threads: | ||
+ | 3398 frames 190.6 seconds 17.8 fps 37.0/56.1/595.0/10.6 ms | ||
+ | 3398 frames 176.6 seconds 19.2 fps 35.0/52.0/620.0/8.9 ms | ||
+ | 3398 frames 199.0 seconds 17.1 fps 39.0/58.6/607.0/10.6 ms | ||
+ | 3398 frames 195.8 seconds 17.4 fps 38.0/57.6/565.0/10.6 ms | ||
+ | 3398 frames 175.6 seconds 19.4 fps 34.0/51.7/587.0/9.2 ms | ||
+ | |||
+ | 128 Threads: | ||
+ | 3398 frames 197.1 seconds 17.2 fps 40.0/58.0/581.0/10.6 ms | ||
+ | 3398 frames 176.4 seconds 19.3 fps 35.0/51.9/611.0/9.4 ms | ||
+ | 3398 frames 197.3 seconds 17.2 fps 39.0/58.1/586.0/10.6 ms | ||
+ | 3398 frames 194.5 seconds 17.5 fps 37.0/57.2/580.0/10.9 ms | ||
+ | 3398 frames 184.8 seconds 18.4 fps 36.0/54.4/591.0/9.6 ms | ||
+ | |||
+ | 144 Threads: | ||
+ | 3398 frames 197.0 seconds 17.2 fps 38.0/58.0/595.0/10.6 ms | ||
+ | 3398 frames 198.2 seconds 17.1 fps 40.0/58.3/596.0/10.6 ms | ||
+ | 3398 frames 197.7 seconds 17.2 fps 38.0/58.2/599.0/10.6 ms | ||
+ | 3398 frames 196.3 seconds 17.3 fps 37.0/57.8/589.0/10.6 ms | ||
+ | 3398 frames 188.9 seconds 18.0 fps 38.0/55.6/600.0/9.7 ms | ||
+ | |||
+ | 160 Threads: | ||
+ | 3398 frames 199.3 seconds 17.1 fps 38.0/58.6/579.0/10.4 ms | ||
+ | 3398 frames 198.0 seconds 17.2 fps 39.0/58.3/565.0/10.5 ms | ||
+ | 3398 frames 197.3 seconds 17.2 fps 39.0/58.1/574.0/10.6 ms | ||
+ | 3398 frames 181.1 seconds 18.8 fps 36.0/53.3/594.0/10.0 ms | ||
+ | 3398 frames 194.0 seconds 17.5 fps 37.0/57.1/595.0/10.6 ms | ||
+ | |||
+ | 176 Threads: | ||
+ | 3398 frames 197.3 seconds 17.2 fps 37.0/58.1/573.0/10.6 ms | ||
+ | 3398 frames 197.6 seconds 17.2 fps 39.0/58.2/581.0/10.5 ms | ||
+ | 3398 frames 195.9 seconds 17.3 fps 39.0/57.6/579.0/10.5 ms | ||
+ | 3398 frames 189.5 seconds 17.9 fps 35.0/55.8/613.0/10.9 ms | ||
+ | 3398 frames 189.9 seconds 17.9 fps 38.0/55.9/579.0/10.0 ms | ||
+ | |||
+ | [[User:Thum|Thum]] tested <code>#define LP_MAX_THREADS 176</code> with 2 RAM modules, with the following results: | ||
+ | |||
+ | thum@tls0:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | tail -n1 | ||
+ | performance | ||
+ | thum@tls0:~$ uname -r | ||
+ | 5.19.0-2-powerpc64le | ||
+ | thum@tls0:~/llvmpipe-multithreaded-openarena-benchmark$ ./benchmark.sh | ||
+ | Benchmarks should take around 4 hours to run, be patient! | ||
+ | |||
+ | CPU(s): 176 | ||
+ | OpenGL version string: 4.5 (Compatibility Profile) Mesa 22.2.0 | ||
+ | MODE: 3, 640 x 480 windowed hz:N/A | ||
+ | |||
+ | Frames TotalTime averageFPS minimum/average/maximum/std deviation | ||
+ | |||
+ | 16 Threads: | ||
+ | 3398 frames 107.0 seconds 31.7 fps 13.0/31.5/4209.0/16.7 ms | ||
+ | 3398 frames 107.0 seconds 31.8 fps 12.0/31.5/4174.0/16.4 ms | ||
+ | 3398 frames 107.3 seconds 31.7 fps 12.0/31.6/4214.0/16.5 ms | ||
+ | 3398 frames 109.5 seconds 31.0 fps 14.0/32.2/4269.0/16.3 ms | ||
+ | 3398 frames 106.6 seconds 31.9 fps 12.0/31.4/4192.0/16.8 ms | ||
+ | |||
+ | 32 Threads: | ||
+ | 3398 frames 104.3 seconds 32.6 fps 11.0/30.7/4192.0/17.2 ms | ||
+ | 3398 frames 106.8 seconds 31.8 fps 12.0/31.4/4182.0/17.0 ms | ||
+ | 3398 frames 106.6 seconds 31.9 fps 11.0/31.4/4176.0/16.9 ms | ||
+ | 3398 frames 105.3 seconds 32.3 fps 11.0/31.0/4206.0/16.9 ms | ||
+ | 3398 frames 105.3 seconds 32.3 fps 11.0/31.0/4305.0/16.8 ms | ||
+ | |||
+ | 48 Threads: | ||
+ | 3398 frames 109.2 seconds 31.1 fps 12.0/32.1/4186.0/17.0 ms | ||
+ | 3398 frames 109.9 seconds 30.9 fps 12.0/32.4/4176.0/16.9 ms | ||
+ | 3398 frames 112.2 seconds 30.3 fps 12.0/33.0/4230.0/16.9 ms | ||
+ | 3398 frames 107.2 seconds 31.7 fps 11.0/31.6/4206.0/17.0 ms | ||
+ | 3398 frames 107.4 seconds 31.7 fps 11.0/31.6/4194.0/16.9 ms | ||
+ | |||
+ | 64 Threads: | ||
+ | 3398 frames 113.6 seconds 29.9 fps 13.0/33.4/4140.0/17.1 ms | ||
+ | 3398 frames 115.4 seconds 29.4 fps 14.0/34.0/4175.0/17.0 ms | ||
+ | 3398 frames 116.7 seconds 29.1 fps 14.0/34.3/4233.0/17.0 ms | ||
+ | 3398 frames 113.5 seconds 29.9 fps 13.0/33.4/4199.0/17.0 ms | ||
+ | 3398 frames 113.5 seconds 30.0 fps 12.0/33.4/4204.0/17.0 ms | ||
+ | |||
+ | 72 Threads: | ||
+ | 3398 frames 117.1 seconds 29.0 fps 13.0/34.5/4201.0/17.0 ms | ||
+ | 3398 frames 118.2 seconds 28.7 fps 14.0/34.8/4225.0/17.1 ms | ||
+ | 3398 frames 116.9 seconds 29.1 fps 13.0/34.4/4239.0/17.1 ms | ||
+ | 3398 frames 116.2 seconds 29.2 fps 13.0/34.2/4178.0/17.1 ms | ||
+ | 3398 frames 116.2 seconds 29.2 fps 13.0/34.2/4181.0/17.1 ms | ||
+ | |||
+ | 80 Threads: | ||
+ | 3398 frames 117.9 seconds 28.8 fps 13.0/34.7/4190.0/17.1 ms | ||
+ | 3398 frames 118.2 seconds 28.8 fps 14.0/34.8/4174.0/17.1 ms | ||
+ | 3398 frames 117.3 seconds 29.0 fps 14.0/34.5/4216.0/17.1 ms | ||
+ | 3398 frames 119.1 seconds 28.5 fps 13.0/35.0/4205.0/17.1 ms | ||
+ | 3398 frames 117.9 seconds 28.8 fps 13.0/34.7/4186.0/17.2 ms | ||
+ | |||
+ | 88 Threads: | ||
+ | 3398 frames 119.2 seconds 28.5 fps 14.0/35.1/4224.0/17.2 ms | ||
+ | 3398 frames 119.0 seconds 28.6 fps 13.0/35.0/4204.0/17.3 ms | ||
+ | 3398 frames 118.5 seconds 28.7 fps 14.0/34.9/4179.0/17.3 ms | ||
+ | 3398 frames 119.3 seconds 28.5 fps 14.0/35.1/4212.0/17.2 ms | ||
+ | 3398 frames 119.0 seconds 28.6 fps 13.0/35.0/4223.0/17.2 ms | ||
+ | |||
+ | 96 Threads: | ||
+ | 3398 frames 120.5 seconds 28.2 fps 14.0/35.4/4210.0/17.4 ms | ||
+ | 3398 frames 120.2 seconds 28.3 fps 14.0/35.4/4223.0/17.3 ms | ||
+ | 3398 frames 119.5 seconds 28.4 fps 12.0/35.2/4251.0/17.3 ms | ||
+ | 3398 frames 120.6 seconds 28.2 fps 14.0/35.5/4169.0/17.4 ms | ||
+ | 3398 frames 119.6 seconds 28.4 fps 14.0/35.2/4213.0/17.2 ms | ||
+ | |||
+ | 128 Threads: | ||
+ | 3398 frames 122.3 seconds 27.8 fps 14.0/36.0/4211.0/17.5 ms | ||
+ | 3398 frames 122.6 seconds 27.7 fps 14.0/36.1/4231.0/17.5 ms | ||
+ | 3398 frames 122.9 seconds 27.7 fps 14.0/36.2/4224.0/17.5 ms | ||
+ | 3398 frames 123.0 seconds 27.6 fps 15.0/36.2/4199.0/17.6 ms | ||
+ | 3398 frames 122.4 seconds 27.8 fps 14.0/36.0/4194.0/17.4 ms | ||
+ | |||
+ | 144 Threads: | ||
+ | 3398 frames 123.7 seconds 27.5 fps 14.0/36.4/4144.0/17.6 ms | ||
+ | 3398 frames 124.2 seconds 27.4 fps 14.0/36.6/4211.0/17.5 ms | ||
+ | 3398 frames 124.1 seconds 27.4 fps 14.0/36.5/4200.0/17.6 ms | ||
+ | 3398 frames 124.6 seconds 27.3 fps 14.0/36.7/4189.0/17.6 ms | ||
+ | 3398 frames 123.9 seconds 27.4 fps 14.0/36.5/4203.0/17.6 ms | ||
+ | |||
+ | 160 Threads: | ||
+ | 3398 frames 125.6 seconds 27.1 fps 14.0/37.0/4262.0/17.7 ms | ||
+ | 3398 frames 125.9 seconds 27.0 fps 14.0/37.1/4224.0/17.7 ms | ||
+ | 3398 frames 125.6 seconds 27.1 fps 15.0/37.0/4200.0/17.7 ms | ||
+ | 3398 frames 125.9 seconds 27.0 fps 14.0/37.1/4253.0/17.8 ms | ||
+ | 3398 frames 126.3 seconds 26.9 fps 15.0/37.2/4284.0/17.7 ms | ||
+ | |||
+ | 176 Threads: | ||
+ | 3398 frames 128.2 seconds 26.5 fps 15.0/37.7/4207.0/17.8 ms | ||
+ | 3398 frames 128.0 seconds 26.6 fps 14.0/37.7/4278.0/17.7 ms | ||
+ | 3398 frames 128.0 seconds 26.6 fps 14.0/37.7/4256.0/17.9 ms | ||
+ | 3398 frames 128.3 seconds 26.5 fps 14.0/37.8/4300.0/17.9 ms | ||
+ | 3398 frames 127.7 seconds 26.6 fps 15.0/37.6/4211.0/17.8 ms | ||
+ | |||
+ | It would be useful to get test results with 8 RAM modules for maximum memory bandwidth. | ||
=== Improving Thread Utilization === | === Improving Thread Utilization === | ||
+ | |||
+ | MRs: | ||
+ | |||
+ | * [https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14923 llvmpipe/lavapipe: add support for overlapping vertex and fragment processing.] | ||
+ | ** [https://airlied.blogspot.com/2022/02/optimising-llvmpipe-vertexfragment.html optimizing llvmpipe vertex/fragment processing.] | ||
+ | ** Merged Feb 21, 2022. | ||
+ | ** First released in Mesa 22.1.0. | ||
+ | ** First packaged in Debian Bookworm, Fedora 36, Ubuntu 22.10. | ||
From #dri-devel <ref>[https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2022-10-04 #dri-devel 2022-10-04]</ref> <ref>[https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2022-10-05 #dri-devel 2022-10-05]</ref>: | From #dri-devel <ref>[https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2022-10-04 #dri-devel 2022-10-04]</ref> <ref>[https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2022-10-05 #dri-devel 2022-10-05]</ref>: | ||
Line 134: | Line 352: | ||
== Vector Optimizations == | == Vector Optimizations == | ||
+ | |||
+ | It would be desirable to see the output of the <code>perf.sh</code> benchmark for OpenArena (see above link) on Debian Bookworm, so we can determine where bottlenecks might be in current Mesa versions. | ||
As of 2022 June 10, grepping <code>main</code> branch <code>src/gallium/drivers/llvmpipe/</code> for <code>altivec</code> yields only 2 files (<code>lp_rast_tri.c</code> and <code>lp_setup_tri.c</code>, both of which are POWER8 LE), while grepping for <code>PIPE_ARCH_SSE</code> yields 11 files. This seems to suggest that a lot of [[Power_ISA/Vector_Operations|POWER vector optimizations]] are missing from LLVMpipe. POWER9 vector optimizations (the LLVM <code>power9-vector</code> feature), and POWER8 BE optimizations, appear to be completely absent. | As of 2022 June 10, grepping <code>main</code> branch <code>src/gallium/drivers/llvmpipe/</code> for <code>altivec</code> yields only 2 files (<code>lp_rast_tri.c</code> and <code>lp_setup_tri.c</code>, both of which are POWER8 LE), while grepping for <code>PIPE_ARCH_SSE</code> yields 11 files. This seems to suggest that a lot of [[Power_ISA/Vector_Operations|POWER vector optimizations]] are missing from LLVMpipe. POWER9 vector optimizations (the LLVM <code>power9-vector</code> feature), and POWER8 BE optimizations, appear to be completely absent. | ||
+ | |||
+ | === <code>lp_rast_tri.c</code> and <code>lp_setup_tri.c</code> === | ||
+ | |||
+ | markos looked at the existing Altivec code in LLVMpipe (as of <code>main</code> 2022 October 6) and observed a lot of SSE-isms, probably because the LLVMpipe Altivec code was translated from the SSE code. It is likely that rewriting the Altivec code would yield performance improvements. | ||
=== <code>calc_fixed_position</code> === | === <code>calc_fixed_position</code> === | ||
Line 141: | Line 365: | ||
[[User:JeremyRand|JeremyRand]] ran the <code>perf.sh</code> benchmark for OpenArena (see above link) on Debian Bullseye (with 32 threads, see above patch), and found that 0.21% of CPU time (6th-highest-ranked function) was used by <code>triangle_ccw</code>, which is mostly a wrapper for the inline function <code>calc_fixed_position</code> in <code>lp_setup_tri.c</code>. This happens to be the only function that uses SSE in Debian Bullseye but is missing an Altivec implementation in <code>main</code> branch (other SSE-utilizing functions were added to <code>main</code> after Debian Bullseye). | [[User:JeremyRand|JeremyRand]] ran the <code>perf.sh</code> benchmark for OpenArena (see above link) on Debian Bullseye (with 32 threads, see above patch), and found that 0.21% of CPU time (6th-highest-ranked function) was used by <code>triangle_ccw</code>, which is mostly a wrapper for the inline function <code>calc_fixed_position</code> in <code>lp_setup_tri.c</code>. This happens to be the only function that uses SSE in Debian Bullseye but is missing an Altivec implementation in <code>main</code> branch (other SSE-utilizing functions were added to <code>main</code> after Debian Bullseye). | ||
− | Jeremy enabled [https://www.talospace.com/2019/07/easier-power-vectorizing-for-fun-and.html build-time SSE intrinsics translation] in the <code>calc_fixed_position</code> function via | + | Jeremy enabled [https://www.talospace.com/2019/07/easier-power-vectorizing-for-fun-and.html build-time SSE intrinsics translation] in the <code>calc_fixed_position</code> function via [[File:LLVMpipe-Emulate-SSE-intrinsics-in-calc_fixed_position.patch]]: |
When combined with the 32-thread patch from the above section, Jeremy obtained the following benchmarks in OpenArena: | When combined with the 32-thread patch from the above section, Jeremy obtained the following benchmarks in OpenArena: | ||
Line 221: | Line 403: | ||
144 threads: 3398 frames 208.5 seconds 16.3 fps 30.0/61.4/8364.0/21.8 ms | 144 threads: 3398 frames 208.5 seconds 16.3 fps 30.0/61.4/8364.0/21.8 ms | ||
144 threads and SIMD patch: 3398 frames 208.7 seconds 16.3 fps 30.0/61.4/8327.0/21.8 ms | 144 threads and SIMD patch: 3398 frames 208.7 seconds 16.3 fps 30.0/61.4/8327.0/21.8 ms | ||
+ | |||
+ | Jeremy obtained the following apitrace benchmarks, indicating that the patch does not appear to produce any measurable performance change in OpenArena: | ||
+ | |||
+ | $ ./benchmark-apitrace-openarena.sh | ||
+ | Benchmarks should take around 4 hours to run, be patient! | ||
+ | |||
+ | CPU(s): 32 | ||
+ | Memory: | ||
+ | WARNING: you should run this program as super-user. | ||
+ | size: 8GiB | ||
+ | size: 64GiB | ||
+ | size: 16GiB | ||
+ | size: 64GiB | ||
+ | size: 8GiB | ||
+ | size: 64GiB | ||
+ | size: 16GiB | ||
+ | size: 8GiB | ||
+ | WARNING: output may be incomplete or inaccurate, you should run this program as super-user. | ||
+ | OpenGL version string: 4.5 (Compatibility Profile) Mesa 22.2.3 | ||
+ | LP_MAX_THREADS: 16 | ||
+ | |||
+ | 16 Threads: | ||
+ | control Rendered 3436 frames in 248.303 secs, average of 13.8379 fps | ||
+ | experimental Rendered 3436 frames in 247.75 secs, average of 13.8688 fps | ||
+ | control Rendered 3436 frames in 247.058 secs, average of 13.9077 fps | ||
+ | experimental Rendered 3436 frames in 248.886 secs, average of 13.8055 fps | ||
+ | control Rendered 3436 frames in 247.324 secs, average of 13.8927 fps | ||
+ | experimental Rendered 3436 frames in 248.16 secs, average of 13.8459 fps | ||
+ | control Rendered 3436 frames in 247.34 secs, average of 13.8918 fps | ||
+ | experimental Rendered 3436 frames in 247.173 secs, average of 13.9012 fps | ||
+ | control Rendered 3436 frames in 246.754 secs, average of 13.9248 fps | ||
+ | experimental Rendered 3436 frames in 246.935 secs, average of 13.9146 fps | ||
+ | control Rendered 3436 frames in 246.852 secs, average of 13.9193 fps | ||
+ | experimental Rendered 3436 frames in 247.166 secs, average of 13.9016 fps | ||
+ | control Rendered 3436 frames in 247.431 secs, average of 13.8867 fps | ||
+ | experimental Rendered 3436 frames in 247.036 secs, average of 13.9089 fps | ||
+ | control Rendered 3436 frames in 245.538 secs, average of 13.9937 fps | ||
+ | experimental Rendered 3436 frames in 256.544 secs, average of 13.3934 fps | ||
+ | control Rendered 3436 frames in 246.85 secs, average of 13.9194 fps | ||
+ | experimental Rendered 3436 frames in 246.722 secs, average of 13.9266 fps | ||
+ | control Rendered 3436 frames in 247.268 secs, average of 13.8958 fps | ||
+ | experimental Rendered 3436 frames in 248.315 secs, average of 13.8372 fps | ||
+ | |||
+ | === lp_linear_fastpath.c === | ||
+ | |||
+ | It would be useful to find test cases that exercise <code>lp_linear_fastpath.c</code> (2D fastpath of LLVMpipe), so that we can run benchmarks of porting it to POWER9. | ||
+ | |||
+ | The fastpath does not appear to be exercised by Falcon-mkxp as of Fedora 36 on x86_64 according to tests by Jeremy. | ||
+ | |||
+ | According to DemiMarie and Dave Airlie, the fastpath may be exercised by GTK4 apps, but is not currently exercised by GNOME Shell <ref>[https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2022-11-27 #dri-devel 2022-11-27]</ref>: | ||
+ | |||
+ | <blockquote> | ||
+ | '''00:49 Jeremy_Rand_Talos__''': Is there a recommended way to benchmark the LLVMpipe 2D fastpath that was introduced in MR !11969 ? Preferably a GNU/Linux application that I can run through apitrace.<br /> | ||
+ | '''00:52 Jeremy_Rand_Talos__''': The MR text mentions various Windows things (I prefer Linux), desktop compositors (seems hard to run through apitrace?), HTML browsers (ditto due to sandboxing), etc. I assume I can't be the first person who wants to do such testing, so presumably there's already more info about what apps are suitable that I failed to find?<br /> | ||
+ | '''01:57 DemiMarie''': GTK4 apps<br /> | ||
+ | '''02:51 airlied''': Jeremy_Rand_Talos: not really, lots of things use it for some paths like blits and copies<br /> | ||
+ | '''02:52 airlied''': id like to get gnome shell using it, but depth buffers stops it<br /> | ||
+ | '''04:54 DemiMarie''': GTK4 uses OpenGL for rendering<br /> | ||
+ | </blockquote> | ||
+ | |||
+ | === Other linear rasterizer functions === | ||
+ | |||
+ | Jeremy found that GNOME Maps (in Fedora 37) does exercise some of the linear rasterizer functions, so it may be a candidate for benchmarking LLVMpipe SIMD improvements. However, its usage seems relatively small, so it may not yield a large performance bump. | ||
+ | |||
+ | Strangely, on Fedora 37 x86_64, the various <code>lp_linear</code> functions show up in <code>perf</code> output when running <code>perf</code> directly on GNOME Maps, but they do not show up when replaying an apitrace. This probably poses a problem for benchmarking. | ||
=== Other functions === | === Other functions === | ||
Line 257: | Line 504: | ||
3398 frames 191.6 seconds 17.7 fps 31.0/56.4/1587.0/14.5 ms | 3398 frames 191.6 seconds 17.7 fps 31.0/56.4/1587.0/14.5 ms | ||
3398 frames 191.4 seconds 17.8 fps 31.0/56.3/1510.0/13.6 ms | 3398 frames 191.4 seconds 17.8 fps 31.0/56.3/1510.0/13.6 ms | ||
+ | |||
+ | == References == | ||
+ | |||
+ | <references/> | ||
[[Category:Ports]] | [[Category:Ports]] |
Latest revision as of 21:14, 3 December 2022
LLVMpipe (source code) runs on POWER, but performance has room for improvement.
Testing Notes
If your machine has a discrete accelerated GPU, then you'll probably need to set these environment variables in order to test LLVMpipe:
export LIBGL_ALWAYS_SOFTWARE=true
export GALLIUM_DRIVER=llvmpipe
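As a minimal sketch, the same two variables can be set for a single benchmark run instead of the whole shell session. The helper names `llvmpipe_env` and `run_with_llvmpipe` are hypothetical; only the two environment variables come from the notes above.

```python
import os
import subprocess


def llvmpipe_env(base=None):
    """Return a copy of the environment that forces software rendering via LLVMpipe."""
    env = dict(os.environ if base is None else base)
    env["LIBGL_ALWAYS_SOFTWARE"] = "true"
    env["GALLIUM_DRIVER"] = "llvmpipe"
    return env


def run_with_llvmpipe(cmd):
    """Run cmd (a list of arguments) under LLVMpipe and return its exit code."""
    return subprocess.run(cmd, env=llvmpipe_env()).returncode
```

For example, `run_with_llvmpipe(["./benchmark.sh"])` would start the OpenArena benchmark script without touching the parent shell's environment.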
Potentially useful benchmarking tools
- glmark2 (does not seem very useful for noticing small LLVMpipe optimizations)
- OpenArena (seems to work okay for testing LLVMpipe optimizations).
  - Download this benchmarking script. To compare FPS with different thread counts, run ./benchmark.sh; to measure which functions are potential bottlenecks, run ./perf.sh.
- Xonotic (haven't tried it yet)
- list of desktop graphics tests run by Phoronix
- list of games that run on POWER9
- Application traces replay
Building patched Mesa from source
On Debian (might also work for derivatives such as Devuan and Ubuntu):
- apt source mesa
- cd to the source directory that was created by apt source.
- Install build-essential, and all packages listed in Build-Depends and Build-Depends-indep fields of debian/control.
- Apply whatever patches you like to the source.
- Add an entry to the top of debian/changelog with a new version number (incrementing the last number is an okay approach), so that apt install will know it's a new version.
- dpkg-buildpackage -us -uc
- sudo apt install ../*.deb
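The "incrementing the last number" step can be sketched as follows. This is only an illustration: `bump_last_number` is a hypothetical helper, and the version string in the example is made up rather than taken from a real Mesa package.

```python
import re


def bump_last_number(version):
    """Increment the last run of digits in a Debian-style version string."""
    def _inc(match):
        # group(1) is the last number, group(2) any non-digit suffix after it.
        return str(int(match.group(1)) + 1) + match.group(2)

    bumped, count = re.subn(r"(\d+)(\D*)$", _inc, version, count=1)
    if count == 0:
        raise ValueError(f"no number found in version {version!r}")
    return bumped


print(bump_last_number("20.3.5-1"))  # prints 20.3.5-2
```

The result is what would go into the new entry at the top of debian/changelog; the rest of the entry (distribution, description, signature line) still has to follow the usual changelog format.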
Thread Count
LLVMpipe is limited to 16 threads. The only easily findable justification for this limit is in a commit from 2013, where it was increased from 8 because a user reported on a mailing list that 16 was faster for them. Given that POWER9 systems often have much higher thread counts than this, this limit may be suboptimal for POWER9.
luke-jr reports that bumping the limit to 128 threads noticeably improved performance in 3D games, e.g. Jedi Academy (detail set to High, Texture to Very High, Texture Filter BILINEAR, Detailed Shaders ON, Video Sync OFF, resolution 800x600) went from ~15 fps with 16 threads to ~25 fps with 64 threads (on a 2x 8-core Talos II). However, it also had the side effect that most GUI applications spawned more LLVMpipe threads, which was annoying in gdb/top. Luke worked around this issue by setting the environment variable LP_NUM_THREADS=2 globally, and overriding for 3D applications that needed more threads. Luke's patch is:
diff -ur mesa-17.3.9.orig/src/gallium/drivers/llvmpipe/lp_limits.h mesa-17.3.9/src/gallium/drivers/llvmpipe/lp_limits.h
--- mesa-17.3.9.orig/src/gallium/drivers/llvmpipe/lp_limits.h 2018-04-18 04:44:00.000000000 -0400
+++ mesa-17.3.9/src/gallium/drivers/llvmpipe/lp_limits.h 2018-05-02 05:20:57.586000000 -0400
@@ -61,7 +61,7 @@
 #define LP_MAX_WIDTH (1 << (LP_MAX_TEXTURE_LEVELS - 1))
-#define LP_MAX_THREADS 16
+#define LP_MAX_THREADS 128
 /**
Only in mesa-17.3.9/src/gallium/drivers/llvmpipe: lp_limits.h~
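Luke's workaround (a low global LP_NUM_THREADS with per-application overrides) could be sketched like this. The override table, the application names in it, and the helper `env_for` are hypothetical; only the LP_NUM_THREADS variable and the default of 2 come from the description above. The requested value presumably still cannot exceed the compiled-in LP_MAX_THREADS.

```python
import os

DEFAULT_THREADS = 2  # global default, as in Luke's workaround

# Hypothetical per-application overrides; names and counts are examples only.
THREAD_OVERRIDES = {
    "openarena": 64,
}


def env_for(app, base=None):
    """Return an environment with LP_NUM_THREADS chosen for the given application."""
    env = dict(os.environ if base is None else base)
    env["LP_NUM_THREADS"] = str(THREAD_OVERRIDES.get(app, DEFAULT_THREADS))
    return env
```

A launcher script could pass `env_for("openarena")` as the `env` argument of `subprocess.run`, so that ordinary GUI applications keep the small default while 3D applications get more threads.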
To 32 Threads (Merged)
JeremyRand benchmarked OpenArena with LLVMpipe, and found that for 1920x1200 resolution on a 2x 4-core Talos II running Debian Bullseye, the following Mesa patch improved performance:
diff -ur stock/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h
--- stock/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h 2021-03-24 14:10:48.744070300 -0500
+++ 32-threads/mesa-20.3.5/src/gallium/drivers/llvmpipe/lp_limits.h 2022-07-02 04:08:46.880000000 -0500
@@ -66,7 +66,9 @@
 #define LP_MAX_SAMPLES 4
-#define LP_MAX_THREADS 16
+// Bumped by Jeremy
+//#define LP_MAX_THREADS 16
+#define LP_MAX_THREADS 32
 /**
Before the 32-thread patch:
3398 frames 487.6 seconds 7.0 fps 83.0/143.5/7186.0/25.1 ms
3398 frames 472.7 seconds 7.2 fps 86.0/139.1/2242.0/22.9 ms
3398 frames 466.5 seconds 7.3 fps 86.0/137.3/943.0/21.1 ms
3398 frames 467.5 seconds 7.3 fps 83.0/137.6/846.0/22.4 ms
3398 frames 474.8 seconds 7.2 fps 86.0/139.7/779.0/22.8 ms
After the 32-thread patch:
3398 frames 417.7 seconds 8.1 fps 77.0/122.9/1748.0/18.9 ms
3398 frames 417.9 seconds 8.1 fps 76.0/123.0/997.0/19.6 ms
3398 frames 419.9 seconds 8.1 fps 76.0/123.6/806.0/19.8 ms
3398 frames 422.7 seconds 8.0 fps 75.0/124.4/758.0/21.1 ms
3398 frames 419.1 seconds 8.1 fps 75.0/123.3/730.0/20.4 ms
The 32-thread patch was merged into Mesa on 2022 October 4.
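To put a single number on the before/after runs for the 32-thread patch, the rows can be aggregated as total frames divided by total seconds. This is a rough sketch: the function name `mean_fps` is made up here, and the data is copied from the two result blocks above.

```python
import re

BEFORE = """\
3398 frames 487.6 seconds 7.0 fps 83.0/143.5/7186.0/25.1 ms
3398 frames 472.7 seconds 7.2 fps 86.0/139.1/2242.0/22.9 ms
3398 frames 466.5 seconds 7.3 fps 86.0/137.3/943.0/21.1 ms
3398 frames 467.5 seconds 7.3 fps 83.0/137.6/846.0/22.4 ms
3398 frames 474.8 seconds 7.2 fps 86.0/139.7/779.0/22.8 ms
"""

AFTER = """\
3398 frames 417.7 seconds 8.1 fps 77.0/122.9/1748.0/18.9 ms
3398 frames 417.9 seconds 8.1 fps 76.0/123.0/997.0/19.6 ms
3398 frames 419.9 seconds 8.1 fps 76.0/123.6/806.0/19.8 ms
3398 frames 422.7 seconds 8.0 fps 75.0/124.4/758.0/21.1 ms
3398 frames 419.1 seconds 8.1 fps 75.0/123.3/730.0/20.4 ms
"""

ROW = re.compile(r"(\d+) frames ([\d.]+) seconds")


def mean_fps(text):
    """Aggregate FPS over all runs: total frames divided by total seconds."""
    rows = [m for m in (ROW.search(line) for line in text.splitlines()) if m]
    frames = sum(int(m.group(1)) for m in rows)
    seconds = sum(float(m.group(2)) for m in rows)
    return frames / seconds


before = mean_fps(BEFORE)
after = mean_fps(AFTER)
print(f"before: {before:.2f} fps, after: {after:.2f} fps, speedup: {after / before:.3f}x")
```

On this data the patch comes out at roughly a 13% improvement, which matches the per-run FPS columns (about 7.2 fps before, about 8.1 fps after).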
To More Than 32 Threads
Nashimus tested #define LP_MAX_THREADS 144 on a 2x 18-core Talos II, with the following results:
16 threads: 3398 frames 312.5 seconds 10.9 fps 55.0/92.0/8375.0/21.2 ms
64 threads: 3398 frames 221.5 seconds 15.3 fps 33.0/65.2/8390.0/21.4 ms
144 threads: 3398 frames 208.5 seconds 16.3 fps 30.0/61.4/8364.0/21.8 ms
It would be desirable to compare 32 threads to 64 threads on the same setup so that Jeremy's results and Nashimus's results can be more directly compared.
Nashimus then tested #define LP_MAX_THREADS 144 with a more recent Mesa, with the following results:
# Nashimus - Fedora 37

Benchmarks should take around 4 hours to run, be patient!

CPU(s): 144
OpenGL version string: 4.5 (Compatibility Profile) Mesa 22.2.1
MODE: 3, 640 x 480 fullscreen hz:N/A

Frames TotalTime averageFPS minimum/average/maximum/std deviation

16 Threads:
3398 frames 185.7 seconds 18.3 fps 37.0/54.7/634.0/9.8 ms
3398 frames 180.8 seconds 18.8 fps 36.0/53.2/653.0/9.7 ms
3398 frames 197.1 seconds 17.2 fps 38.0/58.0/683.0/10.6 ms
3398 frames 176.2 seconds 19.3 fps 35.0/51.9/673.0/9.5 ms
3398 frames 186.5 seconds 18.2 fps 37.0/54.9/694.0/9.8 ms

32 Threads:
3398 frames 184.6 seconds 18.4 fps 32.0/54.3/683.0/9.8 ms
3398 frames 199.4 seconds 17.0 fps 40.0/58.7/613.0/10.7 ms
3398 frames 197.6 seconds 17.2 fps 38.0/58.1/641.0/10.6 ms
3398 frames 186.2 seconds 18.3 fps 36.0/54.8/599.0/9.9 ms
3398 frames 196.3 seconds 17.3 fps 39.0/57.8/604.0/10.3 ms

48 Threads:
3398 frames 197.6 seconds 17.2 fps 39.0/58.1/584.0/10.8 ms
3398 frames 194.5 seconds 17.5 fps 34.0/57.2/571.0/10.4 ms
3398 frames 194.6 seconds 17.5 fps 38.0/57.3/573.0/10.1 ms
3398 frames 196.7 seconds 17.3 fps 38.0/57.9/597.0/10.7 ms
3398 frames 176.3 seconds 19.3 fps 35.0/51.9/570.0/9.1 ms

64 Threads:
3398 frames 197.5 seconds 17.2 fps 38.0/58.1/575.0/10.6 ms
3398 frames 175.2 seconds 19.4 fps 33.0/51.6/578.0/9.7 ms
3398 frames 194.5 seconds 17.5 fps 37.0/57.3/608.0/10.6 ms
3398 frames 173.4 seconds 19.6 fps 34.0/51.0/581.0/9.1 ms
3398 frames 185.8 seconds 18.3 fps 35.0/54.7/587.0/9.5 ms

72 Threads:
3398 frames 195.2 seconds 17.4 fps 38.0/57.5/586.0/10.8 ms
3398 frames 195.3 seconds 17.4 fps 39.0/57.5/587.0/10.6 ms
3398 frames 182.8 seconds 18.6 fps 36.0/53.8/580.0/9.8 ms
3398 frames 188.2 seconds 18.1 fps 37.0/55.4/580.0/10.0 ms
3398 frames 197.9 seconds 17.2 fps 39.0/58.2/566.0/10.7 ms

80 Threads:
3398 frames 199.1 seconds 17.1 fps 40.0/58.6/568.0/10.7 ms
3398 frames 179.4 seconds 18.9 fps 36.0/52.8/586.0/9.5 ms
3398 frames 197.2 seconds 17.2 fps 39.0/58.0/580.0/10.5 ms
3398 frames 179.9 seconds 18.9 fps 36.0/52.9/572.0/9.4 ms
3398 frames 195.3 seconds 17.4 fps 38.0/57.5/585.0/10.6 ms

88 Threads:
3398 frames 190.8 seconds 17.8 fps 36.0/56.2/568.0/10.0 ms
3398 frames 189.0 seconds 18.0 fps 35.0/55.6/609.0/10.2 ms
3398 frames 197.3 seconds 17.2 fps 39.0/58.1/561.0/10.8 ms
3398 frames 197.3 seconds 17.2 fps 40.0/58.1/577.0/10.5 ms
3398 frames 175.4 seconds 19.4 fps 34.0/51.6/579.0/9.2 ms

96 Threads:
3398 frames 190.6 seconds 17.8 fps 37.0/56.1/595.0/10.6 ms
3398 frames 176.6 seconds 19.2 fps 35.0/52.0/620.0/8.9 ms
3398 frames 199.0 seconds 17.1 fps 39.0/58.6/607.0/10.6 ms
3398 frames 195.8 seconds 17.4 fps 38.0/57.6/565.0/10.6 ms
3398 frames 175.6 seconds 19.4 fps 34.0/51.7/587.0/9.2 ms

128 Threads:
3398 frames 197.1 seconds 17.2 fps 40.0/58.0/581.0/10.6 ms
3398 frames 176.4 seconds 19.3 fps 35.0/51.9/611.0/9.4 ms
3398 frames 197.3 seconds 17.2 fps 39.0/58.1/586.0/10.6 ms
3398 frames 194.5 seconds 17.5 fps 37.0/57.2/580.0/10.9 ms
3398 frames 184.8 seconds 18.4 fps 36.0/54.4/591.0/9.6 ms

144 Threads:
3398 frames 197.0 seconds 17.2 fps 38.0/58.0/595.0/10.6 ms
3398 frames 198.2 seconds 17.1 fps 40.0/58.3/596.0/10.6 ms
3398 frames 197.7 seconds 17.2 fps 38.0/58.2/599.0/10.6 ms
3398 frames 196.3 seconds 17.3 fps 37.0/57.8/589.0/10.6 ms
3398 frames 188.9 seconds 18.0 fps 38.0/55.6/600.0/9.7 ms

160 Threads:
3398 frames 199.3 seconds 17.1 fps 38.0/58.6/579.0/10.4 ms
3398 frames 198.0 seconds 17.2 fps 39.0/58.3/565.0/10.5 ms
3398 frames 197.3 seconds 17.2 fps 39.0/58.1/574.0/10.6 ms
3398 frames 181.1 seconds 18.8 fps 36.0/53.3/594.0/10.0 ms
3398 frames 194.0 seconds 17.5 fps 37.0/57.1/595.0/10.6 ms

176 Threads:
3398 frames 197.3 seconds 17.2 fps 37.0/58.1/573.0/10.6 ms
3398 frames 197.6 seconds 17.2 fps 39.0/58.2/581.0/10.5 ms
3398 frames 195.9 seconds 17.3 fps 39.0/57.6/579.0/10.5 ms
3398 frames 189.5 seconds 17.9 fps 35.0/55.8/613.0/10.9 ms
3398 frames 189.9 seconds 17.9 fps 38.0/55.9/579.0/10.0 ms
Thum tested #define LP_MAX_THREADS 176 with 2 RAM modules, with the following results:
thum@tls0:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | tail -n1
performance
thum@tls0:~$ uname -r
5.19.0-2-powerpc64le
thum@tls0:~/llvmpipe-multithreaded-openarena-benchmark$ ./benchmark.sh
Benchmarks should take around 4 hours to run, be patient!
CPU(s): 176
OpenGL version string: 4.5 (Compatibility Profile) Mesa 22.2.0
MODE: 3, 640 x 480 windowed hz:N/A
Frames TotalTime averageFPS minimum/average/maximum/std deviation
16 Threads:
3398 frames 107.0 seconds 31.7 fps 13.0/31.5/4209.0/16.7 ms
3398 frames 107.0 seconds 31.8 fps 12.0/31.5/4174.0/16.4 ms
3398 frames 107.3 seconds 31.7 fps 12.0/31.6/4214.0/16.5 ms
3398 frames 109.5 seconds 31.0 fps 14.0/32.2/4269.0/16.3 ms
3398 frames 106.6 seconds 31.9 fps 12.0/31.4/4192.0/16.8 ms
32 Threads:
3398 frames 104.3 seconds 32.6 fps 11.0/30.7/4192.0/17.2 ms
3398 frames 106.8 seconds 31.8 fps 12.0/31.4/4182.0/17.0 ms
3398 frames 106.6 seconds 31.9 fps 11.0/31.4/4176.0/16.9 ms
3398 frames 105.3 seconds 32.3 fps 11.0/31.0/4206.0/16.9 ms
3398 frames 105.3 seconds 32.3 fps 11.0/31.0/4305.0/16.8 ms
48 Threads:
3398 frames 109.2 seconds 31.1 fps 12.0/32.1/4186.0/17.0 ms
3398 frames 109.9 seconds 30.9 fps 12.0/32.4/4176.0/16.9 ms
3398 frames 112.2 seconds 30.3 fps 12.0/33.0/4230.0/16.9 ms
3398 frames 107.2 seconds 31.7 fps 11.0/31.6/4206.0/17.0 ms
3398 frames 107.4 seconds 31.7 fps 11.0/31.6/4194.0/16.9 ms
64 Threads:
3398 frames 113.6 seconds 29.9 fps 13.0/33.4/4140.0/17.1 ms
3398 frames 115.4 seconds 29.4 fps 14.0/34.0/4175.0/17.0 ms
3398 frames 116.7 seconds 29.1 fps 14.0/34.3/4233.0/17.0 ms
3398 frames 113.5 seconds 29.9 fps 13.0/33.4/4199.0/17.0 ms
3398 frames 113.5 seconds 30.0 fps 12.0/33.4/4204.0/17.0 ms
72 Threads:
3398 frames 117.1 seconds 29.0 fps 13.0/34.5/4201.0/17.0 ms
3398 frames 118.2 seconds 28.7 fps 14.0/34.8/4225.0/17.1 ms
3398 frames 116.9 seconds 29.1 fps 13.0/34.4/4239.0/17.1 ms
3398 frames 116.2 seconds 29.2 fps 13.0/34.2/4178.0/17.1 ms
3398 frames 116.2 seconds 29.2 fps 13.0/34.2/4181.0/17.1 ms
80 Threads:
3398 frames 117.9 seconds 28.8 fps 13.0/34.7/4190.0/17.1 ms
3398 frames 118.2 seconds 28.8 fps 14.0/34.8/4174.0/17.1 ms
3398 frames 117.3 seconds 29.0 fps 14.0/34.5/4216.0/17.1 ms
3398 frames 119.1 seconds 28.5 fps 13.0/35.0/4205.0/17.1 ms
3398 frames 117.9 seconds 28.8 fps 13.0/34.7/4186.0/17.2 ms
88 Threads:
3398 frames 119.2 seconds 28.5 fps 14.0/35.1/4224.0/17.2 ms
3398 frames 119.0 seconds 28.6 fps 13.0/35.0/4204.0/17.3 ms
3398 frames 118.5 seconds 28.7 fps 14.0/34.9/4179.0/17.3 ms
3398 frames 119.3 seconds 28.5 fps 14.0/35.1/4212.0/17.2 ms
3398 frames 119.0 seconds 28.6 fps 13.0/35.0/4223.0/17.2 ms
96 Threads:
3398 frames 120.5 seconds 28.2 fps 14.0/35.4/4210.0/17.4 ms
3398 frames 120.2 seconds 28.3 fps 14.0/35.4/4223.0/17.3 ms
3398 frames 119.5 seconds 28.4 fps 12.0/35.2/4251.0/17.3 ms
3398 frames 120.6 seconds 28.2 fps 14.0/35.5/4169.0/17.4 ms
3398 frames 119.6 seconds 28.4 fps 14.0/35.2/4213.0/17.2 ms
128 Threads:
3398 frames 122.3 seconds 27.8 fps 14.0/36.0/4211.0/17.5 ms
3398 frames 122.6 seconds 27.7 fps 14.0/36.1/4231.0/17.5 ms
3398 frames 122.9 seconds 27.7 fps 14.0/36.2/4224.0/17.5 ms
3398 frames 123.0 seconds 27.6 fps 15.0/36.2/4199.0/17.6 ms
3398 frames 122.4 seconds 27.8 fps 14.0/36.0/4194.0/17.4 ms
144 Threads:
3398 frames 123.7 seconds 27.5 fps 14.0/36.4/4144.0/17.6 ms
3398 frames 124.2 seconds 27.4 fps 14.0/36.6/4211.0/17.5 ms
3398 frames 124.1 seconds 27.4 fps 14.0/36.5/4200.0/17.6 ms
3398 frames 124.6 seconds 27.3 fps 14.0/36.7/4189.0/17.6 ms
3398 frames 123.9 seconds 27.4 fps 14.0/36.5/4203.0/17.6 ms
160 Threads:
3398 frames 125.6 seconds 27.1 fps 14.0/37.0/4262.0/17.7 ms
3398 frames 125.9 seconds 27.0 fps 14.0/37.1/4224.0/17.7 ms
3398 frames 125.6 seconds 27.1 fps 15.0/37.0/4200.0/17.7 ms
3398 frames 125.9 seconds 27.0 fps 14.0/37.1/4253.0/17.8 ms
3398 frames 126.3 seconds 26.9 fps 15.0/37.2/4284.0/17.7 ms
176 Threads:
3398 frames 128.2 seconds 26.5 fps 15.0/37.7/4207.0/17.8 ms
3398 frames 128.0 seconds 26.6 fps 14.0/37.7/4278.0/17.7 ms
3398 frames 128.0 seconds 26.6 fps 14.0/37.7/4256.0/17.9 ms
3398 frames 128.3 seconds 26.5 fps 14.0/37.8/4300.0/17.9 ms
3398 frames 127.7 seconds 26.6 fps 15.0/37.6/4211.0/17.8 ms
It would be useful to get test results with 8 RAM modules for maximum memory bandwidth.
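Comparing thread counts by eye across dozens of runs is error-prone, so the output can be reduced to one mean FPS figure per thread count. The following is only a sketch: it assumes the output of benchmark.sh has been saved as text in the shape shown above ("N Threads:" headers followed by timedemo result lines), and the function name and the shortened sample are illustrative rather than part of the benchmarking script.

```python
import re
from collections import defaultdict
from statistics import mean

# Shortened sample taken from the results above (first two runs of two groups).
SAMPLE = """\
16 Threads:
3398 frames 107.0 seconds 31.7 fps 13.0/31.5/4209.0/16.7 ms
3398 frames 107.0 seconds 31.8 fps 12.0/31.5/4174.0/16.4 ms
176 Threads:
3398 frames 128.2 seconds 26.5 fps 15.0/37.7/4207.0/17.8 ms
3398 frames 128.0 seconds 26.6 fps 14.0/37.7/4278.0/17.7 ms
"""

# Either a "N Threads:" header (group 1) or a result line (group 2 = average fps).
TOKEN = re.compile(r"(\d+) Threads:|\d+ frames [\d.]+ seconds ([\d.]+) fps")


def mean_fps_by_threads(text):
    """Return {thread_count: mean of the average-FPS values reported under it}."""
    runs = defaultdict(list)
    threads = None
    for match in TOKEN.finditer(text):
        if match.group(1) is not None:
            threads = int(match.group(1))
        elif threads is not None:
            runs[threads].append(float(match.group(2)))
    return {count: mean(values) for count, values in runs.items()}


if __name__ == "__main__":
    for count, fps in sorted(mean_fps_by_threads(SAMPLE).items()):
        print(f"{count:>4} threads: {fps:.2f} fps")
```

Because the pattern is matched against the whole text rather than line by line, it should also cope with output whose line breaks were lost when it was pasted.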
Improving Thread Utilization
MRs:
- llvmpipe/lavapipe: add support for overlapping vertex and fragment processing.
- optimizing llvmpipe vertex/fragment processing.
- Merged Feb 21, 2022.
- First released in Mesa 22.1.0.
- First packaged in Debian Bookworm, Fedora 36, Ubuntu 22.10.
23:49 Jeremy_Rand_Talos: In https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18415#note_1542451 , sroland said "there's just bottlenecks which makes using high number of threads a bit questionable".
23:49 Jeremy_Rand_Talos: Are those bottlenecks documented anywhere? I see a nontrivial amount of llvmpipe CPU usage (e.g. triangle_ccw) in the main thread (as opposed to the spawned threads) via perf; is that what he meant?
23:50 airlied: Jeremy_Rand_Talos: I've removed some of those bottlenecks, but I think binning is probably the largest one remaining
23:51 Jeremy_Rand_Talos: airlied, how feasible is it to improve the remaining such bottlenecks?
23:52 Jeremy_Rand_Talos: airlied, and are there any docs (formal or informal) where I could read up on the current state of such things?
23:58 airlied: Jeremy_Rand_Talos: how much time you got? :-)
23:58 airlied: the main thing would be to find a benchmark where you care
23:58 airlied: then profile, profile, profile, and see what bottlenecks
23:59 airlied: most of the bottlenecks are memory bandwidth on fragment shader execution
23:59 airlied: those are hard to fix :-P
23:59 ajax: airlied: did we finally get overlapping vs/fs/present working?
23:59 airlied: ajax: yes it seems to definitely be bug free this time :-P
00:00 Jeremy_Rand_Talos: airlied, so, I'm from the Talos Workstation community; some of us are on up to 176 hardware threads; it would be nice to be able to actually leverage that with llvmpipe.
00:00 airlied: now the tests that have vertex heavy workloads stall out on binning a lot
00:00 airlied: Jeremy_Rand_Talos: it probably comes down to memory b
00:00 airlied: bw
00:00 ajax: then... yeah pretty much all you have left for llvmpipe perf is either smarter image layouts or reducing the cost of binning
00:01 airlied: ajax: threaded binning was a vague handwave I had
00:01 airlied: but I'm not so sure how to make that a thing
00:01 ajax: i think swr had a multiply-and-surrender approach to that
00:01 airlied: smarter image layouts would be something if you could figure out what would work on a modern CPU
00:02 airlied: like you'd have to know where things are having cacheline pains
00:02 ajax: 2x2 microtiles and page-sized macrotiles would go a long way
00:02 ajax: or rather: i don't think there's much to be gained beyond that
00:02 airlied:wonders if vmware had that at one point, and didn't see enough throughput changes
00:03 airlied: though someone suggested tiled framebuffers might be a better win
00:04 airlied: https://gitlab.freedesktop.org/mesa/mesa/-/issues/6972 has some notes
00:10 Jeremy_Rand_Talos: airlied, has it been considered whether memory bandwidth is still the bottleneck regardless of architecture? My loose understanding is that POWER9 often has better mem bw than x86_64.
00:12 airlied: Jeremy_Rand_Talos: again it depends on the load being tested, adding more threads with a complex fragment shader might show where things stop scaling
00:12 airlied: like power9 might have more mem bw, but it still might get saturated
00:13 Jeremy_Rand_Talos: airlied, "memory bandwidth on fragment shader execution" <-- what function name(s) would this show up as in perf?
00:13 airlied: Jeremy_Rand_Talos: it will show up as some JIT code
00:13 airlied: unfortunately that is another problem, getting JIT debugging going properly is kinda not there
00:14 airlied: so it's not that easy to spot where the bottlenecks are
00:14 airlied: I've never really figured out the best way to close that gap
00:14 Jeremy_Rand_Talos: airlied, ah, ok. So if I can find some bottleneck function in perf that's not marked as JIT, that would be likely to be lower-hanging fruit?
00:14 airlied: Jeremy_Rand_Talos: yes
00:15 airlied: esp if it's in the main thread not one of the side threads
00:15 Jeremy_Rand_Talos: airlied, I see, that's good info to have.
00:17 airlied: but yeah digging into anything that could make jit more debuggable might be a useful task
Vector Optimizations
It would be desirable to see the output of the perf.sh benchmark for OpenArena (see above link) on Debian Bookworm, so we can determine where bottlenecks might be in current Mesa versions.
As of 2022 June 10, grepping src/gallium/drivers/llvmpipe/ on the main branch for altivec yields only 2 files (lp_rast_tri.c and lp_setup_tri.c, both of which are POWER8 LE), while grepping for PIPE_ARCH_SSE yields 11 files. This suggests that many POWER vector optimizations are missing from LLVMpipe. POWER9 vector optimizations (the LLVM power9-vector feature) and POWER8 BE optimizations appear to be completely absent.
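The file counts above can be re-checked against a newer Mesa checkout with a small script. This is a minimal sketch, assuming a case-sensitive substring search over every regular file (roughly what a recursive grep that lists matching files would do); the helper names and the tiny stand-in tree built in demo() are hypothetical and are not real Mesa sources.

```python
import tempfile
from pathlib import Path


def files_mentioning(root, needle):
    """Return the sorted names of files under root whose contents contain needle."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Undecodable bytes are replaced so binary files do not abort the scan.
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text:
            hits.append(path.name)
    return sorted(hits)


def demo():
    """Build a three-file stand-in tree and search it for both patterns."""
    with tempfile.TemporaryDirectory() as root:
        Path(root, "a.c").write_text("#if defined(PIPE_ARCH_SSE)\n#endif\n")
        Path(root, "b.c").write_text(
            "/* altivec path */\n#if defined(PIPE_ARCH_SSE)\n#endif\n"
        )
        Path(root, "c.c").write_text("int unrelated;\n")
        return (
            files_mentioning(root, "altivec"),
            files_mentioning(root, "PIPE_ARCH_SSE"),
        )


if __name__ == "__main__":
    altivec_files, sse_files = demo()
    print(len(altivec_files), "file(s) mention altivec:", altivec_files)
    print(len(sse_files), "file(s) mention PIPE_ARCH_SSE:", sse_files)
```

In a real checkout, the call would be files_mentioning("src/gallium/drivers/llvmpipe", "altivec"), run from the top of the Mesa tree.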
lp_rast_tri.c and lp_setup_tri.c
markos looked at the existing Altivec code in LLVMpipe (as of main 2022 October 6) and observed a lot of SSE-isms, probably because the LLVMpipe Altivec code was translated from the SSE code. It is likely that rewriting the Altivec code would yield performance improvements.
calc_fixed_position
JeremyRand ran the perf.sh benchmark for OpenArena (see above link) on Debian Bullseye (with 32 threads, see above patch), and found that 0.21% of CPU time (6th-highest-ranked function) was used by triangle_ccw, which is mostly a wrapper for the inline function calc_fixed_position in lp_setup_tri.c. This happens to be the only function that uses SSE in Debian Bullseye but is missing an Altivec implementation in main branch (other SSE-utilizing functions were added to main after Debian Bullseye).
Jeremy enabled build-time SSE intrinsics translation in the calc_fixed_position function via File:LLVMpipe-Emulate-SSE-intrinsics-in-calc fixed position.patch.
When combined with the 32-thread patch from the above section, Jeremy obtained the following benchmarks in OpenArena:
32 threads without SIMD patch:
3398 frames 417.7 seconds 8.1 fps 77.0/122.9/1748.0/18.9 ms
3398 frames 417.9 seconds 8.1 fps 76.0/123.0/997.0/19.6 ms
3398 frames 419.9 seconds 8.1 fps 76.0/123.6/806.0/19.8 ms
3398 frames 422.7 seconds 8.0 fps 75.0/124.4/758.0/21.1 ms
3398 frames 419.1 seconds 8.1 fps 75.0/123.3/730.0/20.4 ms
32 threads with SIMD patch:
3398 frames 418.5 seconds 8.1 fps 76.0/123.2/865.0/19.7 ms
3398 frames 414.7 seconds 8.2 fps 74.0/122.1/805.0/19.7 ms
3398 frames 418.8 seconds 8.1 fps 74.0/123.2/701.0/20.0 ms
3398 frames 424.0 seconds 8.0 fps 77.0/124.8/524.0/21.7 ms
3398 frames 414.1 seconds 8.2 fps 74.0/121.9/621.0/19.2 ms
This seems like an improvement, though a very small one: the mean total time drops by roughly 0.3%, which is less than the spread between individual runs.
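To judge whether a difference this small is meaningful, the change in mean total time can be set against the run-to-run spread. The sketch below uses the ten total times from the two result sets above; treating the larger sample standard deviation as the noise level is a crude assumption of this example, not a formal significance test, and the function name is illustrative.

```python
from statistics import mean, stdev

# Total timedemo durations in seconds, copied from the two result sets above.
WITHOUT_PATCH = [417.7, 417.9, 419.9, 422.7, 419.1]
WITH_PATCH = [418.5, 414.7, 418.8, 424.0, 414.1]


def compare(baseline, candidate):
    """Return (change in mean seconds, larger of the two sample std deviations)."""
    diff = mean(candidate) - mean(baseline)
    noise = max(stdev(baseline), stdev(candidate))
    return diff, noise


if __name__ == "__main__":
    diff, noise = compare(WITHOUT_PATCH, WITH_PATCH)
    print(f"mean change: {diff:+.2f} s, run-to-run std dev up to {noise:.2f} s")
    if abs(diff) < noise:
        print("the change is smaller than the run-to-run spread")
```

A negative change means the patched build finished the timedemo faster on average. More runs per configuration would be needed before drawing a firm conclusion either way.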
Jeremy checked the function size using nm:
nm -S --size-sort -t d ./build/src/gallium/drivers/llvmpipe/libllvmpipe.a.p/lp_setup_tri.c.o | grep ' triangle_ccw'
# Without SIMD patch:
0000000000000000 0000000000000736 t triangle_ccw
# With SIMD patch:
0000000000000000 0000000000000604 t triangle_ccw
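Since nm was invoked with -t d, the second column is the symbol size in decimal bytes, so the two lines can be compared directly. A small sketch of that arithmetic follows (the helper names are illustrative):

```python
def nm_symbol_size(nm_line):
    """Parse one line of `nm -S -t d` output and return (symbol, size in bytes)."""
    _address, size, _symbol_type, symbol = nm_line.split()
    return symbol, int(size, 10)


def size_reduction(before_line, after_line):
    """Return (bytes saved, percentage saved) between two nm lines for one symbol."""
    name_before, before = nm_symbol_size(before_line)
    name_after, after = nm_symbol_size(after_line)
    if name_before != name_after:
        raise ValueError("lines describe different symbols")
    saved = before - after
    return saved, saved / before * 100


# The two lines reported above for triangle_ccw.
BEFORE = "0000000000000000 0000000000000736 t triangle_ccw"
AFTER = "0000000000000000 0000000000000604 t triangle_ccw"

if __name__ == "__main__":
    saved, percent = size_reduction(BEFORE, AFTER)
    print(f"triangle_ccw shrank by {saved} bytes ({percent:.1f}%)")
```

For the figures above this works out to 132 bytes, or about 17.9% of the unpatched function. A smaller function is not necessarily a faster one, so this only confirms that the patch changed the generated code.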
Nashimus tested the above lp_setup_tri.c patch in OpenArena with different thread counts:
16 threads:
3398 frames 312.5 seconds 10.9 fps 55.0/92.0/8375.0/21.2 ms
16 threads and SIMD patch:
3398 frames 309.1 seconds 11.0 fps 54.0/91.0/8421.0/20.9 ms
64 threads:
3398 frames 221.5 seconds 15.3 fps 33.0/65.2/8390.0/21.4 ms
64 threads and SIMD patch:
3398 frames 222.3 seconds 15.3 fps 35.0/65.4/8464.0/21.4 ms
144 threads:
3398 frames 208.5 seconds 16.3 fps 30.0/61.4/8364.0/21.8 ms
144 threads and SIMD patch:
3398 frames 208.7 seconds 16.3 fps 30.0/61.4/8327.0/21.8 ms
Jeremy obtained the following apitrace benchmarks, indicating that the patch does not appear to produce any measurable performance change in OpenArena:
$ ./benchmark-apitrace-openarena.sh
Benchmarks should take around 4 hours to run, be patient!
CPU(s): 32
Memory:
WARNING: you should run this program as super-user.
size: 8GiB
size: 64GiB
size: 16GiB
size: 64GiB
size: 8GiB
size: 64GiB
size: 16GiB
size: 8GiB
WARNING: output may be incomplete or inaccurate, you should run this program as super-user.
OpenGL version string: 4.5 (Compatibility Profile) Mesa 22.2.3
LP_MAX_THREADS: 16
16 Threads:
control
Rendered 3436 frames in 248.303 secs, average of 13.8379 fps
experimental
Rendered 3436 frames in 247.75 secs, average of 13.8688 fps
control
Rendered 3436 frames in 247.058 secs, average of 13.9077 fps
experimental
Rendered 3436 frames in 248.886 secs, average of 13.8055 fps
control
Rendered 3436 frames in 247.324 secs, average of 13.8927 fps
experimental
Rendered 3436 frames in 248.16 secs, average of 13.8459 fps
control
Rendered 3436 frames in 247.34 secs, average of 13.8918 fps
experimental
Rendered 3436 frames in 247.173 secs, average of 13.9012 fps
control
Rendered 3436 frames in 246.754 secs, average of 13.9248 fps
experimental
Rendered 3436 frames in 246.935 secs, average of 13.9146 fps
control
Rendered 3436 frames in 246.852 secs, average of 13.9193 fps
experimental
Rendered 3436 frames in 247.166 secs, average of 13.9016 fps
control
Rendered 3436 frames in 247.431 secs, average of 13.8867 fps
experimental
Rendered 3436 frames in 247.036 secs, average of 13.9089 fps
control
Rendered 3436 frames in 245.538 secs, average of 13.9937 fps
experimental
Rendered 3436 frames in 256.544 secs, average of 13.3934 fps
control
Rendered 3436 frames in 246.85 secs, average of 13.9194 fps
experimental
Rendered 3436 frames in 246.722 secs, average of 13.9266 fps
control
Rendered 3436 frames in 247.268 secs, average of 13.8958 fps
experimental
Rendered 3436 frames in 248.315 secs, average of 13.8372 fps
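The interleaved control and experimental runs can be averaged per label to back up the "no measurable change" reading. This is a sketch under the assumption that each label is immediately followed by its "Rendered ... fps" result, as in the output above; the function name and the two-pair sample are illustrative.

```python
import re
from collections import defaultdict
from statistics import mean

# First two control/experimental pairs from the output above.
SAMPLE = (
    "control Rendered 3436 frames in 248.303 secs, average of 13.8379 fps "
    "experimental Rendered 3436 frames in 247.75 secs, average of 13.8688 fps "
    "control Rendered 3436 frames in 247.058 secs, average of 13.9077 fps "
    "experimental Rendered 3436 frames in 248.886 secs, average of 13.8055 fps"
)

# \s+ between label and result accepts either a space or a line break.
RESULT = re.compile(
    r"(control|experimental)\s+Rendered \d+ frames in [\d.]+ secs, "
    r"average of ([\d.]+) fps"
)


def mean_fps_by_label(text):
    """Return {"control": mean fps, "experimental": mean fps} for replay output."""
    fps = defaultdict(list)
    for label, value in RESULT.findall(text):
        fps[label].append(float(value))
    return {label: mean(values) for label, values in fps.items()}


if __name__ == "__main__":
    for label, value in mean_fps_by_label(SAMPLE).items():
        print(f"{label:>12}: {value:.4f} fps")
```

Alternating the two builds, as the benchmark output does, helps keep slow drift (thermal or background load) from favouring one side.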
lp_linear_fastpath.c
It would be useful to find test cases that exercise lp_linear_fastpath.c
(2D fastpath of LLVMpipe), so that we can run benchmarks of porting it to POWER9.
The fastpath does not appear to be exercised by Falcon-mkxp as of Fedora 36 on x86_64 according to tests by Jeremy.
According to DemiMarie and Dave Airlie, the fastpath may be exercised by GTK4 apps, but is not currently exercised by GNOME Shell [3]:
00:49 Jeremy_Rand_Talos__: Is there a recommended way to benchmark the LLVMpipe 2D fastpath that was introduced in MR !11969 ? Preferably a GNU/Linux application that I can run through apitrace.
00:52 Jeremy_Rand_Talos__: The MR text mentions various Windows things (I prefer Linux), desktop compositors (seems hard to run through apitrace?), HTML browsers (ditto due to sandboxing), etc. I assume I can't be the first person who wants to do such testing, so presumably there's already more info about what apps are suitable that I failed to find?
01:57 DemiMarie: GTK4 apps
02:51 airlied: Jeremy_Rand_Talos: not really, lots of things use it for some paths like blits and copies
02:52 airlied: id like to get gnome shell using it, but depth buffers stops it
04:54 DemiMarie: GTK4 uses OpenGL for rendering
Other linear rasterizer functions
Jeremy found that GNOME Maps (in Fedora 37) does exercise some of the linear rasterizer functions, so it may be a candidate for benchmarking LLVMpipe SIMD improvements. However, its usage seems relatively small, so it may not yield a large performance bump.
Strangely, on Fedora 37 x86_64, the various lp_linear functions show up in perf output when running perf directly on GNOME Maps, but they do not show up when replaying an apitrace. This probably poses a problem for benchmarking.
Other functions
It would be desirable to create and test similar patches for SSE-based non-Altivec functions added between Debian Bullseye and current main.
Nashimus tested File:LLVMpipe-SIMD-bookworm.patch in OpenArena with different thread counts. Starting from Mesa commit b91971c2 on Ubuntu 22.04:
32 threads (14.92 avg fps):
3398 frames 224.7 seconds 15.1 fps 36.0/66.1/2483.0/14.4 ms
3398 frames 229.7 seconds 14.8 fps 38.0/67.6/1533.0/15.1 ms
3398 frames 227.6 seconds 14.9 fps 39.0/67.0/1522.0/13.9 ms
3398 frames 227.9 seconds 14.9 fps 39.0/67.1/1533.0/14.1 ms
3398 frames 227.7 seconds 14.9 fps 39.0/67.0/1551.0/14.2 ms
32 threads and SIMD patch (14.84 avg fps):
3398 frames 230.6 seconds 14.7 fps 40.0/67.9/2080.0/18.1 ms
3398 frames 230.1 seconds 14.8 fps 39.0/67.7/1711.0/14.7 ms
3398 frames 227.0 seconds 15.0 fps 38.0/66.8/1485.0/14.7 ms
3398 frames 230.2 seconds 14.8 fps 36.0/67.7/1482.0/14.8 ms
3398 frames 228.7 seconds 14.9 fps 40.0/67.3/1522.0/14.6 ms
144 threads (17.68 avg fps):
3398 frames 193.9 seconds 17.5 fps 31.0/57.1/2141.0/16.1 ms
3398 frames 191.3 seconds 17.8 fps 30.0/56.3/1743.0/14.4 ms
3398 frames 191.5 seconds 17.7 fps 30.0/56.4/1526.0/14.6 ms
3398 frames 192.3 seconds 17.7 fps 30.0/56.6/1528.0/14.0 ms
3398 frames 191.8 seconds 17.7 fps 29.0/56.4/1545.0/13.7 ms
144 threads and SIMD patch (17.42 avg fps):
3398 frames 206.8 seconds 16.4 fps 29.0/60.9/8731.0/21.7 ms
3398 frames 195.5 seconds 17.4 fps 30.0/57.5/1710.0/14.9 ms
3398 frames 190.6 seconds 17.8 fps 30.0/56.1/1538.0/14.6 ms
3398 frames 191.6 seconds 17.7 fps 31.0/56.4/1587.0/14.5 ms
3398 frames 191.4 seconds 17.8 fps 31.0/56.3/1510.0/13.6 ms
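The first 144-thread run with the SIMD patch contains an 8731 ms frame and pulls that group's average down, so it may be worth looking at the median alongside the mean. The sketch below uses the per-run FPS figures from the two 144-thread groups above; preferring the median here is a judgment call made for this example, not something the benchmark script does.

```python
from statistics import mean, median

# Per-run average FPS for the two 144-thread groups above.
CONTROL_144 = [17.5, 17.8, 17.7, 17.7, 17.7]
SIMD_144 = [16.4, 17.4, 17.8, 17.7, 17.8]


def summarize(fps_runs):
    """Return (mean, median) of a list of per-run FPS values."""
    return mean(fps_runs), median(fps_runs)


if __name__ == "__main__":
    for label, runs in (("144 threads", CONTROL_144), ("144 threads + SIMD", SIMD_144)):
        avg, mid = summarize(runs)
        print(f"{label:>20}: mean {avg:.2f} fps, median {mid:.1f} fps")
```

The means reproduce the 17.68 and 17.42 figures quoted in the group headers, while both medians come out at 17.7 fps, which suggests the apparent regression with the patch is largely due to that one slow run.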