Troubleshooting/GPU
Latest revision as of 21:48, 8 July 2023
Contents
- 1 Background
- 2 Common Issues
- 2.1 Bootloader does not show up on monitor(s) attached to a discrete GPU
- 2.2 My AMD GPU works in petitboot but not the subsequent Linux OS
- 2.3 I want Petitboot via AST but the subsequent Linux OS console on a discrete GPU
- 2.4 Xorg will not start / crashes when a discrete GPU is installed
- 2.5 Monitor not detected in kernel 4.17+
- 2.6 Xorg crashes or is laggy with the AST VGA GPU
- 2.7 KDE is laggy with the AST VGA GPU
- 2.8 Wayland (GNOME) freeze after boot with the AST VGA GPU
- 2.9 Display stuck at default low resolution with AST HDMI GPU
- 2.10 AMDGPU driver crashes after firmware update
- 2.11 Kernel 5.14 and above
- 3 Notes
- 4 See also
Background
Because OpenPOWER systems do not have a legacy graphics interface to fall back to, and as a result rely heavily on the running operating system and its drivers to handle display tasks, a few rough edges are exposed. This page attempts to document the current status of these rough edges and suggested workarounds pending actual fixes.
Common Issues
Bootloader does not show up on monitor(s) attached to a discrete GPU
Most modern discrete GPUs require firmware. As Talos™ II is aimed at a security-conscious audience, we do not currently include GPU firmware in the production firmware images. Instructions are available in the Users Guide to add firmware for your GPU to the PNOR if needed. Note that any added firmware may be able to access and modify data associated with the affected device(s); we strongly recommend you perform a security risk analysis before loading any firmware, and select open firmware where/if it is available.
If you are using a GPU that does not require firmware, or have already added any needed firmware files to the host PNOR, please ensure that the on-board VGA disable jumper (J10109) is capped. The bootloader output will preferentially show up on the on-board VGA port if it remains enabled.
Alternatively, you can use either a serial console or a VGA monitor / adapter to interact with the bootloader.
My AMD GPU works in petitboot but not the subsequent Linux OS
Older versions of the amdgpu driver (Linux 4.15 and below) have a bug where the connected outputs will not re-initialize after a kexec() while the driver is loaded. Kernel 4.16 and above does not appear to have this problem.
If you need to use kernel 4.15 or below, you can work around this issue by either:
- Enabling and using the VGA video output to access the bootloader (petitboot) -or-
- using a serial connection to control petitboot, and running the following commands prior to selecting an operating system via the petitboot menu:
echo 0 > /sys/class/vtconsole/vtcon1/bind
rmmod amdgpu
I want Petitboot via AST but the subsequent Linux OS console on a discrete GPU
If you don't want to put GPU firmware in the PNOR but still want the Linux tty on the discrete graphics, you'll find that output always appears on the AST first. Blacklisting the ast module from loading is not sufficient on its own; you will need two kernel boot arguments:
modprobe.blacklist=ast video=offb:off
For example, on Ubuntu this can be accomplished by changing GRUB_CMDLINE_LINUX in /etc/default/grub accordingly, like
GRUB_CMDLINE_LINUX="modprobe.blacklist=ast video=offb:off"
and then running update-grub.
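After rebooting, it can be useful to confirm that both arguments actually reached the running kernel. The following is a minimal sketch, not part of the original instructions: the helper name has_kernel_arg is hypothetical, and it only relies on /proc/cmdline, the standard Linux interface that exposes the booted command line.

```shell
#!/bin/sh
# has_kernel_arg ARG CMDLINE
# Succeeds if ARG appears as a whole, space-separated word in CMDLINE.
# (Hypothetical helper name, used for illustration only.)
has_kernel_arg() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the command line of the running kernel (empty if unavailable).
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

for arg in modprobe.blacklist=ast video=offb:off; do
    if has_kernel_arg "$arg" "$cmdline"; then
        echo "$arg: present"
    else
        echo "$arg: missing"
    fi
done
```

If either argument is reported as missing, the GRUB configuration was probably not regenerated or a different boot entry was used.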
Tell GDM to ignore a GPU
An alternative is to tell gdm (other login managers will likely also work) to ignore the ASPEED GPU. First, you need to know the PCI bus information for the ASPEED GPU:
root@talos:~# lspci | grep VGA
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7)
0005:02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
Here we want to ignore the ASPEED GPU (you can use the same trick to ignore another GPU, for instance to reserve it for compute). Create a file in /etc/udev/rules.d/ (72-gdm-ignore-gpus.rules, for instance) with the following content:
TAG-="seat", ENV{ID_FOR_SEAT}=="drm-pci-0005_02_00_0"
TAG-="seat", ENV{ID_FOR_SEAT}=="graphics-pci-0005_02_00_0"
Don't forget to use the proper PCI bus information for your system (here 0005_02_00_0, derived from the lspci address 0005:02:00.0).
With this method you can keep the ASPEED GPU as a console GPU while running a graphical session on the discrete GPU. It can also be used to exclude a GPU from graphics, for instance to dedicate it to GPU compute.
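The mapping from the lspci address to the identifier used in the rules can be sketched in shell. This is only an illustration: the function name gdm_ignore_rules is hypothetical, and the identifier format (colons and the dot replaced by underscores) is inferred from the example on this page rather than from udev documentation.

```shell
#!/bin/sh
# gdm_ignore_rules ADDRESS
# Print the two udev rule lines for the GPU at the given lspci address
# (domain:bus:device.function, e.g. 0005:02:00.0).
# Hypothetical helper; the ID format is inferred from the example above.
gdm_ignore_rules() {
    # Replace ':' and '.' with '_' : 0005:02:00.0 -> 0005_02_00_0
    id=$(printf '%s' "$1" | tr ':.' '__')
    printf 'TAG-="seat", ENV{ID_FOR_SEAT}=="drm-pci-%s"\n' "$id"
    printf 'TAG-="seat", ENV{ID_FOR_SEAT}=="graphics-pci-%s"\n' "$id"
}

# Example: rules for the ASPEED GPU from the lspci output above.
gdm_ignore_rules 0005:02:00.0
```

The output can be redirected into the rules file once you have checked that the address matches your own lspci output.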
Xorg will not start / crashes when a discrete GPU is installed
Installing more than one GPU into an OpenPOWER system (for instance, when adding a discrete GPU) exposes all GPUs directly to the operating system -- there is no concept of a "primary" GPU like there is on x86. Xorg does not handle this gracefully, tending to crash during autoconfiguration. At least one bug report has been filed (https://bugs.freedesktop.org/show_bug.cgi?id=94166), but fixing the root cause of this issue (incorrect Xorg drivers binding to underlying DRM devices) does not seem to be an Xorg priority. Furthermore, Xorg does not properly handle PCI domains during autoconfiguration, per another bug report on a similar issue (https://bugs.freedesktop.org/show_bug.cgi?id=98524#c2). Community effort in getting proper fixes into Xorg would be very useful, as the Xorg developers may want to see that more than one or two systems are impacted by these bugs before working on resolving them.
Three workarounds are available:
Workaround 1: Switch to Wayland
Wayland often handles multiple-GPU setups more gracefully than Xorg.
Workaround 2: Disable on-board VGA
Disable the on-board VGA output via the VGA disable jumper, J10109. See the Users Guide for additional information.
Workaround 3: Select desired GPU at runtime
The workaround to keep both devices active, or to retain the ability to switch in the active operating system, is fairly simple, and consists of explicitly assigning Xorg drivers for each installed GPU. For this example we'll show how to fix Xorg on Debian with an AMD WX7100 discrete GPU installed.
Step 1: Locate Bus Numbers
root@talos:~# lspci | grep VGA
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon Pro WX 7100]
0005:02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
Note the numbers to the left of the "VGA compatible controller" string. Each of these numbers is the PCI d:B:D.F[note 1] number of the GPU, and is unique to the slot(s) you have your GPU(s) installed in. As a result of this slot dependence, bus IDs may differ from those shown in this example; always use your bus IDs when following the steps below. This slot dependence means that if you move your GPU to a different slot you will need to update the bus ID associated with that GPU.
Step 2: Create Xorg Configuration Snippet
root@talos:~# mkdir -p /etc/X11/xorg.conf.d
Create and open /etc/X11/xorg.conf.d/21-gpu-driver.conf for editing, then adjust the following template with your GPU information. Pay close attention to the BusID and Driver fields, as they must match your installed GPU(s). Note that Xorg uses decimal numbering, not hexadecimal like lspci, so you will need to convert the numbers between the colons of the lspci output to decimal in order to construct a valid Xorg BusID. Furthermore, Xorg doesn't use leading zeroes like lspci does; these must be stripped off when assembling the Xorg BusID. Finally, Xorg expects a BusID assembled as "PCI:B@d:D:F" (note that Bus and Domain are swapped), not the d:B:D.F format shown by lspci.
# AST2500
Section "Device"
    Identifier "GPU0"
    Driver "modesetting"
    BusID "PCI:2@5:0:0"
    VendorName "ASpeed Corporation"
EndSection

# WX7100
Section "Device"
    Identifier "GPU1"
    Driver "modesetting" # or amdgpu if you have xf86-video-amdgpu installed
    BusID "PCI:1@0:0:0"
    VendorName "AMD Corporation"
EndSection

# this is absolutely necessary, it tells xorg which GPU to use for the screen
Section "Screen"
    Identifier "Screen0"
    Device "GPU1"
EndSection
Save and exit the configuration snippet file, then restart Xorg. Your GPUs should now function as intended. If Xorg still does not start, make sure that the appropriate kernel driver (such as amdgpu for the example above) has been loaded; keep in mind that the Xorg driver and the kernel driver are separate and distinct:
root@talos:~# modprobe amdgpu
You can use the generic modesetting Xorg driver for AMD GPUs, or you can use amdgpu from xf86-video-amdgpu. The generic modesetting driver has been reported to work perfectly fine on a Talos with various GPUs, so there is likely no practical reason to use the driver-specific DDX.
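The BusID conversion described in Step 2 (hexadecimal to decimal, leading zeroes stripped, Bus and Domain swapped) can be sketched as a small shell function. The function name lspci_to_xorg_busid is hypothetical and is not part of any tool; the sketch assumes the full domain:bus:device.function address form shown in the lspci output above.

```shell
#!/bin/sh
# lspci_to_xorg_busid ADDRESS
# Convert an lspci address (domain:bus:device.function, hexadecimal)
# into an Xorg BusID string (PCI:bus@domain:device:function, decimal).
# Hypothetical helper, for illustration only.
lspci_to_xorg_busid() {
    addr=$1
    domain=${addr%%:*}   # 0005:02:00.0 -> 0005
    rest=${addr#*:}      # 0005:02:00.0 -> 02:00.0
    bus=${rest%%:*}      # 02:00.0      -> 02
    devfn=${rest#*:}     # 02:00.0      -> 00.0
    dev=${devfn%%.*}     # 00.0         -> 00
    fn=${devfn#*.}       # 00.0         -> 0
    # A 0x prefix makes printf read each field as hexadecimal;
    # %d prints it as decimal without leading zeroes.
    printf 'PCI:%d@%d:%d:%d\n' "0x$bus" "0x$domain" "0x$dev" "0x$fn"
}

# The two GPUs from the lspci output in Step 1:
lspci_to_xorg_busid 0005:02:00.0   # AST2500
lspci_to_xorg_busid 0000:01:00.0   # WX7100
```

For the two addresses from Step 1 this yields the same BusID values used in the template, PCI:2@5:0:0 and PCI:1@0:0:0. The hexadecimal conversion only changes the result for values of 0a and above; a bus number of 1a, for example, becomes 26.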
Alternative Xorg configuration using OutputClass and PrimaryGPU
As an alternative to specifying the association between Screen and Device, you can use an OutputClass section to tell X that the discrete GPU should be used as the primary GPU.
Section "OutputClass"
    Identifier "AMD discrete GPU"
    MatchDriver "amdgpu"
    Option "PrimaryGPU" "yes"
EndSection
Step 3 (optional): Disable Integrated Video
To disable the ASpeed VGA in the booted OS completely, you can use the modprobe.blacklist=ast approach on the kernel command line; refer to the "I want Petitboot via AST but the subsequent Linux OS console on a discrete GPU" section above for more information. This method works on all distributions. The ASpeed VGA will still show up in lspci afterwards, which is normal, as you haven't disabled the hardware, just the driver.
With this done, it should be possible to remove the device section in the X.Org configuration file for the onboard VGA, but you can also just leave it there if you want, regardless of whether the driver is loaded or not.
There are alternative ways to blacklist the ast kernel driver. For example, on Debian-based systems, create a new file /etc/modprobe.d/ast-blacklist.conf and place the following line inside the new file:
blacklist ast
You may need to reboot if the ast DRM driver has already loaded. Alternatively, you may try to unbind and unload the ast driver as follows (assuming the ast driver is bound to vtcon0):
root@talos:~# echo 0 > /sys/class/vtconsole/vtcon0/bind
root@talos:~# rmmod ast
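Which vtcon index is the right one can differ between systems (the petitboot example earlier on this page uses vtcon1). As a hedged sketch, the following lists each virtual console backend with its description and bind state, read from the name and bind files in sysfs. The function name list_vtconsoles and its optional directory argument are hypothetical conveniences; the name shows the console type, not the underlying DRM driver.

```shell
#!/bin/sh
# list_vtconsoles [DIR]
# Print one line per console backend: index, description, bind state.
# DIR defaults to /sys/class/vtconsole; the argument exists only so the
# function can be tried against a test directory. Hypothetical helper.
list_vtconsoles() {
    root=${1:-/sys/class/vtconsole}
    for dir in "$root"/vtcon*; do
        # Skip entries without a readable name (also covers an unmatched glob).
        [ -r "$dir/name" ] || continue
        printf '%s: %s (bind=%s)\n' \
            "${dir##*/}" \
            "$(cat "$dir/name")" \
            "$(cat "$dir/bind" 2>/dev/null)"
    done
}

list_vtconsoles
```

An entry described as a frame buffer device with bind=1 is typically the one to unbind before removing the graphics driver module.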
Monitor not detected in kernel 4.17+
Petitboot shows up fine, but there is no output for the host OS. This has been reported as bug 107049; the workaround is to append amdgpu.dc=0 to the kernel parameters. This is often associated with a host dmesg trace of
[drm] Cannot find any crtc or sizes
EDIT 2019-05-02: It seems the dc=0 workaround is no longer required; tested with 5.1-rc7.
Xorg crashes or is laggy with the AST VGA GPU
Xorg seems to enable GLAMOR by default on many older operating systems (such as Testing versions of Debian Buster). GLAMOR is a translation layer that converts 2D graphics operations to 3D graphics operations. This makes sense when 3D GPU acceleration is available, but when using a simple unaccelerated 2D GPU like the AST VGA GPU, the result is that 2D operations get converted to 3D operations by GLAMOR and are then converted back to 2D by llvmpipe, which introduces significant overhead.
In addition, on Testing versions of Debian Buster, GLAMOR has been observed to crash when used in conjunction with llvmpipe.
You can disable GLAMOR by saving the following text file as /usr/share/X11/xorg.conf.d/00-noglamoregl.conf:
Section "Device"
    Identifier "nogpu"
    Driver "modesetting"
    Option "Accelmethod" "none"
EndSection

Section "Module"
    Disable "glamoregl"
EndSection
This bug was fixed by commit 1e3c5d614ee33d9eac1d2cf6366feeb8341fc0f4 in upstream Xorg, which was first tagged as 1.20.2; the fixed version first entered Debian Buster on 31 Oct 2018.
KDE is laggy with the AST VGA GPU
KDE's default compositor uses OpenGL. This makes sense when 3D GPU acceleration is available, but when using a simple unaccelerated 2D GPU like the AST VGA GPU, the result is that 2D operations get converted to 3D operations by KDE's compositor and are then converted back to 2D by llvmpipe, which introduces significant overhead.
To fix this, go to System Settings → Hardware → Display and Monitor → Compositor, and select XRender as the Rendering backend. You'll probably also want to select Smooth (slower) as the Scale method (it's still much faster than OpenGL, and it looks quite a bit better).
Wayland (GNOME) freeze after boot with the AST VGA GPU
If you get a grey screen with the mouse pointer frozen, you have to boot the rescue installer (from the install media). Open the file /etc/gdm/custom.conf and uncomment WaylandEnable=false (source: "Fedora 31 mini-review on the Blackbird and Talos II", https://www.talospace.com/2019/11/fedora-31-mini-review-on-blackbird-and.html). ClassicHasClass suspects that this is a gdm bug rather than a Wayland bug; there is no upstream bug report yet.
Display stuck at default low resolution with AST HDMI GPU
As of 05/24/2019, upstream Linux kernels do not have driver support for the IT66121FN HDMI transceiver. This is being actively worked on by Raptor Computing Systems and the larger ppc64el community. Until support is added, you will need to force the correct resolution in Xorg. The general process for discovering bus IDs etc. is detailed above in the AMD GPU section ("Xorg will not start / crashes when a discrete GPU is installed"); you will need to extend the result with custom modelines as shown below:
# AST2500
Section "Device"
    Identifier "GPU0"
    Driver "modesetting"
    BusID "PCI:2@5:0:0"
    VendorName "ASpeed Corporation"
EndSection

# configure as appropriate for your monitor -- a standard 1080p screen is assumed below
Section "Monitor"
    Identifier "Monitor0"
    # Comment the following two lines with a leading # if you want to enable the 1920 x 1200 resolution too
    HorizSync 30.0-70.0
    VertRefresh 50.0-70.0
    Modeline "1920x1080" 172.80 1920 2040 2248 2576 1080 1081 1084 1118 -HSync +Vsync
    # 1920x1200 59.88 Hz (CVT 2.30MA) hsync: 74.56 kHz; pclk: 193.25 MHz
    Modeline "1920x1200" 193.25 1920 2056 2256 2592 1200 1203 1209 1245 -hsync +vsync
EndSection

# this is absolutely necessary, it tells xorg which GPU to use for the screen
Section "Screen"
    Identifier "Screen0"
    Monitor "Monitor0"
    Device "GPU0"
    DefaultDepth 24
    SubSection "Display"
        Depth 24
        # Prefers by default the full HD resolution but you can switch to 1920 x 1200 in your Linux desktop then.
        # If you swap the two modeline identifier below 1920 x 1200 will become the preferred resolution.
        # You also have to comment the HorizSync and VertRefresh lines above since they limit your monitor to full HD!
        Modes "1920x1080" "1920x1200"
    EndSubSection
EndSection
Note: You can generate a new Modeline string in the console with the cvt command, e.g. cvt 1920 1200 60 for a resolution of 1920 x 1200 at 60 Hz.
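The mode names in the configuration above are plain resolutions such as "1920x1200", whereas cvt is assumed here to emit a name with a refresh-rate suffix (for example "1920x1200_60.00"). As a minimal sketch under that assumption, the following filter keeps only the Modeline line and strips the suffix so the name matches the Modes entry; the function name modeline_from_cvt is hypothetical.

```shell
#!/bin/sh
# modeline_from_cvt
# Read cvt output on stdin and print only the Modeline line, with the
# "_<refresh>" suffix removed from the quoted mode name.
# Hypothetical helper; assumes names of the form "WxH_RATE".
modeline_from_cvt() {
    sed -n 's/^Modeline "\([^"_]*\)_[^"]*"/Modeline "\1"/p'
}

# Only call cvt if it is installed on this system.
if command -v cvt >/dev/null 2>&1; then
    cvt 1920 1200 60 | modeline_from_cvt
fi
```

The resulting line can be pasted into the Monitor section; the timing numbers themselves are left untouched.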
AMDGPU driver crashes after firmware update
The GPU only allows loading one firmware after an ASIC reset, so the firmware used by the skiroot kernel and the host kernel must be the same. See FreeDesktop.org bug 108585 for more details.
Note: This is theoretically fixed in kernel 5.1, see this commit. Needs confirmation.
Kernel 5.14 and above
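Starting with kernel 5.14.8, you have to set amdgpu.aspm=0 on the kernel command line in GRUB for some AMD GPUs (such as Polaris, Vega, and Navi); see https://gitlab.freedesktop.org/drm/amd/-/issues/1723 for details. A patch has been included since kernel 5.15.54.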
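By analogy with the Ubuntu example earlier on this page, the parameter could be set in /etc/default/grub as sketched below. This assumes your distribution uses GRUB_CMDLINE_LINUX and update-grub; if the line already contains other arguments, add amdgpu.aspm=0 to them rather than replacing them.

```shell
# /etc/default/grub (fragment)
# Assumption: no other kernel arguments are needed on this line;
# otherwise append amdgpu.aspm=0 to the existing value.
GRUB_CMDLINE_LINUX="amdgpu.aspm=0"
```

Then regenerate the GRUB configuration (update-grub on Ubuntu) and reboot.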
Notes
- ↑ PCI Domain:Bus:Device.Function