Difference between revisions of "Troubleshooting/GPU"

From RCS Wiki
Revision as of 16:30, 22 April 2018


Background

Because OpenPOWER systems do not have a legacy graphics interface to fall back to, and as a result rely heavily on the running operating system and its drivers to handle display tasks, a few rough edges are exposed. This page attempts to document the current status of these rough edges and suggested workarounds pending actual fixes.

Common Issues

My AMD GPU works in petitboot but not the subsequent Linux OS

Older versions of the amdgpu driver (Linux 4.15 and below) have a bug where the connected outputs will not re-initialize after a kexec() while the driver is loaded. Kernels 4.16 and above do not appear to have this problem.

If you need to use kernel 4.15 or below, you can work around this issue by either:

  • Enabling and using the VGA video output to access the bootloader (petitboot) -or-
  • using a serial connection to control petitboot, and running the following commands prior to selecting an operating system via the petitboot menu:
echo 0 > /sys/class/vtconsole/vtcon1/bind
rmmod amdgpu
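
As a rough sketch of the version cutoff described above, the following helper reports whether a kernel release string is 4.15 or older. The function name is hypothetical, and it assumes the release string begins with "major.minor" as `uname -r` normally prints it.

```shell
# Hypothetical helper (not part of petitboot or the kernel): succeeds when
# the given kernel release is 4.15 or older. Defaults to the running kernel.
needs_amdgpu_kexec_workaround() {
    release=${1:-$(uname -r)}     # e.g. "4.15.0-3-powerpc64le"
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # everything after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -lt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -le 15 ]; }
}
```

For example, `needs_amdgpu_kexec_workaround 4.15.0-3-powerpc64le && echo "apply the workaround"` would print the reminder, while a 4.16 release string would not.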

Xorg will not start / crashes when a discrete GPU is installed

Installing more than one GPU into an OpenPOWER system (for instance, when adding a discrete GPU) exposes all GPUs directly to the operating system -- there is no concept of a "primary" GPU like there is on x86. Xorg does not handle this gracefully, tending to crash during autoconfiguration. At least one bug report has been filed (https://bugs.freedesktop.org/show_bug.cgi?id=94166), but fixing the root cause of this issue (incorrect Xorg drivers binding to underlying DRM devices) does not seem to be an Xorg priority. Furthermore, Xorg does not properly handle PCI domains during autoconfiguration, per another bug report on a similar issue (https://bugs.freedesktop.org/show_bug.cgi?id=98524#c2). Community effort in getting proper fixes into Xorg would be very useful, as the Xorg developers may want to see that more than one or two systems are impacted by these bugs before working on resolving them.

Two workarounds are available:

Workaround 1: Disable on-board VGA

Disable the on-board VGA output via the VGA disable jumper, J10109. See the Users Guide for additional information.

Workaround 2: Select desired GPU at runtime

To keep both devices active, or to retain the ability to switch between them from the running operating system, explicitly assign an Xorg driver to each installed GPU. This example shows how to fix Xorg on Debian with an AMD WX7100 discrete GPU installed.

Step 1: Locate Bus Numbers
root@talos:~# lspci | grep VGA
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon Pro WX 7100]
0005:02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)

Note the numbers to the left of the "VGA compatible controller" string. Each is the PCI d:B:D.F[note 1] number of a GPU, and is determined by the slot in which that GPU is installed. Because of this slot dependence, your bus IDs may differ from those shown in this example; always use your own bus IDs when following the steps below, and update the bus ID associated with a GPU if you move it to a different slot.
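
If you would rather not pick the addresses out by eye, a small filter along these lines can do it. This is only a sketch: the function name is hypothetical, and it assumes the lspci output includes the domain prefix as in the example above (passing `-D` to lspci asks for the domain to always be shown).

```shell
# Hypothetical helper: read lspci output on stdin and print the address
# (first field) of every line describing a VGA compatible controller.
vga_addresses() {
    awk '/VGA compatible controller/ { print $1 }'
}
```

Typical use would be `lspci -D | vga_addresses`, which prints one d:B:D.F address per line.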

Step 2: Create Xorg Configuration Snippet
root@talos:~# mkdir /etc/X11/xorg.conf.d

Create and open /etc/X11/xorg.conf.d/21-gpu-driver.conf for editing, then adjust the following template with your GPU information. Pay close attention to the BusID and Driver fields, as they must match your installed GPU(s). Note that Xorg uses decimal numbering, not hexadecimal like lspci, so you will need to convert each field of the lspci address (domain, bus, device and function) to decimal in order to construct a valid Xorg BusID. Furthermore, Xorg doesn't use leading zeroes like lspci does; these must be stripped off when assembling the Xorg BusID. Finally, Xorg expects to see a BusID assembled as "PCI:B@d:D:F" (note Bus and Domain are swapped); it must not be assembled using the format shown by lspci.
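
The conversion rules above (hexadecimal to decimal, no leading zeroes, Bus and Domain swapped) can be sketched as a small shell function. The function name is hypothetical, and it assumes the input is a full lspci address including the domain, such as 0005:02:00.0.

```shell
# Hypothetical helper: convert an lspci address (dddd:bb:dd.f, hexadecimal)
# into the decimal "PCI:B@d:D:F" form that Xorg expects for BusID.
lspci_to_xorg_busid() {
    addr=$1
    domain=${addr%%:*}        # text before the first colon
    rest=${addr#*:}           # remaining "bb:dd.f"
    bus=${rest%%:*}
    devfn=${rest#*:}          # remaining "dd.f"
    dev=${devfn%%.*}
    fn=${devfn#*.}
    # printf %d with a 0x prefix reads the field as hexadecimal and prints
    # it in decimal, which also drops any leading zeroes.
    printf 'PCI:%d@%d:%d:%d\n' "0x$bus" "0x$domain" "0x$dev" "0x$fn"
}
```

Running `lspci_to_xorg_busid 0005:02:00.0` prints `PCI:2@5:0:0`, the value used for the AST2500 in this example.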

# AST2500
Section "Device"
    Identifier     "GPU0"
    Driver         "modesetting"
    BusID          "PCI:2@5:0:0"
    VendorName     "ASpeed Corporation"
EndSection

# WX7100
Section "Device"
    Identifier     "GPU1"
    Driver         "amdgpu"
    BusID          "PCI:1@0:0:0"
    VendorName     "AMD Corporation"
EndSection

Save and exit the configuration snippet file, then restart Xorg. Your GPUs should now function as intended. If Xorg still does not start, make sure that the appropriate kernel driver (such as amdgpu in the example above) has been loaded:

root@talos:~# modprobe amdgpu
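
To check whether the kernel driver is already loaded before running modprobe, something like the following may help. It is a sketch with a hypothetical function name, and it assumes the usual /proc/modules layout in which each line begins with the module name followed by a space.

```shell
# Hypothetical helper: succeed if the named module appears in the module list.
# The optional second argument (default /proc/modules) exists only so the
# function can be tried against a sample file.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

For example, `module_loaded amdgpu || modprobe amdgpu` loads the driver only when it is missing.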

Step 3 (optional): Disable Integrated Video

If you want all Xorg output to be directed to your discrete GPU(s), you may want to disable the integrated VGA video output as much as possible. To do this, delete the ASpeed block from your Xorg configuration snippet, and blacklist the ast driver. To blacklist the ast driver on Debian-based systems, create a new file /etc/modprobe.d/ast-blacklist.conf and place the following line inside the new file:

blacklist ast
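
The file can also be created from the shell instead of an editor. The sketch below uses a hypothetical function name; the directory argument is an addition for illustration, so the function can be tried against a scratch directory before it is run as root against /etc/modprobe.d.

```shell
# Hypothetical helper: write the blacklist file described above.
# The directory defaults to /etc/modprobe.d when no argument is given.
write_ast_blacklist() {
    dir=${1:-/etc/modprobe.d}
    printf 'blacklist ast\n' > "$dir/ast-blacklist.conf"
}
```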

You may need to reboot if the ast DRM driver has already loaded. Alternatively, you may try to unbind and unload the ast driver as follows (assuming the ast driver is bound to vtcon0):

root@talos:~# echo 0 > /sys/class/vtconsole/vtcon0/bind
root@talos:~# rmmod ast

Bootloader does not show up on monitor(s) attached to a discrete GPU

Most modern discrete GPUs require firmware. As Talos™ II is aimed at a security-conscious audience, we do not currently include GPU firmware in the production firmware images. Instructions are available in the Users Guide to add firmware for your GPU to the PNOR if needed. Note that any added firmware may be able to access and modify data associated with the affected device(s); we strongly recommend you perform a security risk analysis before loading any firmware, and select open firmware where/if it is available.

If you are using a GPU that does not require firmware, or have already added any needed firmware files to the host PNOR, please ensure that the on-board VGA disable jumper is capped. The bootloader output will preferentially show up on the on-board VGA port if it remains enabled.

Alternatively, you can use either a serial console or a VGA monitor / adapter to interact with the bootloader.

Notes

  1. PCI Domain:Bus:Device.Function