Porting/ROCm
I am trying to get rocminfo working on Debian 13 on ppc64le:
2026-02-01: Made progress, documented here: https://www.fitzsim.org/blog/?p=797
Now struggling to figure out how to access the AMD Radeon AI Pro R9700 from within a libvirt virtual machine.
I had to rebuild the Debian 6.17.13 kernel with my hack (see blog post), and also with the extra CONFIG*VFIO*=m and CONFIG*VIRTIO*=m options enabled, so that I could test the same kernel on the host and the guest.
The procedure described after this log reliably recovers from the following amdgpu initialization failure, in which PSP firmware loading fails:
[ 5.068999] amdgpu 0001:00:01.0: amdgpu: VRAM: 32624M 0x0000008000000000 - 0x00000087F6FFFFFF (32624M used)
[ 5.069005] amdgpu 0001:00:01.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 5.069010] [drm] Detected VRAM RAM=32624M, BAR=32768M
[ 5.069013] [drm] RAM width 256bits GDDR6
[ 5.069027] amdgpu 0001:00:01.0: lsa_required: 0, lsa_enabled: 0, direct mapping: 0
[ 5.069030] amdgpu 0001:00:01.0: dma_iommu_get_required_mask: returning bypass mask 0x1fffffffff
[ 5.070109] amdgpu 0001:00:01.0: amdgpu: amdgpu: 32624M of VRAM memory ready
[ 5.070114] amdgpu 0001:00:01.0: amdgpu: amdgpu: 64432M of GTT memory ready.
[ 5.070178] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 5.070290] amdgpu 0001:00:01.0: amdgpu: PCIE GART of 512M enabled (table at 0x00000087D6B00000).
[ 5.082027] amdgpu 0001:00:01.0: amdgpu: [drm] Loading DMUB firmware via PSP: version=0x0A000700
[ 5.138168] amdgpu 0001:00:01.0: amdgpu: Found VCN firmware Version ENC: 1.11 DEC: 9 VEP: 0 Revision: 1
[ 7.738540] amdgpu 0001:00:01.0: amdgpu: PSP load kdb failed!
[ 7.927776] amdgpu 0001:00:01.0: amdgpu: psp reg (0x16080) wait timed out, mask: 8000ffff, read: 30000 exp: 80000000
[ 7.927872] [drm:psp_v14_0_ring_destroy [amdgpu]] *ERROR* Fail to stop psp ring
[ 7.928173] amdgpu 0001:00:01.0: amdgpu: PSP firmware loading failed
[ 7.928211] amdgpu 0001:00:01.0: amdgpu: hw_init of IP block <psp> failed -22
[ 7.928252] amdgpu 0001:00:01.0: amdgpu: amdgpu_device_ip_init failed
[ 7.928289] amdgpu 0001:00:01.0: amdgpu: Fatal error during GPU init
[ 7.928327] amdgpu 0001:00:01.0: amdgpu: amdgpu: finishing device.
[ 7.928477] ------------[ cut here ]------------
[ 7.928478] WARNING: CPU: 2 PID: 230 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:639 amdgpu_irq_put+0x90/0x1a8 [amdgpu]
/etc/modprobe.d/vfio.conf:
- softdep amdgpu pre: vfio-pci
- softdep snd_hda_intel pre: vfio-pci
- options vfio-pci ids=1002:7551,1002:ab40
The card's state is very "sticky"; if anything goes wrong, you probably need to run "shutdown now" on the host, then on the BMC run "obmcutil state" (wait for the "Off" states), then "obmcutil poweron". In some bad states, "reboot" DOES NOT GET THE CARD BACK, which is quite annoying.
With those vfio.conf lines commented out, this will reliably get the card back to normal.
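The wait-for-Off step on the BMC can be scripted rather than polled by hand. This is a minimal sketch, assuming "obmcutil state" prints lines of the form "CurrentPowerState : xyz.openbmc_project.State.Chassis.PowerState.Off" and "CurrentHostState : xyz.openbmc_project.State.Host.HostState.Off"; that format should be checked against the BMC firmware in use, and the helper name all_off is hypothetical.

```shell
# Hypothetical helper: reads `obmcutil state` output on stdin and succeeds
# only if both the chassis power state and the host state report Off.
# Assumes lines of the form "CurrentPowerState : <dotted.name>.Off".
all_off() {
    awk '
        /^CurrentPowerState/ { seen_power = 1; if ($NF !~ /\.Off$/) bad = 1 }
        /^CurrentHostState/  { seen_host = 1;  if ($NF !~ /\.Off$/) bad = 1 }
        END { exit !(seen_power && seen_host && !bad) }
    '
}
```

On the BMC, after "shutdown now" on the host, this could be used as "until obmcutil state | all_off; do sleep 5; done" followed by "obmcutil poweron". Transitional states do not end in ".Off", so the loop keeps waiting through them.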
Then uncomment those vfio.conf lines and "reboot". After the reboot, "lspci -nnk" will show "vfio" in its output, and dmesg will show vfio capturing the two PCIe IDs of the card's graphics and sound endpoints.
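To check the binding without reading the whole "lspci -nnk" listing, the output can be filtered down to the two endpoints. This is a rough sketch, assuming the usual "lspci -nnk" layout (an unindented header line per device containing "[vendor:device]", followed by indented lines such as "Kernel driver in use: ..."); the function name drivers_for_ids is hypothetical.

```shell
# Hypothetical helper: reads `lspci -nnk` output on stdin and prints
# "<slot> <driver>" for every device whose header line contains one of the
# given vendor:device IDs. Devices with no bound driver print "(none)".
drivers_for_ids() {
    awk -v ids="$*" '
        function emit() { if (slot != "") print slot, (drv == "" ? "(none)" : drv) }
        BEGIN { n = split(ids, want, " ") }
        /^[^ \t]/ {                      # unindented line: a new device header
            emit(); slot = ""; drv = ""
            for (i = 1; i <= n; i++)
                if (index($0, "[" want[i] "]")) slot = $1
        }
        /Kernel driver in use:/ { if (slot != "") drv = $NF }
        END { emit() }
    '
}
```

For the IDs in vfio.conf this would be run as "lspci -nnk | drivers_for_ids 1002:7551 1002:ab40"; both lines should end in "vfio-pci" when passthrough is set up, and in "amdgpu" and "snd_hda_intel" (or "(none)") when the vfio.conf lines are commented out.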
In the virt-manager GUI I "Add PCI host" hardware for the two endpoints, then start the VM. After "virsh console <vmname>" and logging in as root, rocminfo shows the card, which is nice. However, the VM hard-hangs with no "dmesg -w" output, and strace of the qemu process on the host shows a tight loop of nothing but "ioctl" calls of something like DIRTY_LOG_KMS (I didn't capture the precise output).
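Since the precise strace output was not captured, a saved log could be tallied by ioctl request name next time the hang occurs. This is only a sketch, assuming strace lines shaped like "ioctl(12, SOME_REQUEST, 0x...) = 0", optionally prefixed by a pid; the function name ioctl_tally and the log file name in the usage note are hypothetical.

```shell
# Hypothetical helper: reads strace output on stdin and prints
# "<count> <request>" for each distinct ioctl request name, most frequent
# first. Assumes lines shaped like: [pid 1234] ioctl(12, SOME_REQUEST, ...) = 0
ioctl_tally() {
    sed -n 's/.*ioctl([0-9]*, \([A-Za-z0-9_]*\).*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}
```

One way to use it: record with "strace -f -e trace=ioctl -p <qemu-pid> -o qemu-ioctl.log" on the host, interrupt strace after a few seconds, then run "ioctl_tally < qemu-ioctl.log". The request at the top of the list is the one being issued in the tight loop.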
That's where I'm at with "GPU passthrough" support for this card so far.
2026-01-12: Even with the latest firmware, linux-firmware-upstream_20260110-2-gfd647379_all.deb, on 6.17.13+deb14-powerpc64le-64k, there is an issue parsing the Virtual Component Resource Association Table (VCRAT) that results in the Kernel Fusion Driver (kfd) not being able to add the GPU to the topology, so rocminfo cannot find it.
dmesg shows:
[............] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[............] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[............] amdgpu: Virtual CRAT table created for GPU
[............] amdgpu 0033:03:00.0: amdgpu: Error parsing VCRAT
[............] kfd kfd: amdgpu: Error adding device to topology
[............] kfd kfd: amdgpu: Error initializing KFD node
[............] kfd kfd: amdgpu: device 1002:7551 NOT added due to errors
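When testing other kernels or firmware for this failure, the final "NOT added due to errors" message is a convenient thing to check for. A small sketch, assuming the message wording stays as in the dmesg output above; the function name kfd_rejected is hypothetical.

```shell
# Hypothetical helper: reads dmesg output on stdin and prints the PCI
# vendor:device ID of every GPU that kfd reported as not added.
# Matches the message wording seen in the log above.
kfd_rejected() {
    sed -n 's/.*kfd kfd: amdgpu: device \([0-9a-fA-F]*:[0-9a-fA-F]*\) NOT added due to errors.*/\1/p'
}
```

Run as "dmesg | kfd_rejected"; it prints nothing if the message is absent, and 1002:7551 for the failure shown here.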
I'm not sure if I should report this as a bug somewhere, or maybe try on the absolute latest kernel...
2026-01-11: The segfault was caused by vDSO detection failing. Fixed by this patch:
https://www.fitzsim.org/patches/0001-rocr-Fix-vDSO-detection-on-ppc64-architectures-in-os.patch
2026-01-06: It segfaults:
ii rocminfo 6.1.2-2 ppc64el ROCm Application for Reporting System Info
# rocminfo
ROCk module is loaded
Segmentation fault