Hostboot Debug Howto
How to Debug Hostboot
So, you have an abort or crash early in hostboot and, knowing that hostboot is able to generate stack traces, would prefer not to resort to shotgun debugging. Unfortunately, the errl utility is not available outside of IBM, so the traditional eSEL.pl and related tools won't work. Fear not: there is still a (convoluted) way to make this work.
Grab and decode the HBEL entry
Provided PNOR access is working, hostboot will save the crash information to the HBEL partition. Since we have no tooling to parse this partition, it's easiest to debug a reproducible crash by first wiping the HBEL contents, so that the partition only holds records from the next boot attempt:
root@bmc# pflash -P HBEL -c
Now attempt to IPL the system once, and wait for it to shut back down. Issue a chassis off command to make sure it stays down at this point.
Read the HBEL partition into a temporary file:
root@bmc# pflash -P HBEL -r /tmp/hbel.dat
The partition is ECC protected, so the first thing you will need to do is remove every ninth byte (each group of eight data bytes is followed by one ECC byte). Once that is done, in between the HBEL binary record data, you will see something similar to:
TID: 37 Bad Address: 10004abb6 Exception Type: 40000000 Instruction where it occurred: 0x40639fc0 K:Backtrace for 37: <-0x366D08<-0x40639D98<-0x4063B788<-0x40643784<-0x406438B8<-0x4063C8F0<-0x4063CA84<-0x4063E624<-0x4064B60C<-0x2714
The backtrace reads from most recent to oldest, so the most recent address is at the left end of the backtrace string.
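The two manual steps above (dropping every ninth byte, then picking the addresses out of the backtrace string) can be sketched in Python. This is a minimal sketch, not an official tool: the function names strip_ecc and parse_backtrace are made up for illustration, the ECC bytes are simply discarded rather than checked, and the regular expression assumes the backtrace looks like the sample above.

```python
import re
from typing import List

# Matches each "<-0x..." frame in a hostboot backtrace string.
FRAME_RE = re.compile(r"<-0x([0-9A-Fa-f]+)")


def strip_ecc(raw: bytes) -> bytes:
    """Drop every ninth byte, keeping the 8 data bytes of each 9-byte group.

    The ECC byte is thrown away without being verified, so corrupted
    data will pass through unnoticed.
    """
    return b"".join(raw[i:i + 8] for i in range(0, len(raw), 9))


def parse_backtrace(line: str) -> List[int]:
    """Return the backtrace addresses in the order they appear (left to right)."""
    return [int(hex_digits, 16) for hex_digits in FRAME_RE.findall(line)]
```

To use it, copy /tmp/hbel.dat off the BMC, read it in binary mode, pass the contents to strip_ecc, and write the result to a new file. You can then search that file for the "Backtrace for" text and feed the line to parse_backtrace.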
Look up offending symbol in table
By itself the backtrace isn't very useful; however, a symbol table called hbicore.syms is built as part of the normal op-build process. You can find it in talos-op-build/output/host/powerpc64le-buildroot-linux-gnu/sysroot/hostboot_build_images/, alongside various other utilities that mostly don't work due to the missing proprietary errl binary.
The format appears to be:
Find the symbol whose address range contains the faulting address in this table; that entry gives you the symbol name.
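The range lookup itself can be sketched as follows. Since the exact layout of hbicore.syms is not pinned down here, this sketch deliberately does not parse the file: it assumes you have already extracted (start address, size, name) tuples from it by whatever means, sorted by start address. The find_symbol name and the tuple layout are assumptions for illustration only.

```python
import bisect
from typing import List, Optional, Tuple

# Hypothetical in-memory form of one symbol table entry:
# (start address, size in bytes, symbol name)
Symbol = Tuple[int, int, str]


def find_symbol(symbols: List[Symbol], addr: int) -> Optional[str]:
    """Return "name+0xoffset" for the symbol containing addr, or None.

    symbols must be sorted by start address.
    """
    starts = [start for start, _size, _name in symbols]
    # Index of the last symbol whose start address is <= addr.
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, size, name = symbols[i]
    if addr < start + size:
        return f"{name}+0x{addr - start:x}"
    # addr falls in a gap after the end of the nearest preceding symbol.
    return None
```

Run each address from the backtrace through find_symbol, starting from the left, to turn the trace into a readable call chain. An address that returns None falls outside every range in the list, which usually means the entries were extracted incorrectly or the address belongs to a different image.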