Difference between revisions of "Checkstop"

From RCS Wiki
(Known Checkstop Issues)
Line 27:
 == (NCUFIR[11]) NCU no response to snooped TLBIE ==
 
-This is a firmware bug that was already [https://delivery04.dhe.ibm.com/sar/CMA/SFA/09zs6/0/9006-12p-22p-OpenPowerReadme.op920.41.xhtml#__RefHeading___Toc5321_1053759979 fixed by IBM PNOR v2.18]. Unfortunately that fix has not yet been merged by Raptor.
+This is a firmware bug that was already [https://delivery04.dhe.ibm.com/sar/CMA/SFA/09zs6/0/9006-12p-22p-OpenPowerReadme.op920.41.xhtml#__RefHeading___Toc5321_1053759979 fixed by IBM PNOR v2.18]. According to [https://github.com/open-power/hostboot/issues/220 Hostboot issue 220], the fix was in [[HCODE]] commit [https://github.com/open-power/hcode/commit/9eb379569ffc1ae192aaa82bba43b25a051633b4 9eb379569ffc1ae192aaa82bba43b25a051633b4] ("CME: big core workaround for field TLBIE xstop", committed 2021 March 23). Unfortunately that fix has not yet been merged to Raptor's HCODE repository.

Revision as of 13:44, 3 March 2023

Diagnosing a Checkstop

There are a few ways to obtain logs of a checkstop.

nvram

From either the OS or Skiroot, run the following as root (or via sudo) after the machine has force-rebooted following the checkstop, but before it is rebooted again:

nvram --unzip lnx,oops-log

If you're lucky, this prints a log of the most recent checkstop. If you instead get nvram: ERROR: can't decompress text: inflate() returned -3, then the compressed log in NVRAM is corrupted (-3 is zlib's Z_DATA_ERROR), and you'll need to try a different approach.
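Because the log may be lost on the next reboot, it can help to write it to a file right away. The following is a minimal sketch, not part of the nvram tool itself: the function name save_oops_log and the default file name are made up for illustration, and it assumes nvram either exits non-zero or prints nothing when decompression fails.

```shell
#!/bin/sh
# Sketch: save the NVRAM oops log to a file and report whether it worked.
# Assumption: nvram exits non-zero (or prints nothing) when decompression fails.
save_oops_log() {
    out="${1:-checkstop-oops.log}"   # hypothetical default file name
    if nvram --unzip lnx,oops-log > "$out" 2> "$out.err" && [ -s "$out" ]; then
        echo "checkstop log saved to $out"
    else
        # Keep nvram's own error text so the failure can be inspected later.
        echo "could not read lnx,oops-log; see $out.err" >&2
        return 1
    fi
}
```

Run it as root, for example `save_oops_log /root/checkstop.log`. A non-zero return means one of the other approaches on this page is needed.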

opal-prd

Before the checkstop occurs, install opal-prd from the OS. The command below is for Debian; most other distros also package it (see your distro's documentation for details):

sudo apt install opal-prd

Once installed, if you're lucky, any subsequent checkstops should show up in journalctl output.
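To narrow journalctl output down to opal-prd, something like the sketch below may be useful. It assumes the daemon runs as a systemd unit named opal-prd (check with systemctl if unsure); the function name show_prd_log is hypothetical. Reading a previous boot also assumes the journal is stored persistently.

```shell
#!/bin/sh
# Sketch: print opal-prd's journal entries for a given boot.
# Assumption: the service runs as a systemd unit named "opal-prd".
show_prd_log() {
    boot="${1:-0}"   # 0 = current boot, -1 = previous boot
    journalctl -u opal-prd -b "$boot" --no-pager
}
```

After a checkstop and forced reboot, `show_prd_log -1` would show what was logged during the boot in which the checkstop happened, if anything was written before the machine went down.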

Client Console

Stay connected to the BMC Client Console from another machine while the checkstop occurs. During the subsequent forced reboot, Hostboot will print a log of the checkstop.
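Since the Hostboot output scrolls past quickly, keeping a copy on the other machine is worthwhile. The sketch below is one possible way to do that from a terminal rather than from the Client Console itself. It assumes an OpenBMC-style BMC that serves the host console over SSH on port 2200; the function name log_console, the root login, and the log file name are illustrative and may differ on your system.

```shell
#!/bin/sh
# Sketch: attach to the host console through the BMC and keep a copy of the output.
# Assumption: an OpenBMC-style BMC serving the host console over SSH on port 2200.
log_console() {
    bmc="$1"                                          # BMC hostname or IP (site-specific)
    out="${2:-console-$(date +%Y%m%d-%H%M%S).log}"    # hypothetical default log name
    # tee shows the console live and writes the same bytes to the log file.
    ssh -p 2200 "root@$bmc" | tee "$out"
}
```

For example, `log_console talos-bmc.example checkstop-console.log` (the hostname is a placeholder) leaves the Hostboot checkstop output in the named file after the forced reboot.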

Known Checkstop Issues

(NCUFIR[11]) NCU no response to snooped TLBIE

This is a firmware bug that was already fixed by IBM PNOR v2.18. According to Hostboot issue 220, the fix was in HCODE commit 9eb379569ffc1ae192aaa82bba43b25a051633b4 ("CME: big core workaround for field TLBIE xstop", committed 2021 March 23). Unfortunately that fix has not yet been merged to Raptor's HCODE repository.
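To check whether a given HCODE checkout already contains the fix, a sketch along the following lines could be used. The function name has_tlbie_fix is hypothetical; the commit hash and subject are the ones quoted above. The subject-line fallback is an assumption that a cherry-picked copy of the fix would keep the original commit message.

```shell
#!/bin/sh
# Sketch: report whether a local hcode clone contains the TLBIE checkstop fix.
FIX_COMMIT=9eb379569ffc1ae192aaa82bba43b25a051633b4
FIX_SUBJECT='CME: big core workaround for field TLBIE xstop'

has_tlbie_fix() {
    repo="${1:-.}"
    # Exact match: the upstream commit is an ancestor of the checked-out HEAD.
    if git -C "$repo" merge-base --is-ancestor "$FIX_COMMIT" HEAD 2>/dev/null; then
        return 0
    fi
    # Fallback (assumption): a cherry-pick gets a new hash but keeps the subject line.
    [ -n "$(git -C "$repo" log -F --grep="$FIX_SUBJECT" --format=%H -n 1 2>/dev/null)" ]
}
```

Usage: `has_tlbie_fix /path/to/hcode && echo "fix present" || echo "fix missing"`, where the path points at a clone of the repository being checked.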