Latest revision as of 01:02, 17 June 2023
Checkstop (xstop): An error that results in the system being forcibly rebooted by the firmware.
Diagnosing a Checkstop
There are a few ways to obtain logs of a checkstop.
nvram
From either the OS or Skiroot, run the following as root after the machine has force-rebooted following the checkstop, but before it reboots again:
nvram --unzip lnx,oops-log
If you're lucky, it will return a log of the most recent checkstop. If you instead get nvram: ERROR: can't decompress text: inflate() returned -3, then the log in NVRAM is corrupted for some reason, and you'll need to try a different approach.
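The two outcomes described above can be told apart in a script. The following is a minimal sketch, not an official tool: the function name classify_oops_log is hypothetical, it matches only the error text quoted above, and the "empty" case is an assumption about what a missing log might look like.

```shell
#!/bin/sh
# Hypothetical helper: classify the text printed by
#   nvram --unzip lnx,oops-log
# Only the error string quoted on this page is recognised;
# other failure modes are not handled.
classify_oops_log() {
    case "$1" in
        *"can't decompress text"*) echo "corrupted" ;;  # log in NVRAM is unusable
        "")                        echo "empty" ;;      # assumption: nothing was printed
        *)                         echo "log" ;;        # some text came back; inspect it
    esac
}

# Usage (as root, after the forced reboot and before rebooting again):
#   out=$(nvram --unzip lnx,oops-log 2>&1)
#   classify_oops_log "$out"
```

If the result is "corrupted", move on to one of the other approaches below.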
opal-prd
Before the checkstop occurs, install opal-prd from the OS. On Debian, run the following; most other distros also package it (see your distro's documentation for details):
sudo apt install opal-prd
Once installed, if you're lucky, any subsequent checkstops should show up in journalctl output.
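Since journalctl output can be long, it may help to filter it. The sketch below assumes that relevant lines contain the word "checkstop" or "xstop"; the actual wording of opal-prd messages is not documented on this page, so adjust the pattern as needed. The function name filter_checkstop is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that mention a checkstop.
# The search terms are an assumption; real opal-prd messages may be worded differently.
filter_checkstop() {
    grep -i -E 'checkstop|xstop'
}

# Usage:
#   journalctl --no-pager | filter_checkstop
# To narrow the search to one service (assumes the unit is named opal-prd):
#   journalctl --no-pager -u opal-prd | filter_checkstop
```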
Client Console
Connect to the BMC Client Console from another machine before the checkstop occurs, and stay connected while it happens. During the subsequent forced reboot, Hostboot will print a log of the checkstop.
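Because the Hostboot log scrolls past during the reboot, it can be useful to save the console session to a file. The following is a minimal sketch using tee; the function name log_console is hypothetical, and the command used to open the BMC Client Console is not specified on this page, so it is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: run a console command and append everything it
# prints (stdout and stderr) to a log file while still showing it on screen.
# Usage: log_console LOGFILE COMMAND [ARGS...]
log_console() {
    logfile=$1
    shift
    "$@" 2>&1 | tee -a "$logfile"
}

# Example (replace the placeholder with whatever command you normally
# use to open the BMC Client Console):
#   log_console checkstop-console.log <your console command>
```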
Known Checkstop Issues
(NCUFIR[11]) NCU no response to snooped TLBIE
This is a firmware bug that was fixed in IBM PNOR v2.18. According to Hostboot issue 220, the fix was in HCODE commit 9eb379569ffc1ae192aaa82bba43b25a051633b4 ("CME: big core workaround for field TLBIE xstop", committed 2021 March 23). As of 2023 June 17, that fix had not yet been merged into Raptor's HCODE repository.
JeremyRand has confirmed that rebasing Raptor's HCODE against current IBM HCODE fixes the bug. A pre-rebased branch of HCODE (https://github.com/JeremyRand/hcode/tree/talos-2019-07-25-master-rebased) is available for users who want the patch before a Raptor release.
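To check whether a given HCODE checkout contains the fix, one option is to test whether the commit named above is an ancestor of the checked-out branch. This is a sketch under assumptions: the helper name has_commit is hypothetical, the clone URL and branch name are derived from the pre-rebased branch link (https://github.com/JeremyRand/hcode/tree/talos-2019-07-25-master-rebased), and it assumes the rebased branch keeps IBM's original commit hash.

```shell
#!/bin/sh
# Hypothetical helper: succeed if COMMIT is an ancestor of HEAD
# in the Git repository at DIR.
# Usage: has_commit DIR COMMIT
has_commit() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD
}

# Usage:
#   git clone --branch talos-2019-07-25-master-rebased https://github.com/JeremyRand/hcode.git
#   has_commit hcode 9eb379569ffc1ae192aaa82bba43b25a051633b4 && echo "fix present"
```

A non-zero exit status means the commit is either missing from the repository or not part of the checked-out history.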