Difference between revisions of "Troubleshooting/Guard Partition"

From RCS Wiki
Latest revision as of 09:26, 24 December 2022

If some components (e.g. a CPU or some cores on a CPU) are not being detected, they may have been guarded out. Guarding is a mechanism that allows POWER systems to keep functioning when broken components are detected. However, if a component is incorrectly detected as broken (or if it really was broken but has since been fixed), the guard entry keeps the component disabled until the spurious entry is manually cleared.

To clear the guard partition (and thereby force the system to try those components again on next boot), issue <code>pflash -P GUARD -c</code> from either the BMC shell or the Skiroot shell.
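
As a minimal sketch, the clearing step can be wrapped so that the existing guard records are saved before they are wiped. The function name <code>clear_guard</code> and the default backup path are hypothetical, chosen only for this example. The sketch also assumes that the installed <code>pflash</code> supports <code>-r</code> (read a partition into a file) in addition to the <code>-P GUARD -c</code> invocation given above.

```shell
# Sketch only: back up the GUARD partition, then clear it.
# "clear_guard" and the default backup path are hypothetical names.
# Assumes pflash supports -r FILE (read the selected partition to a file).
clear_guard() {
    backup="${1:-/tmp/guard-backup.bin}"

    # Save the current guard records so they can be inspected later.
    pflash -P GUARD -r "$backup" || {
        echo "could not read GUARD partition; not clearing" >&2
        return 1
    }

    # Clear the partition; guarded components are retried on the next boot.
    pflash -P GUARD -c
}
```

Run <code>clear_guard /tmp/guard.bin</code> from the BMC or Skiroot shell, then reboot the host so that the components are probed again. If reading the partition fails, the function stops without clearing anything.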

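Note: CPUs being guarded out might not be a rare occurrence. It has been reported, for example, in [https://www.talospace.com/2020/05/the-case-of-disappearing-core.html a Talospace post] and in [http://tenfourfox.blogspot.com/2018/05/a-semi-review-of-raptor-talos-ii.html a TenFourFox blog semi-review of the Raptor Talos II]. This could also mean that the guard mechanism is "dialed in" to be very conservative. More insight into the mechanics would be a welcome addition to this wiki.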