Difference between revisions of "Talos II/Building FAQ"

From RCS Wiki
Latest revision as of 16:44, 6 May 2024

Where is the installation manual online?

Talos II: File:T2P9D01 users guide version 1 0.pdf

Blackbird: File:C1P9S01 users guide version 1 0.pdf

My motherboard bag's seal/labels are broken! Has it been compromised?

This is normal for now. (The board may still have been compromised, but broken labels alone do not indicate that.)

Mounting in case

Where do I get the stand-offs and screws?

They should come with your case. (Check inside drive bays and such.)

Should I use rubber spacers with the stand-offs?

Stand-offs are supposed to help ground the motherboard, so it's better not to.

My case doesn't have holes for some stand-offs!

That's not necessarily a big deal, especially for the top-left where the I/O plate helps hold it in place.

However, note that with fewer stand-offs installed, the board bends more than usual when you insert CPUs, RAM, or other components. Such bending may damage the board! To prevent that, you can use plastic mainboard stand-offs (like these: https://www.pccables.com/images/STANDOFFS-MOTHERBOARD-STANDARD-MB-PLASTIC-100-PACK.jpg) in places where proper stand-offs are missing. As an alternative, you can put a spare stand-off upside down into a screw hole from below, so that its outer thread faces upwards, and then fix it in place by screwing another stand-off on top of it.

CPU/HSF installation

How far should the HSF screw be tightened?

The screw has a hardstop; you turn until you can't.

What is an indium pad? Does the stock HSF include it?

Indium pads help heat transfer from the CPU to the HSF. 4-core and 8-core CPUs do not require them (and do not ship with them). More powerful CPUs should ship with them if required (TBD whether pre-applied to the HSF, or separately).

Should thermal paste be used?

The use of thermal paste is not recommended under any circumstance. The heatsink is attached using an unusual high-pressure mounting system which places over 200lbs of force on the CPU module, making thermal paste unnecessary.

Optionally, an indium pad may be placed between the heatsink and the CPU heatspreader to enhance dissipation. Testing found this to make no difference to temperatures for 4-core and 8-core CPUs. An indium pad is included with 18 and 22-core CPUs and its use is recommended for those CPUs.

Should I remove the label/sticker from the HSF?

No. Do not remove the label/sticker, or you will void the warranty of the HSF.

Can I use a 4 mm hex driver?

Yes. 5/32" = 3.97 mm, so a 4 mm driver is a close enough match.
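The conversion is easy to check by hand: one inch is exactly 25.4 mm, so 5/32" works out to 3.96875 mm. A minimal sketch using awk (the function name is made up for illustration; any calculator does the same job):

```shell
# Convert a fractional-inch driver size to millimetres (1 in = 25.4 mm exactly).
inch_fraction_to_mm() {
    # $1 = numerator, $2 = denominator, e.g. 5 32 for 5/32"
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d * 25.4 }'
}

inch_fraction_to_mm 5 32   # prints 3.97
```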

Removing the HSF from a CPU with an indium pad

The heat emitted during the operation of the CPU may cause the indium pad to stick to the HSF and the CPU. If the HSF is removed, there is a possibility that the CPU and HSF may stick together, only to separate once the HSF has been partially removed. This could cause the CPU to fall downwards (onto the socket) at an angle, which may damage the socket. For this reason, exercise extreme caution when removing the HSF from a CPU with an indium pad which has been run at load.

Dust has settled in the CPU socket while I was removing the CPU; what should I do?

It should be safe to apply compressed air from a distance. Start with the nozzle ~12 inches away and approach 0.5 inches at a time, until the force is sufficient to dislodge and remove the dust. Stop immediately if you see the pins themselves moving; that means you're close enough to damage the socket! [1]

Front panel I/O

Which is the other side of the buttons?

Typically ground, though there is nothing mandating this in the general case. ATX case switches normally short out two adjacent pins when depressed.

FIXME: Confirm this is the case for Talos specifically.

Are the LED "cathode" pins the plus or minus side?

Minus.

What should the plus side of the LED be connected to?

The associated Anode pin.

Purpose     - (cathode)   + (anode)
Fan fail    6             8
NIC 2       10            9
NIC 1       12            11
HD*         14            15
Power       16            15

What does the Identify button do?

Turns on and off the Identify LEDs. This is mainly useful in server farms, as the ID LED status can be both read and set via software (IPMI). The main use is making sure that the correct server is unplugged, restarted, upgraded, etc. by datacenter staff.

What does the NMI button do?

As of this writing the NMI button is ignored by the BMC. It may be used to generate an NMI in future firmware revisions, or serve another purpose entirely.

The HD activity LED doesn't work!

The integrated Microsemi controller does not report activity (yet). A much-belated SAS controller firmware update from Microsemi is expected by 04/20/2018 to enable this functionality.

In the interim, J10115 can be connected to other hardware to control the HD activity LED.

What is J10115?

Something related to HD activity LED. :)

How do I connect the "ALERT" LED mentioned in the "Initial Power-On" section of the manual?

The anode of this LED is on pin 7 of the front panel header (erroneously labelled as +3.3V in the Talos II user manual as of this writing). If an LED is connected with its anode on pin 7 and its cathode on pin 8, it will reflect the "ALERT" LED status. This LED is also referred to as the "UID" LED.

Supermicro chassis use a bidirectional LED for this purpose, doing double duty with the Fan Fail LED.

BMC serial port J7701

When buying a "serial port bracket", you need one wired in the Intel/TDK (DTK)/Tyan style, not the AT/Everex/Gigabyte style; see http://pinoutguide.com/Motherboard/rs232_header_pinout.shtml for the differences. The Intel style looks like https://iczc.cz/8fi3g7r5amg33a1pjn9tl4v9r8_7/obrazek, while the other style looks like https://iczc.cz/5dessg9ns0ht49fed64jmrsita_7/obrazek. The proof is on page 77 of the schematics.

Be careful when reading the specification pages on item listings: some adapters of the wrong style are sold as "Intel" compatible.

See Talos II/Hardware Compatibility List#Serial_Adapters_for_J7701_Header for a list of known compatible and incompatible adapters.

What is OCC mode?

The On Chip Controller (OCC) is a clock / thermal management engine.

The OCC can enter a safe mode if external hardware detects a condition that would require power throttling. This feature is not active in firmware on Talos II, but the wiring required to support it is present for future expansion.

What are the effects of the "CPU secure mode disable" jumpers?

When secure mode is disabled, the on-board SBE will not halt IPL if the next stage (hostboot) fails security verification. When secure mode is enabled, each step of the IPL process verifies the next, and will halt IPL if a discrepancy (hash difference, invalid signature, etc.) is found. Talos II ships with secure mode disabled as of this writing.

How do I verify the PGP key that signed the DVD?

See the page on Verifying DVDs.

What is micro PCI-e?

Unknown.

How to get versions of firmware components?

  • run lsprop under /proc/device-tree/ibm,firmware-versions
  • run lsmcode (available in lsvpd package)
  • run ipmitool fru print 47
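If lsprop is not installed, the same device-tree node can be read directly. The following is a rough sketch only: the function name is hypothetical, and it assumes the node holds one NUL-terminated string per firmware component plus the generic name and phandle properties, which it skips.

```shell
# Print each firmware-version property as "name: value".
# Device-tree string properties are NUL-terminated, so strip the NULs.
print_fw_versions() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        # "name" and "phandle" are generic device-tree properties, not versions.
        case $name in
            name|phandle) continue ;;
        esac
        printf '%s: %s\n' "$name" "$(tr -d '\0' < "$f")"
    done
}

fw_dir=/proc/device-tree/ibm,firmware-versions
if [ -d "$fw_dir" ]; then
    print_fw_versions "$fw_dir"
fi
```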

How to change BMC hostname

  • run hostnamectl set-hostname talos-bmc

How to get CPU temperatures / sensors data

  • run sensors, usually part of the package lm_sensors. Make sure to have the kernel module ibmpowernv loaded.
  • run ipmitool sensor
  • run psensor (which has a GUI)

You can compare the output to the thermal specifications in Sec. 1.5 of the Sforza datasheet v1.9.
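Before running sensors, it can help to confirm that ibmpowernv is actually present. A minimal sketch, assuming a loaded module shows up in /proc/modules and a built-in one may show up under /sys/module; the function name and its two optional arguments are hypothetical and exist only to make the check easy to test:

```shell
# Return success if the named kernel module is loaded or (possibly) built in.
# $2 (module list) and $3 (sysfs root) are overridable for testing only.
module_present() {
    mod=$1
    modules_file=${2:-/proc/modules}
    sys_root=${3:-/sys/module}
    grep -q "^$mod " "$modules_file" 2>/dev/null && return 0
    [ -d "$sys_root/$mod" ]
}

if module_present ibmpowernv; then
    echo "ibmpowernv is available; 'sensors' should list the CPU readings"
else
    echo "ibmpowernv not found; try 'modprobe ibmpowernv' as root"
fi
```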

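What should I do after building?

  • Run stress-ng (https://github.com/ColinIanKing/stress-ng) to make sure the system is stable.
  • Keep an eye out for Checkstop errors.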
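A stability run with stress-ng (https://github.com/ColinIanKing/stress-ng) might look like the sketch below. The helper name, the choice of flags, and the 10-minute default are illustrative assumptions, not a recommendation from this FAQ; adjust them to your cooling and patience.

```shell
# Build a stress-ng command line; the flags and the 10-minute duration are
# illustrative choices, not a recommendation from the wiki.
burn_in_cmd() {
    duration=${1:-10m}
    # --cpu 0 starts one CPU stressor per online CPU; --verify checks results.
    echo "stress-ng --cpu 0 --verify --timeout $duration --metrics-brief"
}

burn_in_cmd        # prints the default command for review
# Run it with: $(burn_in_cmd 30m)
```

Printing the command first makes it easy to review before committing the machine to a long run.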

Setting the hardware real-time clock has no effect

If `hwclock --systohc` has no effect (i.e. the output of `hwclock --get` is unchanged), then:

1. From the BMC console, power off the host

2. Run `busctl set-property xyz.openbmc_project.Settings /xyz/openbmc_project/time/owner xyz.openbmc_project.Time.Owner TimeOwner s xyz.openbmc_project.Time.Owner.Owners.Host` on the BMC (note the capitalization: `Host`, not `HOST` as the OpenBMC GitHub issues tell you!)

3. Reboot the BMC

4. Power on the host
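Because the command in step 2 is long and case-sensitive, it can help to assemble it from its parts and review it before pasting it into the BMC console. This is a sketch only: the helper name is hypothetical, and the read-back with busctl get-property assumes the property can be read through the same service, object, and interface.

```shell
# D-Bus names copied from step 2 above.
service=xyz.openbmc_project.Settings
object=/xyz/openbmc_project/time/owner
interface=xyz.openbmc_project.Time.Owner

# Print the busctl command that sets the time owner; $1 is the owner name.
time_owner_cmd() {
    printf 'busctl set-property %s %s %s TimeOwner s %s.Owners.%s\n' \
        "$service" "$object" "$interface" "$interface" "$1"
}

time_owner_cmd Host
# To read the current value back on the BMC:
#   busctl get-property "$service" "$object" "$interface" TimeOwner
```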

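References

1. Raptor Computing Systems Twitter, 2023 Jan 6. https://twitter.com/RaptorCompSys/status/1611162556592275457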