Difference between revisions of "Compiling Firmware"


Revision as of 13:44, 12 January 2019

The following steps can be used to compile and update the firmware on Talos™ II-based solutions. This page is maintained by both Raptor CS and community members.

Requirements

  • At least 25GB of free hard drive space
  • 16GB of free RAM

Operating System

The build system (op-build) has been primarily tested using Debian stretch. If you are on a different operating system such as Fedora 28, a Debian chroot should be used:

sudo yum install debootstrap dpkg
sudo debootstrap stretch debian-chroot http://httpredir.debian.org/debian
sudo mount -t proc none debian-chroot/proc/
sudo mount -o bind /sys/ debian-chroot/sys/
sudo mount -o bind /dev/shm/ debian-chroot/dev/shm/

Enter the chroot and install the needed packages:

sudo chroot debian-chroot/
apt-get install software-properties-common locales
# Packages needed for PNOR builds
apt-get install cscope ctags libz-dev libexpat-dev \
          python texinfo \
          build-essential g++ git bison flex unzip \
          libssl-dev libxml-simple-perl libxml-sax-perl libxml2-dev libxml2-utils xsltproc \
          wget bc rsync
# Packages needed for OpenBMC builds
apt-get install git build-essential libsdl1.2-dev texinfo gawk chrpath diffstat

Create a chroot user:

useradd -m build-user -s /bin/bash
su build-user
cd

You can now use the chroot to build the firmware.

To enter the chroot in the future, you can run the following from a regular terminal:

sudo chroot debian-chroot/
su build-user
cd

Building the PNOR Firmware

Grabbing the sources

Raptor CS maintains a public git repository containing the complete source code for the firmware. To download the source code:

git clone -b raptor-v1.05 --recursive https://scm.raptorcs.com/scm/git/talos-op-build

Note: The master branch is often in a non-functional state. The latest firmware branch (raptor-v1.05 at the time of this update) should be used instead.

Building the firmware

Before building the firmware, all needed support packages must be installed. Please see the README.md file for directions on installing the needed packages.

Once the packages are installed, the firmware can be built using the following commands:

cd talos-op-build
. op-build-env
op-build talos_defconfig
op-build

To rebuild an individual package (such as hostboot) and recreate the pnor image, the following can be run:

op-build hostboot-rebuild openpower-pnor-rebuild

Updating the firmware

Copy the firmware to the BMC:

scp ./output/images/talos.pnor root@<talos-openbmc>:/tmp/


At this point, you should connect two SSH sessions to OpenBMC. In the first session, run the following to display the console during bootup:

ssh -p 2200 root@<talos-openbmc>

The console log will be useful in debugging any issues with the firmware that could occur.

In the second BMC session, ensure the system is off by running obmcutil state. You should see the following:

ssh root@<talos-openbmc>
root@talos:~# obmcutil state
CurrentBMCState     : xyz.openbmc_project.State.BMC.BMCState.Ready
CurrentPowerState   : xyz.openbmc_project.State.Chassis.PowerState.Off
CurrentHostState    : xyz.openbmc_project.State.Host.HostState.Off

The CurrentHostState must be Off before continuing with the procedure. If the CurrentHostState is not Off, please turn off the machine:

obmcutil chassisoff
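The state check can also be scripted from the build machine. This is a sketch only: host_is_off is a hypothetical helper that reads obmcutil state output on standard input and assumes the output format shown above.

```shell
# Hypothetical helper: succeeds only if the CurrentHostState line
# in `obmcutil state` output (read from stdin) ends in ".Off".
host_is_off() {
    awk '/^CurrentHostState/ { state = $NF }
         END { exit (state ~ /\.Off$/ ? 0 : 1) }'
}
```

For example: ssh root@<talos-openbmc> obmcutil state | host_is_off && echo "host is off". If the CurrentHostState line is missing entirely, the helper reports failure rather than assuming the host is off.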

Testing the firmware

To test the firmware before flashing it, a modified mboxd binary can be used (see https://gerrit.openbmc-project.xyz/#/c/openbmc/mboxbridge/+/14384/).

First, mboxd must be terminated:

systemctl stop mboxd

Next, restart mboxd with the additional -b argument:

mboxd -f 64M -w 1M -b /tmp/talos.pnor

Finally, you can test the updated PNOR image by starting the machine:

obmcutil poweron

Once you've verified that everything is working, stop the machine:

obmcutil poweroff

NOTE: The system must be in the off state before proceeding. This should be verified with obmcutil state as shown earlier.

Before continuing to flash the new PNOR image, the original mboxd must be restarted. Press Ctrl+C to terminate the manually started mboxd, then restart the service using systemctl:

systemctl start mboxd

You can now proceed with flashing the firmware.

Flashing the firmware

Once the system is off, perform the update:

pflash -E -p /tmp/talos.pnor

Start the machine:

obmcutil poweron

Note: the machine may reboot multiple times after the initial flash.
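Because flashing while the host is running must be avoided, the power-state check and the pflash call can be combined into one guarded step on the BMC. This is a sketch under assumptions: flash_pnor is a hypothetical wrapper, and it relies on the obmcutil state output format shown earlier.

```shell
# Hypothetical wrapper: refuses to run pflash unless the image exists
# and `obmcutil state` reports the host as Off.
flash_pnor() {
    image=${1:-/tmp/talos.pnor}
    if [ ! -f "$image" ]; then
        echo "missing image: $image" >&2
        return 1
    fi
    if obmcutil state | grep -q '^CurrentHostState.*\.Off$'; then
        pflash -E -p "$image"
    else
        echo "host is not off; run 'obmcutil chassisoff' first" >&2
        return 1
    fi
}
```

Running flash_pnor /tmp/talos.pnor then performs the same update as the pflash command above, but only when the host state is Off.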

Troubleshooting

Always upgrade PNOR and BMC together

Many mismatched PNOR/BMC version combinations lead to weird failures.

Try downgrading the PNOR+BMC firmware

Firmware package 1.04 seems the most reliable at updating the SBE SEEPROM inside the POWER9 chip package.

Always use PROC 0 socket for SBE updates

The BMC firmware and/or FSI driver seem to forget to update the SBE SEEPROM in the PROC1 (secondary) socket, leading to a bootup with only PROC0 active. When you get a brand new chip, you need to install it in PROC0, leaving PROC1 empty, wait for the double-reboot to update the SEEPROM, and then you can move that chip to the PROC1 socket if you like.

Try unplugging the HSF fan power during SBE update

Not kidding about this. The BMC is insanely complicated -- it's got an entire operating system in there for some reason. It even has systemd. The BMC's systemd often gets into a funky loop restarting Hwmon over and over and over, interrupting the SBE SEEPROM reflash every time it does this. Unplugging the PROC0 HSF 4-pin connector gets it to fail hard (due to inability to read the tachometer) and stay failed so the SBE update can proceed. Ugly as this is, it's easier than trying to figure out what systemd thinks it's doing.

SBE_MASTER_VERSION_DOWNLEVEL

If you see the following message reported in the console, then the SBE update process did not work as expected:

 16.74709|Error reported by sbe (0x2200) PLID 0x90000008
 16.74823|  SBE Image Version Miscompare with Master Target
 16.74824|  ModuleId   0x0d SBE_MASTER_VERSION_COMPARE
 16.74825|  ReasonCode 0x2215 SBE_MASTER_VERSION_DOWNLEVEL
 16.74826|  UserData1  Master Target HUID : 0x0000000000050000
 16.74826|  UserData2  Master Target Loop Index : 0x0000000000000000

The machine needs to be reset to finish the update procedure, using the following commands:

obmcutil chassisoff
systemctl stop xyz.openbmc_project.State.Host.service
systemctl start xyz.openbmc_project.State.Host.service
obmcutil poweron

The update should now complete as expected.

A bug report is open[2] to track this issue.
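If the console output is being saved to a file, the condition described above can be detected automatically. This is a sketch only: sbe_update_incomplete is a hypothetical helper, and it simply looks for the reason code string shown in the log excerpt.

```shell
# Hypothetical helper: succeeds if the console log on stdin contains
# the SBE_MASTER_VERSION_DOWNLEVEL reason code.
sbe_update_incomplete() {
    grep -q 'SBE_MASTER_VERSION_DOWNLEVEL'
}
```

For example, assuming the console session was captured to a file named console.log (a hypothetical name): sbe_update_incomplete < console.log && echo "reset required to finish the SBE update".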

internal compiler error: Killed

Building the hostboot source code requires a large amount of RAM. If your machine runs out, you may see an error similar to the following:

powerpc64le-buildroot-linux-gnu-g++.br_real: internal compiler error: Killed (program cc1plus)

To continue you have a few options:

  • Reduce the number of parallel jobs being run by appending -j<num> to your build command line
op-build -j4
  • Increase the swap space
  • Install additional RAM
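One way to pick a value for -j is to derive it from the available memory. The sketch below is an illustration, not a measured recommendation: suggest_jobs is a hypothetical helper, and the default of 2 GB per compiler job is an assumption that may need to be raised for hostboot.

```shell
# Hypothetical helper: suggest_jobs MEM_GB [GB_PER_JOB]
# Divides available memory by an ASSUMED per-job budget (default 2 GB)
# and never returns less than 1.
suggest_jobs() {
    mem_gb=$1
    gb_per_job=${2:-2}
    n=$(( mem_gb / gb_per_job ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}
```

For example, on a machine with 16 GB of free RAM, op-build -j"$(suggest_jobs 16 4)" would run four parallel jobs.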

Building the OpenBMC firmware

Grabbing the sources

Raptor CS maintains a public git repository containing the complete source code for the firmware. To download the source code and check out the tag:

 git clone https://git.raptorcs.com/git/talos-openbmc
 cd talos-openbmc
 git checkout raptor-v1.07

Building the firmware

Before building the firmware, all needed support packages must be installed. Please see the README.md file for directions on installing the needed packages.

Once the packages are installed, the firmware can be built using the following commands:

cd talos-openbmc
export TEMPLATECONF=meta-openbmc-machines/meta-openpower/meta-rcs/meta-talos/conf
. openbmc-env
bitbake obmc-phosphor-image

The resulting firmware can be found in the tmp/deploy/images/talos/ directory.

If mboxd fails to build, you may need to patch mboxd.bb (see https://github.com/openbmc/openbmc/issues/2780).

Updating the firmware

Once the firmware has been built, the resulting kernel and rofs binaries need to be copied to the /run/initramfs/ directory on the BMC:

scp tmp/deploy/images/talos/image-rofs tmp/deploy/images/talos/image-kernel root@<talos-openbmc>:/run/initramfs/

Once the images have been transferred, reboot the BMC:

ssh root@<talos-openbmc> reboot

OpenBMC may take a while to reboot. Once complete, you will be able to log back in via ssh.

BMC Recovery procedure via U-Boot

Purpose

This guide explains how to debrick the BMC when the BMC has been rendered inoperable, for example due to a defective firmware update.

Applicability

All RCS OpenPOWER systems.

Overview

There are three means of debricking the BMC:

  • Remove the BMC SPI flash chip and reflash it with a flash programmer
    • Note: flashrom versions earlier than 1.1 do not support the BMC flash chip
  • Flash new BMC firmware via U-Boot TFTP (requires that U-Boot is still intact on the flash)
  • Flash new BMC firmware via serial port (requires proprietary BMC chip vendor tool)

Reset persistent storage

This is applicable if the persistent storage (SSH keys, passwords, IPMI error logs, etc.) has somehow been corrupted, but the read-only data (U-Boot, kernel, initramfs) is intact. This is also the easiest and least invasive recovery method if you have forgotten the BMC password.

From the U-Boot prompt on the BMC serial console, run the following (must be run quickly, to avoid watchdog timeouts):

printenv

Look at the bootargs variable in the output, then set the same variable again with overlay-filesystem-in-ram inserted before the rw keyword.

Example for Blackbird HW version 1.01:

setenv bootargs console=ttyS4,115200n8 root=/dev/ram overlay-filesystem-in-ram rw

Then run boot to continue the boot process.

This will start the BMC with default settings, but the existing persistent data has not yet been cleared. To clear it, log in as root, then run:

flash_eraseall /dev/mtd/rwfs

reboot
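Since the U-Boot prompt must be used quickly, the modified bootargs line can be prepared beforehand on another machine and pasted in. This is a sketch only: add_overlay_arg is a hypothetical helper that runs on a workstation, not inside U-Boot, and it assumes rw appears as a separate word in the existing bootargs.

```shell
# Hypothetical workstation-side helper: inserts overlay-filesystem-in-ram
# before the standalone "rw" word of a bootargs string. The input is
# returned unchanged if the keyword is already present or "rw" is absent.
add_overlay_arg() {
    case " $1 " in
        *" overlay-filesystem-in-ram "*) printf '%s\n' "$1"; return 0 ;;
    esac
    out=
    for word in $1; do
        if [ "$word" = rw ]; then
            out="$out overlay-filesystem-in-ram"
        fi
        out="$out $word"
    done
    printf '%s\n' "${out# }"
}
```

Calling add_overlay_arg with the bootargs value printed by printenv gives the string to use after setenv bootargs.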

Flash new BMC firmware via U-Boot TFTP

Note: While these instructions have been successfully applied in practice, they are still preliminary. Ask questions in IRC if you are unclear on what to do!

In the event of a failure when updating the BMC, but with a functioning U-Boot, you can still recover by using U-Boot to bootstrap the BMC manually, loading a boot image over the network or the BMC serial port.

If your BMC flash is corrupted to the extent that U-Boot does not load properly, these instructions will not work; you will need to remove and reflash the BMC flash chip externally, or flash new firmware via serial port.

  • Prepare a TFTP server, and place image-bmc, image-rofs, and image-kernel in the root.
  • Connect a serial console to the BMC serial port (J7701, serial port bracket required). The serial port configuration is 115200,8n1.
  • Disconnect and reconnect power to the machine to force a BMC restart. Press a key to interrupt auto-boot when prompted.
  • If you are having trouble with U-Boot resetting while you are trying to run these steps, have a slow network, or you are going to be loading over serial, you can disable the FPGA watchdog.
  • Run dhcp x.x.x.x:image-bmc, replacing x.x.x.x with the IP address of your TFTP server. This will load a copy of the stock boot image into RAM.
  • Run bootm 83080000. This will prepare and boot off of the loaded virtual image.
  • If your rofs partition is not functional, you will be dropped into the systemd emergency shell at this point. Try both the password you set and the default password; which one works depends on the state of the rwfs partition. If the BMC boots up properly instead of dropping you into the emergency shell, the problem is probably in your kernel partition, and you can retry flashing your image-kernel using the normal procedure. (The rest of these instructions are for the systemd emergency shell.)
  • mount -t tmpfs none /tmp
  • run udhcpc to get an IP address. (TODO: verify that this is the actual command that you run. Do you have to specify the network interface too?)
  • cd /tmp
  • tftp -g -r image-rofs x.x.x.x
  • tftp -g -r image-kernel x.x.x.x
  • IMPORTANT: Use md5sum, sha1sum, or sha256sum to verify successful transfer of image-rofs and image-kernel! tftp is a very barebones protocol and relies on transport layer checksumming, which is optional and not always available in UDP!
  • Verify that the output of cat /sys/class/mtd/mtd3/name is kernel and the output of cat /sys/class/mtd/mtd4/name is rofs. We will be flashing mtd partitions directly in the next step and this is the last chance to verify that they will be flashed to the correct partition.
  • flashcp -v image-kernel /dev/mtd3
  • flashcp -v image-rofs /dev/mtd4
  • (TODO: Describe how to reset rwfs in case it was damaged as well?) note: the kernel param for bypassing rwfs is "overlay-filesystem-in-ram". Append it to the existing boot-args before running the bootm command. This can also be used as part of a password reset procedure.
  • After the flash is complete, you can restart the BMC and it should boot successfully.
  • (TODO: Discussion of using Kermit to upload the image without network access) note: I (Bdragon) have successfully done a ram-only boot using cu's built in xmodem support (escape sequence ~X) to do an image transfer into RAM over the BMC serial interface.
  • (TODO: Discuss using u-boot's built in cmp tool to perform basic validation of the u-boot image against a second copy loaded into RAM.)
  • (TODO: Load recovery images over USB?) note: The onboard USB port is connected to the USB switch after all, so this might be problematic.
  • (TODO: Discussion of u-boot memory map) Short version is: flash lives at 0x20000000 and default base address for the memory loading tools is 0x83000000. So add 0x63000000 to any flash address to get the equivalent address for an image-bmc file loaded into RAM. For example, the bootable image of a loaded image-bmc is at 0x83080000.
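Two of the steps above lend themselves to small helpers: translating a flash address to its RAM equivalent, and verifying a transferred image against a known checksum. Both functions below are hypothetical sketches. flash_to_ram_addr applies the 0x63000000 offset described in the memory map note, and verify_sha256 assumes that sha256sum is available and that the expected digest was computed on the TFTP server beforehand.

```shell
# Hypothetical helper: flash_to_ram_addr FLASH_ADDR
# Adds the 0x63000000 offset between the flash base (0x20000000) and the
# default load address (0x83000000) and prints the result in hex.
flash_to_ram_addr() {
    printf '0x%08x\n' $(( $1 + 0x63000000 ))
}

# Hypothetical helper: verify_sha256 FILE EXPECTED_HEX_DIGEST
# Succeeds only if the SHA-256 digest of FILE equals the expected value.
verify_sha256() {
    actual=$(sha256sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}
```

For example, flash_to_ram_addr 0x20080000 prints 0x83080000, the address passed to bootm above. In the emergency shell, verify_sha256 image-kernel followed by the digest from the server should succeed before flashcp is run.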

Flash new BMC firmware via serial port (Open Source Method)

Tools required:

  • BMC serial port
  • A secure computer with a serial port (USB to serial works fine), preferably running Linux (Linux on POWER is fine).

Software:

  • flashrom with serial ASpeed flash support from [3]

Procedure:

  1. Build flashrom on your Linux or BSD PC.
  2. Extract the BMC firmware bundle.
  3. Set the FPGA RUN/RESET switch to RESET.
    • On a Blackbird, this switch is located roughly between the flash chips and the PCIe slots. If you have a GPU installed in the x16 slot, you may need to remove it.
  4. Apply standby power to the mainboard
  5. Run the following command ./flashrom --verbose --programmer 'ast2400:serial=/dev/ttyUSB0,cpu=halt,spibus=0' -c MX25L25635F/MX25L25645E/MX25L25665E -w image-bmc
    • If your serial interface can handle a baud rate of 921600, add the parameter: high_speed_uart=true
    • NOTE: If you are using updated firmware (Talos II/Lite 2.0 beta firmware or later) or are using a Blackbird, U-boot will shut down access to this interface after about 3 seconds of standby power, so you will need to run the command *immediately* after plugging in the power supply to bypass this.
  6. Be Patient: this will take a *long* time.
  7. Once the flash has been verified, set the FPGA RUN/RESET switch to RUN.
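Because the command in step 5 may need to be run within seconds of applying standby power, it helps to assemble it in advance. The helper below is a hypothetical sketch that builds the --programmer argument. It assumes that high_speed_uart=true is appended as one more comma-separated option in the programmer string, which is the usual flashrom parameter syntax but is not confirmed by this page for the ASpeed serial branch.

```shell
# Hypothetical helper: flashrom_programmer [SERIAL_DEVICE] [fast]
# Prints the --programmer argument used in step 5. Passing "fast" appends
# high_speed_uart=true (ASSUMED to be a comma-separated option).
flashrom_programmer() {
    dev=${1:-/dev/ttyUSB0}
    spec="ast2400:serial=$dev,cpu=halt,spibus=0"
    if [ "${2:-}" = fast ]; then
        spec="$spec,high_speed_uart=true"
    fi
    printf '%s\n' "$spec"
}
```

The result can be used as ./flashrom --verbose --programmer "$(flashrom_programmer /dev/ttyUSB0)" followed by the -c and -w arguments from step 5.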

Flash new BMC firmware via serial port (Proprietary Method)

This method was discovered by Centurion Dan as an alternative to pulling and reflashing the BMC SPI chip after a failed update had corrupted/wiped U-Boot.

Tools required:

  • BMC serial port
  • An x86 computer with a serial port (usb to serial works fine) - preferably running linux.

Software:

 ASPEED SOC Flash Utility --- The utility has been moved to Document Download Page for ASPEED registered developers to access.

Procedure:

  1. Unzip the SOC FLASH Utility on your other computer, and unzip the appropriate SOC Flash Utility bundle for that computer.
  2. Extract the BMC firmware bundle.
  3. Run the following command ./socflash -s option=u comport="4" cs=0 if=image-u-boot gpio_b=S71 gpio_a=S70 option=f
    • You can drop the option=f for a slower but verified write process
    • If your serial interface can handle a baud rate of 921600, add the parameter: baudrate=921600
    • If you want to see what is going on, you can strace it by prepending strace -e trace=open,close,read,write to the command above.
    • NOTE: If you are using updated firmware (Talos II/Lite 2.0 beta firmware or later) or are using a Blackbird, U-boot will shut down access to this interface after about 3 seconds of standby power, so you will need to run the command *immediately* after plugging in the power supply to bypass this.
  4. Be Patient: it took me about 45 minutes to complete the flash process.

Notes:

  • gpio_b=S71 and gpio_a=S70 are used to turn off the FPGA watchdog timer before the flash process and to re-enable it after the process has completed.
  • On a Blackbird, replace gpio_b=S71 with gpio_b=G01 and gpio_a=S70 with gpio_a=G00. Due to the new HDMI interface, the BMC watchdog GPIO was moved to a different pin on the AST2500.
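The board-specific GPIO parameters from the notes above can be selected with a small lookup. This is a sketch only: socflash_gpio_args is a hypothetical helper, and the board name talos is an assumed label for the boards that use the S71/S70 pins from the command in step 3.

```shell
# Hypothetical helper: socflash_gpio_args BOARD
# Prints the watchdog GPIO parameters for socflash. "talos" is an ASSUMED
# label for the non-Blackbird boards that use the S71/S70 pins.
socflash_gpio_args() {
    case $1 in
        blackbird) echo "gpio_b=G01 gpio_a=G00" ;;
        talos)     echo "gpio_b=S71 gpio_a=S70" ;;
        *)
            echo "unknown board: $1" >&2
            return 1
            ;;
    esac
}
```

On a Blackbird, for example, the output of socflash_gpio_args blackbird replaces the gpio_b and gpio_a parameters in the socflash command from step 3.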

Troubleshooting

TODO