Power Management on Debian and FreeBSD
This article compares power management on FreeBSD and Debian. The two operating systems use different strategies for managing SMT modes and ship with different default power management configurations. This allows for close analysis of the effects on a single CPU, which can then be scaled up to multicore systems. The analysis uses two metrics: total energy consumed, and distance from "perfect computing," i.e., zero time with zero energy consumed.
Operating Environment
Hardware
The configuration for QEMU VMs will be discussed upon use. They will be hosted on the bare metal machine running Debian.
Software
Debian
perf taskset cpupower frequency-set -u 3.05GHz
FreeBSD
# uname -a
FreeBSD FreeBSD_BE 14.3-RELEASE FreeBSD 14.3-RELEASE releng/14.3-n271432-8c9ce319fef7 GENERIC powerpc
pmcstat cpuset debug.cpufreq.lowest
SMT profiling with Debian and perf
There is an extensive article on profiling SMT performance on Debian: SMT Profiling with Debian and perf. Additional information will be provided below as needed.
SMT profiling with FreeBSD and pmcstat
pmcstat
pmcstat is the standard tool for accessing Performance Monitor Counter (PMC) events on FreeBSD. For profiling SMT, it is the equivalent of 'perf stat' above. Its syntax differs slightly from perf but is relatively straightforward: flags, followed by the command and its arguments. The man pages (pmc(3), pmcstat(8)) are helpful, and there is some built-in help. The Additional Resources section has links to other discussions of how to use pmcstat.
Although pmcstat can be run as a regular user, root is necessary to configure FreeBSD to load the hwpmc module:
sysrc kld_list+=hwpmc
You will need to reboot. Once that is done, pmcstat will be available.
$ pmcstat -L | wc -l
889
The -L flag lists all of the PMC events that can be monitored. An entry in the list does not guarantee that the event works, and many generic aliases do not work on the POWER9 platform. The -u flag gives a short description of an entry:
$ pmcstat -u pm_inst_cmpl
pm_inst_cmpl: Number of PowerPC Instructions that completed
pmcstat can monitor either the whole system or a specific application. For this profiling, application-specific monitoring will be used. There is a discussion of the application benchmark at SMT Profiling with Debian and perf.
While the application is executing, pmcstat prints the PMC counts periodically. The default is -w 5, which prints every five seconds, with each line showing the increment for that interval. Intervals can be fractions of a second. The -C flag enables cumulative counting. -v adds verbosity, although it does not always produce additional output. For application sampling, -p specifies the PMC event to be monitored; it is given once per event, as many times as needed.
A typical use is to count the number of completed (not issued or canceled) instructions, sampling once per second:
$ time pmcstat -w 1 -v -p pm_inst_cmpl ./cmp_mp -l 12 -j 2 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
-l optarg = 12
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
# p/pm_inst_cmpl
7521205
0
0
0
0 number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
973997
5.79 real 0.00 user 0.01 sys
From the earlier discussion of the benchmark software, the output from the comparison code should be recognizable. Separating the pmcstat-specific output:
# p/pm_inst_cmpl
7521205
0
0
0
0
...
973997
In the first second, 7,521,205 instructions completed. The next four intervals report zero completed instructions even though the benchmark was still running, and the final line reports 973,997, so the event description is perhaps incomplete. This will be explored later.
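If each line printed without -C is read as a per-interval increment (an assumption based on the description of -w above, not something confirmed here), the run total is simply the sum of the lines. A minimal Python sketch using the counts shown above:

```python
from itertools import accumulate

# Per-interval pm_inst_cmpl counts copied from the run above.
# Assumption: each line is an increment for its interval, not a running total.
interval_counts = [7521205, 0, 0, 0, 0, 973997]

def running_totals(increments):
    """Return the cumulative count after each sampling interval."""
    return list(accumulate(increments))

totals = running_totals(interval_counts)
print("cumulative counts:", totals)
print("run total:", totals[-1])
```

Under that assumption the run total is 8,495,202, which is the same order of magnitude as the cumulative (-C) count of 8,460,687 shown in the benchmarking section.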
Baseline profiling
Profiling FreeBSD with pmcstat shows both a very different SMT usage and a strong hint that something has broken in the performance monitoring code. Profiling was actually quite difficult, and the results presented here should be viewed as incomplete. For example, there is no -r/--repeat functionality in pmcstat. To profile code 100 times, it must be run 100 times. To aggregate results, the user could write a script to sum the results and compute the variance. This was not done for this article because too many other things seem broken.
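Such an aggregation script might look like the following Python sketch. It was not used for this article. The helper names are hypothetical, and it assumes that pmcstat is run with -C so that the last all-numeric line of output is the run total, and that merging stdout and stderr captures the counts.

```python
import re
import statistics
import subprocess

# A line made up only of whitespace-separated integers is treated as counts.
COUNT_LINE = re.compile(r"^\s*\d+(?:\s+\d+)*\s*$")

def count_rows(text):
    """Return every all-integer line in the output as a list of ints."""
    return [[int(field) for field in line.split()]
            for line in text.splitlines() if COUNT_LINE.match(line)]

def run_once(cmd):
    """Run one pmcstat command and return its last row of counts.

    Assumes -C is in cmd, so the last row is the cumulative total.
    """
    done = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True, check=True)
    return count_rows(done.stdout)[-1]

def summarize(rows):
    """Per-event sum, mean, and sample variance across repeated runs."""
    summary = []
    for column in zip(*rows):
        variance = statistics.variance(column) if len(column) > 1 else 0.0
        summary.append({"sum": sum(column),
                        "mean": statistics.mean(column),
                        "variance": variance})
    return summary
```

Calling run_once in a loop with the full pmcstat command as an argument list, then passing the collected rows to summarize, would stand in for perf's -r. A benchmark that itself prints a bare integer on a line would confuse count_rows, so the filter may need tightening for other programs.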
Identical profiling as above could not be done. There is no equivalent to "L1-dcache-prefetches" in pmcstat. In defense of pmcstat, it is not explicit that PM_L1_DCACHE_RELOAD_VALID is the same, whereas -p pm_ic_pref_req appears to be the correct association with PM_IC_PREF_REQ.
Additionally, the events often had to be provided in a specific order:
$ pmcstat -w 1 -p pm_inst_cmpl -p pm_cyc -p pm_ic_pref_req -p pm_run_cyc_st_mode ./cmp_mp -l 12 -j 8 -s CVD_OM002793.1 CVD_OM003364.1
pmcstat: ERROR: Cannot allocate process-mode pmc with specification "pm_run_cyc_st_mode": Invalid argument
Swap the order of -p pm_ic_pref_req and -p pm_run_cyc_st_mode:
$ pmcstat -w 1 -p pm_inst_cmpl -p pm_cyc -p pm_run_cyc_st_mode -p pm_ic_pref_req ./cmp_mp -l 12 -j 8 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
...
# p/pm_inst_cmpl p/pm_cyc p/pm_run_cyc_st_mode p/pm_ic_pref_req
1416599609 1592725422 0 0
16689700532 30241803692 0 0
...
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_8_12_0_100_nuc.csv
688749406117 879561503804 0 0
pmcstat appears to lose track of the PMC counters:
$ pmcstat -w 2 -p pm_inst_cmpl -p pm_cyc -p pm_run_cyc_st_mode -p pm_run_cyc_smt4_mode -p pm_ic_pref_req ./cmp_mp -l 12 -j 2 -s CVD_OM002793.1 CVD_OM003364.1
...
# p/pm_inst_cmpl p/pm_cyc p/pm_run_cyc_st_mode p/pm_run_cyc_smt4_mode p/pm_ic_pref_req
7513021 17473694 0 17473651 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
7869 24396 0 25015 0
...
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
4591758 5251488732 0 5251487575 0
Several sample instances show no additional completed instructions after the previous sampled time (in this case, every two seconds).
Note that the totals on the final line are radically different from the sums of the individual samples. The initial thought is that perhaps the interval sampling is reporting "per thread" while the final number is across all threads. Recall that this is a fixed calculation benchmark. While testing Debian, the total number of instructions was always 56.6 billion +/- 200 million. The total above is 4,591,758, or about 4.6 million instructions. Additionally, when run as 32 threads:
$ pmcstat -w 2 -p pm_inst_cmpl -p pm_cyc -p pm_run_cyc_st_mode -p pm_run_cyc_smt4_mode -p pm_ic_pref_req ./cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1
...
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_32_12_0_100_nuc.csv
# p/pm_inst_cmpl p/pm_cyc p/pm_run_cyc_st_mode p/pm_run_cyc_smt4_mode p/pm_ic_pref_req
13928314726504022640 9893848149867607943 0 9893848149867439008 0
the result is a remarkable 13.9 quintillion instructions. This loss of PMC counters appears to occur around six threads:
| threads | instructions | cycles | insn/cycle |
|---|---|---|---|
| 1 | 8448738 | 19074411 | 0.44 |
| 2 | 8485336 | 19281511 | 0.44 |
| 4 | 8723238 | 19391797 | 0.45 |
| 6 | 588499149317 | 648675753231 | 0.91 |
| 8 | 1411229659037 | 1472885892760 | 0.96 |
Therefore, some inferences have to be made about the results, and the disclaimer is made that this analysis has to be viewed as incomplete.
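One way to make that inference explicit is to compare each reported count against the instruction total seen on Debian. The Python sketch below does this for the table above. It assumes the FreeBSD build of the benchmark should complete roughly the same number of instructions as the Debian build, which was not verified, and the function name is illustrative.

```python
# Expected total from the Debian runs described above: 56.6 billion +/- 200 million.
# Assumption: the FreeBSD build executes about the same number of instructions.
EXPECTED_INSTRUCTIONS = 56_600_000_000
TOLERANCE = 200_000_000

# (threads, pm_inst_cmpl) pairs copied from the table above.
observed = [
    (1, 8448738),
    (2, 8485336),
    (4, 8723238),
    (6, 588499149317),
    (8, 1411229659037),
]

def is_plausible(count, expected=EXPECTED_INSTRUCTIONS, tolerance=TOLERANCE):
    """True when a count falls inside the expected band."""
    return abs(count - expected) <= tolerance

for threads, count in observed:
    ratio = count / EXPECTED_INSTRUCTIONS
    status = "ok" if is_plausible(count) else "suspect"
    print(f"{threads:>2} threads: {count:>15} ({ratio:.4g}x expected) {status}")
```

Every row in the table is flagged as suspect: the counts at one to four threads are far below the expected total, and the counts at six and eight threads are far above it.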
Benchmarking prefetching and instructions per cycle
Even with the loss of accuracy in the results, some analysis can be done. Rather than run the comparison benchmark 100 times, only a single run for each thread count was done. As pmcstat does not appear to have a flag to report elapsed time, the entire command was run through "time." There was a small but measurable difference in overall performance due to this. In Debian, the benchmark ran faster when running perf, but with FreeBSD, the benchmark ran slower in pmcstat.
A typical benchmark run looked like:
$ time pmcstat -C -w 30 -p pm_inst_cmpl -p pm_cyc -p pm_run_cyc_st_mode -p pm_run_cyc_smt4_mode -p pm_ic_pref_req ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
num args: 9
-l optarg = 12
multithreading disabled
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
# p/pm_inst_cmpl p/pm_cyc p/pm_run_cyc_st_mode p/pm_run_cyc_smt4_mode p/pm_ic_pref_req
8460687 19045917 0 19043993 0
20.75 real 0.00 user 0.01 sys
This run uses -j 2. The columns report completed instructions, cycles, cycles run in SMT1 mode, cycles run in SMT4 mode, and instruction cache prefetches. The sampling interval was set longer than the benchmark would run, and cumulative results were displayed (this should be redundant, but is included for accuracy). A high degree of variance was observed during testing. It was difficult to reliably repeat results, particularly when the thread count exceeded the thread-core count. This strongly suggests a high degree of variability in switching threads on cores. Some results were rejected and the benchmark rerun. The -r/--repeat functionality from perf would be a welcome feature here. Instructions per cycle were calculated manually.
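The manual calculation is a single division per run. As an illustration only, the following Python sketch pairs the header line with the count line from the run above and derives instructions per cycle, plus the share of cycles reported as SMT4. The function name is hypothetical.

```python
def parse_counts(header, row):
    """Pair the event names in a pmcstat header line with one row of counts."""
    # Header fields look like "p/pm_inst_cmpl"; keep the part after the slash.
    names = [field.split("/", 1)[1] for field in header.lstrip("# ").split()]
    values = [int(field) for field in row.split()]
    if len(names) != len(values):
        raise ValueError("header and row have different field counts")
    return dict(zip(names, values))

# Header and cumulative counts copied from the run above.
header = ("# p/pm_inst_cmpl p/pm_cyc p/pm_run_cyc_st_mode "
          "p/pm_run_cyc_smt4_mode p/pm_ic_pref_req")
row = "8460687 19045917 0 19043993 0"

counts = parse_counts(header, row)
ipc = counts["pm_inst_cmpl"] / counts["pm_cyc"]
smt4_share = counts["pm_run_cyc_smt4_mode"] / counts["pm_cyc"]

print(f"instructions per cycle: {ipc:.2f}")
print(f"SMT4 share of cycles:   {smt4_share:.4%}")
```

For this run the result is 0.44 instructions per cycle, with more than 99.98% of the counted cycles attributed to SMT4 mode.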
Just to establish a comparison of raw performance, the benchmark was run without performance monitoring on both Debian and FreeBSD (single sample, displaying real time):
| Thread count | Debian (real time) | FreeBSD (real time) |
|---|---|---|
| 1 thread | 11.21s | 20.74s |
| 2 threads | 5.79s | 10.91s |
| 8 threads | 1.46s | 2.90s |
| 16 threads | 1.11s | 1.60s |
| 32 threads | 0.69s | 0.84s |
Immediately obvious is a wide disparity in performance at low thread count, which closes at high thread count.
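The disparity can be quantified directly from the table. The short Python sketch below computes the FreeBSD-to-Debian ratio and the speedup over the single-thread run for each thread count. The function names are illustrative, and the inputs are the single samples above, so the figures carry the same uncertainty.

```python
# Real (wall-clock) seconds copied from the table above, keyed by thread count.
debian = {1: 11.21, 2: 5.79, 8: 1.46, 16: 1.11, 32: 0.69}
freebsd = {1: 20.74, 2: 10.91, 8: 2.90, 16: 1.60, 32: 0.84}

def slowdown(threads):
    """FreeBSD elapsed time as a multiple of Debian elapsed time."""
    return freebsd[threads] / debian[threads]

def speedup(times, threads):
    """Speedup relative to the single-thread run on the same OS."""
    return times[1] / times[threads]

for n in sorted(debian):
    print(f"{n:>2} threads: FreeBSD/Debian = {slowdown(n):.2f}, "
          f"speedup Debian {speedup(debian, n):.1f}x, "
          f"FreeBSD {speedup(freebsd, n):.1f}x")
```

On these samples FreeBSD takes about 1.85 times as long as Debian at one thread and about 1.22 times as long at 32 threads.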
Across all test runs, pmcstat always reported 0 for pm_run_cyc_st_mode and 0 for pm_ic_pref_req, while the pm_run_cyc_smt4_mode count was, to two decimal places, 100% of the cycle count. Without specifically looking at the POWER9 initialization code, it appears all of the cores are brought up in SMT4 mode, and no de-tuning to SMT1 mode is done under light loads.
The graphs tell the story:
There is no uniform decrease in instructions per cycle on FreeBSD, and while the elapsed time improves with increased threads, FreeBSD is always behind Debian.
Turning off and on SMT4
There does not appear to be any userland utility for setting the SMT mode in FreeBSD. No testing was done on different modes.
Power consumption for FreeBSD in always-on SMT4 mode
As documented earlier, some of the benchmark runs are too short to obtain reliable power readings. The long benchmark comparing EBV against chromosome 1 was used. FreeBSD does not have a userland equivalent of "sensor" so only the OpenBMC values were used. Idle power consumption was measured at 45 watts. Maximum power consumption was 83-84 watts.
Notable is the linear increase in power through 32 threads, with non-linear decreases in elapsed time.
Analysis Metrics
Total energy consumed
Geometric distance from perfect computing
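As a hedged illustration of the two metrics named in the introduction, the Python sketch below treats energy as average power multiplied by elapsed time, and "distance from perfect computing" as the Euclidean distance from the origin of a time/energy plane. Whether and how the two axes are normalised is not stated in this article, so the scale parameters are assumptions, and the example values are hypothetical rather than measured.

```python
import math

def energy_joules(average_watts, elapsed_seconds):
    """Energy consumed by a run: average power multiplied by elapsed time."""
    return average_watts * elapsed_seconds

def distance_from_perfect(elapsed_seconds, joules,
                          time_scale=1.0, energy_scale=1.0):
    """Euclidean distance from the origin (zero time, zero energy).

    The scale arguments are hypothetical normalisation factors; the
    article does not state how the two axes are weighted.
    """
    return math.hypot(elapsed_seconds / time_scale, joules / energy_scale)

# Hypothetical run: 80 W average draw for half a second.
joules = energy_joules(80.0, 0.5)
print("energy (J):", joules)
print("distance:", distance_from_perfect(0.5, joules))
```

A smaller distance means a run is closer to the ideal of zero time and zero energy. Without normalisation the larger-valued axis dominates the result, which is why the scale factors are exposed as parameters.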
Debian and FreeBSD approaches to SMT and power management
SMT management
Debian and FreeBSD take two very different approaches to managing SMT modes. Debian sets the SMT mode based on load, and when additional threads are needed, Debian appears to use a "skip" approach where instead of choosing the next logical CPU, it chooses the next physical core. FreeBSD sets SMT4 full-time, and appears to choose the next logical CPU for additional thread requirements. This results in different power use and elapsed time profiles:
The difference in power consumption is so significant that an entire wiki entry is now dedicated to it: FreeBSD Debian Power Management. In summary, at all thread counts, Debian is faster than FreeBSD, but at a cost of higher energy use except when running a single thread. The difference at low thread counts is significant. At high thread counts, FreeBSD uses 1/3 (33%) less energy while only taking 17% longer. Although at low thread counts the lower performance of FreeBSD can be attributed to the use of SMT4 mode all of the time, this is not the full picture.
Power management
Debian defaults to power management on
cpufreq
FreeBSD defaults to power management off
powerd
Isolating Debian and FreeBSD to a single core
Additional Resources
POWER9 Performance Monitoring Unit User Guide v12
POWER CPU Memory Affinity 3 - Scheduling processes to SMT and Virtual Processors
George Neville-Neil's brief tutorial on pmcstat
İbrahim Korucuoğlu's How to Profile Applications with `pmcstat` on FreeBSD Operating System
Evaluating the Energy Measurements of the IBM POWER9 On-Chip Controller (uses Linux and hwmon)
Developer access
Raptor Computing Systems supports many different development models from bare metal access (e.g. kernel hacking) through software development. If you are interested in developing with FreeBSD and would like an account on our system, please email support@ this domain (raptorcs.com) with a description of your interests.