Power Management on Debian and FreeBSD

From RCS Wiki

This article compares power management on Debian and FreeBSD. The two operating systems use different strategies for managing SMT modes and ship different default power management configurations. This allows a close analysis of the effects on a single CPU, which can then be scaled up to multicore systems. Two analysis metrics are used: total energy consumed, and distance from "perfect computing," i.e., zero elapsed time with zero energy consumed.

Operating Environment

Hardware

The configuration of the QEMU VMs will be described where they are used. They are hosted on the bare-metal machine running Debian.

Software

Debian

perf
taskset
cpupower frequency-set -u 3.05GHz


FreeBSD

# uname -a
FreeBSD FreeBSD_BE 14.3-RELEASE FreeBSD 14.3-RELEASE releng/14.3-n271432-8c9ce319fef7 GENERIC powerpc

pmcstat
cpuset
debug.cpufreq.lowest

SMT profiling with Debian and perf

There is an extensive article on profiling SMT performance on Debian: SMT Profiling with Debian and perf. Additional information will be provided below as needed.


SMT profiling with FreeBSD and pmcstat

pmcstat

pmcstat is the standard tool for accessing Performance Monitoring Counter (PMC) events on FreeBSD. For profiling SMT, it is the equivalent of 'perf stat' above. Its syntax differs slightly from perf's but is relatively straightforward: flags first, then the command and its arguments. The man pages (pmc(3), pmcstat(8)) are helpful, and there is some built-in help. The #Additional Resources section has links to other discussions of how to use pmcstat.

Although pmcstat can be run as a regular user, root access is needed to configure FreeBSD to load the hwpmc module:

sysrc kld_list+=hwpmc

You will need to reboot (or load the module immediately with kldload hwpmc). Once that is done, pmcstat will be available.

$ pmcstat -L | wc -l
     889

The -L flag lists all of the PMC events that can be monitored. An entry in the list does not guarantee that it works, and many generic aliases do not function on the POWER9 platform. -u prints a short blurb for an entry:

$ pmcstat -u pm_inst_cmpl
pm_inst_cmpl:	Number of PowerPC Instructions that completed

pmcstat can perform either system-wide or application-specific monitoring. Application-specific monitoring is used for this profiling. There is a discussion of the application benchmark at SMT Profiling with Debian and perf.

While the application is executing, pmcstat samples the PMCs periodically. The default is -w 5: output every five seconds, with counts incremented per interval. Intervals can be fractions of a second. The -C flag enables cumulative counting. -v adds verbosity, though in practice the extra output is not always meaningful. For application sampling, -p specifies the PMC event to be monitored; repeat it once per event, as often as needed.

A typical use is to count the number of completed (not issued or canceled) instructions, here sampled once per second:

$ time  pmcstat -w 1 -v -p pm_inst_cmpl ./cmp_mp -l 12 -j 2 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
-l optarg = 12
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
# p/pm_inst_cmpl 
         7521205 
               0 
               0 
               0 
               0 number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv

          973997 
        5.79 real         0.00 user         0.01 sys

From the earlier discussion of the benchmark software, the output from the comparison code should be recognizable. Separating the pmcstat-specific output:

# p/pm_inst_cmpl 
         7521205 
               0 
               0 
               0 
               0 
...
          973997

In the first second, 7,521,205 instructions completed. The intervals that follow report zero additional completed instructions before a final count of 973,997, so the per-interval behavior is perhaps not what the event description suggests. This will be explored later.

Baseline profiling

Profiling FreeBSD with pmcstat shows both very different SMT usage and a strong hint that something is broken in the performance monitoring code. Profiling was actually quite difficult, and the results presented here should be viewed as incomplete. For example, there is no -r/--repeat functionality in pmcstat: to profile code 100 times, it must be run 100 times. To aggregate results, the user could write a script to sum the counts and compute the variance. This was not done for this article because too many other things appear broken.
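Such a repeat wrapper is not hard to sketch. Below is a minimal, hypothetical harness: the actual pmcstat invocation is commented out and replaced with placeholder counts, and the extraction assumes the output format shown above (the last line holds the cumulative count). The awk mean/variance calculation is the reusable part.

```shell
#!/bin/sh
# Hypothetical repeat harness: run the benchmark N times, keep the final
# cumulative pm_inst_cmpl value from each run, then report the mean and
# population variance of the collected counts.
runs=5
counts=""
i=1
while [ "$i" -le "$runs" ]; do
    # n=$(pmcstat -C -w 30 -p pm_inst_cmpl ./cmp_mp -l 12 -j 2 \
    #       -s CVD_OM002793.1 CVD_OM003364.1 2>/dev/null | tail -1 | awk '{print $1}')
    n=$((7500000 + i * 1000))    # placeholder values for illustration
    counts="$counts $n"
    i=$((i + 1))
done
echo "$counts" | tr ' ' '\n' | awk 'NF {
    c++; sum += $1; sumsq += $1 * $1
} END {
    mean = sum / c
    printf "mean=%.0f variance=%.0f\n", mean, sumsq / c - mean * mean
}'
```

With the placeholder counts this prints mean=7503000 variance=2000000; with the commented pmcstat line enabled it would aggregate real runs instead.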

Identical profiling to the Debian runs could not be done. There is no equivalent to "L1-dcache-prefetches" in pmcstat. In pmcstat's defense, it is not clear that PM_L1_DCACHE_RELOAD_VALID measures the same thing, whereas -p pm_ic_pref_req appears to be the correct match for PM_IC_PREF_REQ.

Additionally, the events often had to be provided in a specific order:

$ pmcstat -w 1 -p pm_inst_cmpl -p pm_cyc -p pm_ic_pref_req -p pm_run_cyc_st_mode ./cmp_mp -l 12 -j 8 -s CVD_OM002793.1 CVD_OM003364.1
pmcstat: ERROR: Cannot allocate process-mode pmc with specification "pm_run_cyc_st_mode": Invalid argument

Swap the order of -p pm_ic_pref_req and -p pm_run_cyc_st_mode:

$ pmcstat -w 1 -p pm_inst_cmpl -p pm_cyc -p pm_run_cyc_st_mode -p pm_ic_pref_req ./cmp_mp -l 12 -j 8 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
...
# p/pm_inst_cmpl   p/pm_cyc p/pm_run_cyc_st_mode p/pm_ic_pref_req 
      1416599609 1592725422                    0                0 
     16689700532 30241803692                   0                0 
...
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_8_12_0_100_nuc.csv

    688749406117 879561503804                   0                0 

pmcstat appears to lose track of the PMC counters:

$ pmcstat -w 2 -p pm_inst_cmpl -p pm_cyc -p pm_run_cyc_st_mode -p pm_run_cyc_smt4_mode -p pm_ic_pref_req ./cmp_mp -l 12 -j 2 -s CVD_OM002793.1 CVD_OM003364.1
...
# p/pm_inst_cmpl   p/pm_cyc p/pm_run_cyc_st_mode p/pm_run_cyc_smt4_mode p/pm_ic_pref_req 
         7513021   17473694                    0               17473651                0 
               0          0                    0                      0                0 
               0          0                    0                      0                0 
               0          0                    0                      0                0 
            7869      24396                    0                  25015                0 
...
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv

      4591758 5251488732                    0             5251487575                0

Several sample instances show no additional completed instructions after the previous sampled time (in this case, every two seconds).

Note that the final totals for the categories are radically different from the sums of the individual intervals. An initial thought is that the interval sampling reports "per thread" counts while the final number is across all threads. Recall that this is a fixed-calculation benchmark. While testing Debian, the total instruction count was always 56.6 billion +/- 200 million; the total above is 4.6 billion instructions. Additionally, when run with 32 threads:

$  pmcstat -w 2 -p pm_inst_cmpl -p pm_cyc -p pm_run_cyc_st_mode -p pm_run_cyc_smt4_mode -p pm_ic_pref_req ./cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1
...
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_32_12_0_100_nuc.csv
# p/pm_inst_cmpl   p/pm_cyc p/pm_run_cyc_st_mode p/pm_run_cyc_smt4_mode p/pm_ic_pref_req 
13928314726504022640 9893848149867607943                    0    9893848149867439008                0 

the result is a remarkable 13.9 quintillion instructions. This loss of PMC counters appears to occur around six threads:

threads  instructions    cycles          insn/cycle
1        8448738         19074411        0.44
2        8485336         19281511        0.44
4        8723238         19391797        0.45
6        588499149317    648675753231    0.91
8        1411229659037   1472885892760   0.96

Therefore, some inferences have to be made about the results, and again the disclaimer applies: this analysis must be viewed as incomplete.

Benchmarking prefetching and instructions per cycle

Even with the loss of accuracy in the results, some analysis can be done. Rather than running the comparison benchmark 100 times, only a single run was done for each thread count. As pmcstat does not appear to have a flag to report elapsed time, the entire command was run under "time." Running under the monitor made a small but measurable difference in overall performance: on Debian, the benchmark ran faster under perf, while on FreeBSD it ran slower under pmcstat.

A typical benchmark run looked like:

$ time pmcstat -C -w 30 -p pm_inst_cmpl -p pm_cyc -p pm_run_cyc_st_mode -p pm_run_cyc_smt4_mode -p pm_ic_pref_req ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
num args: 9
-l optarg = 12
multithreading disabled
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
# p/pm_inst_cmpl   p/pm_cyc p/pm_run_cyc_st_mode p/pm_run_cyc_smt4_mode p/pm_ic_pref_req 
         8460687   19045917                    0               19043993                0 
       20.75 real         0.00 user         0.01 sys

That samples completed instructions, cycles, cycles run in SMT1 mode, cycles run in SMT4 mode, and instruction cache prefetches. The sampling interval was set longer than the benchmark would run, and cumulative results were displayed (which should be redundant, but was included for accuracy). A high degree of variance was observed during testing, and it was difficult to reliably repeat results, particularly when the thread count exceeded the core count. This strongly suggests a high degree of variability in switching threads between cores. Some results were rejected and the benchmark rerun; perf's -r/--repeat functionality would be a welcome addition to pmcstat. Instructions per cycle were calculated manually.
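The manual insn/cycle calculation is just the ratio of the first two cumulative fields. As a sketch (the field positions are an assumption based on the output format shown above), it can be pulled out of the final pmcstat line with awk:

```shell
# Compute instructions per cycle from the final cumulative line of the
# pmcstat output above: field 1 is pm_inst_cmpl, field 2 is pm_cyc.
echo "         8460687   19045917                    0               19043993                0" |
    awk '{ printf "insn/cycle = %.2f\n", $1 / $2 }'
```

For the sample line above this reports insn/cycle = 0.44, matching the single-run figures in the baseline table.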

Just to establish a comparison of raw performance, the benchmark was run without performance monitoring on both Debian and FreeBSD (single sample, displaying real time):

Threads  Debian  FreeBSD
1        11.21s  20.74s
2        5.79s   10.91s
8        1.46s   2.90s
16       1.11s   1.60s
32       0.69s   0.84s

Immediately obvious is a wide disparity in performance at low thread count, which closes at high thread count.

Across all of the test runs, pmcstat always reported 0 for pm_run_cyc_st_mode and 0 for pm_ic_pref_req, while reporting pm_run_cyc_smt4_mode counts nearly identical (to two decimal places) to 100% of the cycle count. Without specifically examining the POWER9 initialization code, it appears that all of the cores are brought up in SMT4 mode, and no de-tuning to SMT1 mode is done under light loads.

The graphs tell the story:

FreeBSD time insn cycle.png
FreeBSD vs Debian.png

There is no uniform decrease in instructions per cycle on FreeBSD, and while the elapsed time improves with increased threads, FreeBSD is always behind Debian.

Turning off and on SMT4

There does not appear to be any userland utility for setting the SMT mode in FreeBSD. No testing was done on different modes.

Power consumption for FreeBSD in always-on SMT4 mode

As documented earlier, some of the benchmark runs are too short to obtain reliable power readings, so the long benchmark comparing EBV against chromosome 1 was used. FreeBSD does not have a userland equivalent of "sensors," so only the OpenBMC values were used. Idle power consumption was measured at 45 watts; maximum power consumption was 83-84 watts.

FreeBSD Power Elapsed Time.png

Notable is the linear increase in power through 32 threads, with non-linear decreases in elapsed time.

Analysis Metrics

Total energy consumed

Geometric distance from perfect computing
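Neither metric is formalized above. One plausible reading, treating energy as average power times elapsed time and the distance metric as the Euclidean distance of the (time, energy) point from the origin (the "perfect computing" point of zero time and zero energy), can be sketched with awk. The 84 W and 0.84 s inputs are the 32-thread FreeBSD figures from above; note that in practice the two axes would need normalization onto comparable scales before the distance is meaningful, which is an assumption left open here.

```shell
# Sketch of the two analysis metrics, assuming:
#   energy   = average power * elapsed time
#   distance = sqrt(time^2 + energy^2)   # from "perfect computing" at (0, 0)
# Inputs: ~84 W and 0.84 s elapsed (32-thread FreeBSD run above).
power=84      # watts
elapsed=0.84  # seconds
awk -v p="$power" -v t="$elapsed" 'BEGIN {
    e = p * t                       # joules
    d = sqrt(t * t + e * e)         # unnormalized distance from the origin
    printf "energy = %.2f J, distance = %.2f\n", e, d
}'
```

For these inputs the energy term (70.56 J) dominates the distance, which is exactly why normalization of the axes matters for the comparison.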

Debian and FreeBSD approaches to SMT and power management

SMT management

Debian and FreeBSD take two very different approaches to managing SMT modes. Debian sets the SMT mode based on load, and when additional threads are needed, it appears to use a "skip" approach: instead of choosing the next logical CPU, it chooses the next physical core. FreeBSD runs in SMT4 full-time and appears to choose the next logical CPU for additional thread requirements. This results in different power use and elapsed time profiles:

FreeBSD vs Debian Two.png

The difference in power consumption is so significant that an entire wiki entry is now dedicated to it: FreeBSD Debian Power Management. In summary, at all thread counts Debian is faster than FreeBSD, but at the cost of higher energy use except when running a single thread. The difference at low thread counts is significant. At high thread counts, FreeBSD uses one third (33%) less energy while taking only 17% longer. Although FreeBSD's lower performance at low thread counts can be attributed to running in SMT4 mode all of the time, this is not the full picture.

Power management

Debian defaults to power management on (via the cpufreq subsystem and its scaling governors).

FreeBSD defaults to power management off (powerd is not enabled by default).

Isolating Debian and FreeBSD to a single core
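One way to approach this isolation is to pin the benchmark to a single physical core with each OS's affinity tool: taskset on Debian and cpuset on FreeBSD. The sketch below assumes an SMT4 topology where logical CPUs 0-3 share physical core 0; verify the actual mapping with lscpu on Debian or cpuset -g on FreeBSD before relying on it.

```shell
# Debian: restrict the benchmark to the four SMT threads of core 0
# taskset -c 0-3 ./cmp_mp -l 12 -j 4 -s CVD_OM002793.1 CVD_OM003364.1

# FreeBSD: the cpuset(1) equivalent
# cpuset -l 0-3 ./cmp_mp -l 12 -j 4 -s CVD_OM002793.1 CVD_OM003364.1
```

With both operating systems confined to the same core, the SMT-mode and scheduler differences discussed above could be separated from core-hopping effects.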

Additional Resources

POWER9 User Manual v21

POWER9 Performance Monitoring Unit User Guide v12

POWER CPU Memory Affinity 3 - Scheduling processes to SMT and Virtual Processors

https://www.ibm.com/docs/en/linux-on-systems?topic=linuxonibm/performance/tuneforsybase/smtsettings.htm

George Neville-Neil's brief tutorial on pmcstat

İbrahim Korucuoğlu's How to Profile Applications with `pmcstat` on FreeBSD Operating System

Evaluating the Energy Measurements of the IBM POWER9 On-Chip Controller (uses Linux and hwmon)

Developer access

Raptor Computing Systems supports many different development models from bare metal access (e.g. kernel hacking) through software development. If you are interested in developing with FreeBSD and would like an account on our system, please email support@ this domain (raptorcs.com) with a description of your interests.