SMT profiling with Debian and perf

From RCS Wiki
Jump to navigation Jump to search

This article discusses profiling simultaneous multithreading (SMT) on the POWER9 architecture, using little-endian Debian with perf.

The knowledge presented here was derived from a variety of sources which can be found in the #Additional Resources section.

Simultaneous multithreading (SMT) principles

SMT is "multi-thread" not "multi-core"

This is an important distinction. SMT is a technology that increases throughput of instructions through parallelization where there are under-used CPU components. While SMT4 can support four threads per core and SMT8 can support eight threads per core, this is not an additional three and seven cores, respectively. There are trade-offs and benefits. Per-thread performance declines with increasing utilization of SMT levels, but overall performance and power consumption efficiency increase. An "eight core SMT4-capable system" will have eight cores and can execute up to 32 threads in SMT4 mode, or only eight threads in SMT1 mode. For tasks that can be parallelized, such as idle background daemons waking up or with pthreads to divide data analysis into segments, overall execution time will decrease. However, the performance of each thread in SMT4 mode will be slower than the performance of each thread in SMT1 mode, and the effect is non-linear. SMT4 will take more than 1/4 of the time it would have taken in SMT1 mode. The SMT mode during execution makes a difference on the overall performance. Note that IBM did not market SMT as "multi-core," while several media sites conflated SMT with increased core count.

SMT re-uses spare resources

Superficially, SMT re-uses over-provisioned resources to increase total throughput, at the expense of individual thread performance. For example, a POWER9 core provides

"Support for up to 40 predicted taken branches in-flight,STmode. Twenty predicted taken branches per thread in SMT2 mode and ten predicted taken branches per thread in SMT4 mode"

(page 31 POWER9 Performance Monitoring Unit User Guide v12).

Section 5.2 Partitioning of Resources in Different SMT Modes of the POWER9 Processor User's Manual (page 139) shows how different execution units are shared in different SMT modes.

Section 5.2 P9 UM.png

This sharing allows for overall increased throughput, but not as if there were three additional cores. It's more like four people carrying eight bricks by hand each instead of one person moving forty at a time with a wheelbarrow. Over a given period of time, four people with lighter loads and moving faster will deliver more bricks, but the per-person average is lower. And the math in the analogy above is correct - some compromises are made to achieve the higher overall throughput.

For example, one important aspect to enabling this extra throughput is that instruction prefetching is turned off in SMT4 mode. Page 338 of the Power9 Processor User's Manual states

An instruction prefetch mechanism is used to fetch additional lines after an I-cache miss is detected. The instruction prefetcher uses the I-cache miss history to make decisions about the depth of prefetching on a per miss basis, ranging from 0 - 7 lines ahead. The mechanism is not active in SMT4 mode. The bandwidth of the instruction prefetcher is extended when the sibling core is inactive.

If there are two hundred bricks to be moved, four people carrying eight at a time are a faster option if they can do six trips in the time it takes one person to load a wheelbarrow and move it five times. However, that might not be the best way to do things when there are only twenty bricks to move and the four people have to wait to be told what to do next each trip because they can't look ahead. It's not four people with four wheelbarrows. It's one person with a wheelbarrow and a plan, or four people with two hands each but no plans. (It's just a superficial analogy to convey the concepts and dispel the "multi-core" myth erroneously attributed to SMT.)

Performance Monitor Counters (PMCs)

PMCs are essential for profiling your code. The full breadth of PMCs is beyond the scope of this article, but essentially they count everything. For example, if your code regularly accesses memory, PMCs can tell you how often each data cache has a miss and needs to load a pipeline, which will cause a stall. This might lead you to reorganizing your data. If you can't measure it, you can't objectively know if you are making improvements.

The POWER9 CPU has a vast number of countable events in the six PMCs ("The POWER9 core supports in excess of 1000 events," page 34 POWER9 Performance Monitoring Unit User Guide v12). Each SMT thread-core has a set of six PMCs. In SMT1 mode, each core has one set of PMCs, and in SMT4 mode they have four. In this article we will focus on measuring how often the cores are in SMT4 mode and the impact on performance.

Documentation on the POWER9 PMCs is at POWER9 Performance Monitoring Unit User Guide v12. Of particular note is the discussion of the POWER9 core and its functional units as discussed in Unit 4, and the organization of PMC events as discussed in Section 5.9, CPI Stack Events.

Operating Environment

Hardware

The bare metal hardware for Debian is an eight core SMT4-capable Talos II system.

# lscpu
Architecture:                ppc64le
  Byte Order:                Little Endian
CPU(s):                      32
  On-line CPU(s) list:       0-31
Model name:                  POWER9 (raw), altivec supported
  Model:                     2.3 (pvr 004e 1203)
  Thread(s) per core:        4
  Core(s) per socket:        8
  Socket(s):                 1
  Frequency boost:           enabled
  CPU(s) scaling MHz:        67%
  CPU max MHz:               3800.0000
  CPU min MHz:               2166.0000
Caches (sum of all):         
  L1d:                       256 KiB (8 instances)
  L1i:                       256 KiB (8 instances)
NUMA:                        
  NUMA node(s):              1
  NUMA node0 CPU(s):         0-31
Vulnerabilities:             
  Gather data sampling:      Not affected
  Indirect target selection: Not affected
  Itlb multihit:             Not affected
  L1tf:                      Mitigation; RFI Flush, L1D private per thread
  Mds:                       Not affected
  Meltdown:                  Mitigation; RFI Flush, L1D private per thread
  Mmio stale data:           Not affected
  Reg file data sampling:    Not affected
  Retbleed:                  Not affected
  Spec rstack overflow:      Not affected
  Spec store bypass:         Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:                Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
  Spectre v2:                Mitigation; Software count cache flush (hardware accelerated), Software link stack flush
  Srbds:                     Not affected
  Tsx async abort:           Not affected

# lsmem
RANGE                                 SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x00000007ffffffff  32G online       yes  0-31

Memory block size:         1G
Total online memory:      32G
Total offline memory:      0B

# facter | grep virtual
is_virtual => false
virtual => physical


Debian identifies the hardware as having "32 cores." This is not correct; it is eight POWER9 CPUs with SMT4 capability.

Operating Systems

Two environments will be used for the operating environment. One is the bare metal systems as discussed above, and a QEMU VM to be discussed later, both of which runs Debian:

# uname -a
Linux 93-224-155-23 6.1.0-37-powerpc64le #1 SMP Debian 6.1.140-1 (2025-05-22) ppc64le GNU/Linux

Profiling software

perf

perf is part of the package linux-perf. It does need to be run as root. "perf" is the overall front-end to the commands it orchestrates:

# perf --help

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   daemon          Run record sessions on background
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   iostat          Show I/O performance metrics
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   kwork           Tool to trace/measure kernel work properties (latencies)
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   version         display the version of perf binary
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.

In addition to the 'perf help COMMAND' there is 'man perf', which will lead to other man pages:

#man perf
...
SEE ALSO
       perf-stat(1), perf-top(1), perf-record(1), perf-report(1), perf-list(1)

'perf list' gives the PMC events that can be monitored:

# perf list

List of pre-defined events (to be used in -e or -M):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

...

The default profiling is efficient. Using cmp_mp above, with a minimum length of twelve consecutive nucleotides and using eight threads with a silent output,

# perf stat ./cmp_mp -l 12 -j 8 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
-l optarg = 12
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 8, map_size = 7441 (0x1d11), slice_mod 0, against_sz 29764
number of sequences: 225
longest sequence: 2848 ([19475, 19475), (22322, 22322)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_8_12_0_100_nuc.csv

 Performance counter stats for './cmp_mp -l 12 -j 8 -s CVD_OM002793.1 CVD_OM003364.1':

         11,299.17 msec task-clock               #    7.723 CPUs utilized          
             1,305      context-switches         #  115.495 /sec                   
               107      cpu-migrations           #    9.470 /sec                   
               129      page-faults              #   11.417 /sec                   
    42,267,506,274      cycles                   #    3.741 GHz                      (33.42%)
        85,292,379      stalled-cycles-frontend  #    0.20% frontend cycles idle     (50.16%)
    25,916,474,922      stalled-cycles-backend   #   61.32% backend cycles idle      (16.75%)
    56,555,412,611      instructions             #    1.34  insn per cycle         
                                                 #    0.46  stalled cycles per insn  (33.49%)
    12,217,316,310      branches                 #    1.081 G/sec                    (50.07%)
       758,041,604      branch-misses            #    6.20% of all branches          (16.56%)

       1.463125416 seconds time elapsed

      11.307016000 seconds user
       0.001290000 seconds sys

It took 1.46 seconds to run the analysis, during which time slightly less than eight CPUs were fully utilized, with a measured 1.34 instructions per cycle.

perf record and perf annotate are two additional useful features. For production runs, cmp_mp is compiled without symbols, but for debugging it is. Using only a single thread with the debugger version,

# perf record ./cmp_mp_sym -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
num args: 9
-l optarg = 12
multithreading disabled
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.724 MB perf.data (44906 samples) ]

This can be analyzed with perf annotate:

Perf ann cmp mp sym.png

This immediately identifies two stalls; one ld (load doubleword) and one std (store doubleword), associated with the line of code

map->comparisons++;

A quick check of the code shows this is inside the comparison loop and so executes with every single comparison. This is a major bottleneck which can be addressed by altering from a memory-resident variable to a local (automatic) variable, with the final value copied to its destination. This changes the process from reading from and writing to memory to incrementing a register, of which there are plenty. Although not explored in this article, the said change resulted in a 12% gain in performance on four different RISC-based CPUs (SG2000, ARMv7, PowerPC G4, and POWER9), but cost 12% in performance on a CISC-based i7 (MacBook Pro). As it were, the C11 standard prevents compilers from making this optimization transparently, because the struct "map" might be altered in a different location. Therefore the programmer needs to know what hardware the code is running on in order to ensure maximum optimization. Perhaps this shall be discussed at another time...

Benchmark code

Description of comparison code (cmp_mp)

A lengthy description of the benchmark code is included in order to show the relevance of this workload. Unlike many benchmarks that test specific functions and might or might not highlight specific performance, this benchmark is a real-world, actual workload with high memory access on a byte-by-byte basis. You can skip ahead to #SMT profiling with Debian and perf if the details are not relevant to your needs.

The code being profiled and used for benchmarking is the genomic comparison code that the M. P. Janson Institute for Analytical Medicine uses to look for molecular mimicry between pathogens and human tissue and hormones. This code uses a variety of techniques to compare nucleotide and amino acid sequences. There are both byte-by-byte and vector versions. The raw source code can be found at TBD

Fundamentally the algorithm just compares characters in two sequences for matching. The goal is to match consecutive characters between two sequences with constantly changing characters, and record the string and location of the matches. Unlike strstr or similar functions, the string to be matched is not known in advance. The first character in one sequence is compared against the first character in the other sequence. If it matches, the second characters are compared. If they match, the third characters are compared, and so on, until the characters do not match. If the length of the matched string is below the minimum desired length, the match is discarded and the comparisons start over at the next character. This can "slide" the comparisons relative to each other, so if the first four characters match but not the fifth, the comparison starts over at the first character of one sequence and the fifth character of the other. If there is a match of sufficient length, that two-dimensional region is marked as an "exclusion region" and not searched again.

The code can run N number of user-specified pthreads in parallel (-j N), by slicing one of the two sequences into smaller pieces and stitching the results together. N must be an even number (one forward, one reverse). Some matches are lost at the boundaries between slices when one or both parts are less than the minimum length. The execution time is extremely sensitive to comparisons made; adding checks to determine if a match falls on a slice boundary adds significant overhead. For large sequences, the odds of a matching string falling on a boundary are very low; for small sequences, running fewer threads is effective without costing significant time delays. For absolute accuracy, run only two threads (one forward and one in reverse).

Description of data

(Skip ahead to #SMT profiling with Debian and perf)

Nucleotides are denoted as A, T, C, and G. There are three nucleotides in an amino acid, and 23 amino acids in humans, which leads to "isocodon" amino acids. These have the same function but different nucleotide combinations. There are amino acids that denote the initiation and termination of a gene. The amino acid is determined by its offset from the initiation amino acid (ATG, methionine), which starts an "Open Reading Frame (ORF)." ORFs can overlap. DNA has a forward ("sense") and backward ("anti-sense") direction, so there are three possible amino acids reading forward, and three when reading backwards. Nucleotide sequences are usually published in the forward direction, and the other strand's nucleotide matching the forward direction is inferred (A matches T, and C matches G, and vice versa). Genome sequences can be obtained from the NIH National Library of Medicine gene database, as well as in other places. The particular sequence is identified by its "accession number," and sequencing of the same organism by different labs does lead to different accession numbers. Some sequences are "canonical," meaning the generally accepted sequence that is often an assembly of accessions, and others are "contributions," meaning independently obtains sequences submitted for recording.

A typical "FASTA" file contains data as

TATATTAGAGTAGGAGCTAGAAAATCAGCACCTTTAATTGAATTGTGCGTGGATGAGGCTGGTTCTAAATCACCCATTCAGTACATCGATATCGGTAATTATACAGTT...

The most common comparison method is BLAST, which uses a database of pre-identified nucleotide sequences. This is useful for rapidly finding large matches, but does not offer granularity for exploring other aspects of the genome. The above method allows matching of nucleotide and amino acid sequences on both position one and position two nucleotide. This often reveals "repeat motifs" where a sequence of nucleotides is repeated multiple times within a pathogen. The current thinking is that this is an "immuno-evasive strategy," allowing the pathogen to escape detection by the host's immune system. Additionally, by recording the location of the matches, the distance between matches can be calculated. This is important in deciding if a host antibody would identify healthy tissue and hormones as pathogenic, by having similar shapes. The above algorithm also allows for detecting "SIDs," which are "Substitutions, Insertions, and Deletions," which occur regularly in rapidly evolving pathogens such as viruses.

For example, relative to the example above

TTATATTAGAGTAGGAGCTAGAAAATCAGCACCTTTAATTGAATTGTGCGTGGATGAGGCTGGTTCTAAATCACCCATTCAGTACATCGATATCGGTAATTATACAGTT...

is identical except for the inserted nucleotide T at the very first character. If the ORF is prior to the sequence above, this changes the amino acid structure by altering the anchor point for examining three nucleotides at a time (a "codon"). A straight matching of amino acids would miss this. The comparison code (cmp_mp) has the ability to shift each nucleotide sequence by none, one and two positions to catch insertions that alter the identifiable amino acids, and can match against the first character ("T" in "TAA") or the second nucleotide ("A" in "TAA"), which is more highly conserved across evolutionary pathways.

Sample output

(Skip ahead to #SMT profiling with Debian and perf)

The output is a csv file:

CVD_OM003364.1,CVD_OM002793.1,225,2848,29762,29762,1,1,
1,1,163,163,164,
...

The first line indicates the sequences compared, the number of matches, the longest match, the length of each sequence, and the presentation form, in this case a 1x1 image. The second line indicates a match was found from (1, 163) in one sequence and (1, 163) in the other, for a length of 164 characters, inclusive.

The csv file is run through some custom image generation code using libjpeg. An example of the processed data is below

Mp Bmiyamotoi CP006647.2 vs Bburgdorferi GCF 000171735.2 20 1 100 nuc.gif

The image shows matching sequences between two bacteria, B. burgdorferi, which causes Lyme disease, and B. miyamotoi, a bacteria in the same Borrelia family. Matches are occurring in both the forward and reverse directions. In this particular case, the B. miyamotoi sequence was assembled from fragments in both the "sense" and "antisense" direction. This is an artifact of how PCR is done, but has to be considered when doing analysis. Running only forward matching can lead to missed matches (a "false negative"). This comparison is particularly important because some people will exhibit symptoms of Lyme disease but only test weakly positive or not at all for Lyme disease. This is known in science but not practiced in medicine, leading to erroneous conclusions about the cause of the patient's symptoms (see Borrelia miyamotoi infection leads to cross-reactive antibodies to the C6 peptide in mice and men).

The B. burgdefori sequence is 1,455,375 nucleotides long, and the B. miyamotoi sequence is 907,293 nucleotides long. Each nucleotide is compared at least once, and when there are partial matches too short to place the nucleotide into an exclusion region, it may be compared more than once. In this case, more than 2.6 trillion comparisons are made (one comparison is forward against forward, and one is forward against reverse; reverse against reverse is the same as forward against forward because of the inverse nucleotide arrangement of DNA). When searching human genes, the comparisons can be on the order of quadrillions (10^15) and take hours even on fast CPUs even with multi-core threading. Optimizing the comparison code saves time and power consumption, and is the focus of the profiling done below.

Baseline comparison

(Skip ahead to #SMT profiling with Debian and perf)

For these benchmarks, two different sequences of Epstein-Barr (EBV) and SARS-CoV-2 (Covid-19) virii will be used. EBV is the virus that causes mononucleosis, but is also closely associated with multiple sclerosis and has identifiable molecular mimicry sequences. SARS-CoV-2, the virus that causes Covid-19, evolved from a coronavirus, as can be shown by a comparison of amino acids (nucleotide shifts of none for CRNA and one for Covid-19, anchor on second position, match initiation, termination and arginine amino acids):

CVD OM002793.1 vs CRNA KU131570.1 2 12 0 100 R2.gif

The COVID-19 sequences, OM002793.1 and OM003364.1 are suitable for short duration benchmarks. The EBV sequences, KC207814.1 and NC_007605.1 (canonical), are for longer benchmarks in which slight performance gains are measured. Each pair member in the two pairs of sequences is nearly identical to the other, so when analysis of branch prediction for misses is needed, one from each pair can be compared against the other.

A typical comparison command looks like

./cmp_mp -l 12 -j 8 -s CVD_OM002793.1 CVD_OM003364.1

It is specifying a minimum length of 12 consecutive nucleotide matches and four forward slices and four reverse slices, with the two Covid-19 sequences. For benchmark clarity, the -s (silence) flag is set. The output for only two threads:

$  time ./cmp_mp -l 12 -j 1 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
-l optarg = 12
Number of jobs must be even; adding +1 to n_jobs
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
        5.80 real        11.19 user         0.00 sys

This is the most accurate baseline comparison. With two threads, there are no slices. For absolute single-threaded execution, add -T to the flags.

To show the performance for additional threads, increase the number of jobs to the core count (4):

$  time ./cmp_mp -l 12 -j 4 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
-l optarg = 12
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 4, map_size = 14882 (0x3a22), slice_mod 0, against_sz 29764
number of sequences: 223
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_4_12_0_100_nuc.csv
        2.91 real        11.21 user         0.00 sys

A change to -j 16, the maximum thread-core count, yields

$  time ./cmp_mp -l 12 -j 16 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
-l optarg = 12
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 16, map_size = 3720 (0xe88), slice_mod 4, against_sz 29764
number of sequences: 226
longest sequence: 2845 ([19475, 19475), (22319, 22319)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_16_12_0_100_nuc.csv
        1.05 real        15.39 user         0.03 sys

Notice the impact of slicing - there are more matching sequences, and the longest sequence is shorter.

There can be a minor performance benefit by double-loading the thread-core count:

$  time ./cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
-l optarg = 12
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 32, map_size = 1860 (0x744), slice_mod 4, against_sz 29764
thread[1] had no matches
number of sequences: 234
longest sequence: 1860 ([13020, 13020), (14879, 14879)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_32_12_0_100_nuc.csv
        1.02 real        15.60 user         0.02 sys

Post-processing output:

Mp CVD OM003364.1 vs CVD OM002793.1 18 0 100 nuc.gif

For completeness, the EBV against EBV comparison:

Mp EBV KC207814.1 vs EBV NC 007605.1 12 0 100 nuc.gif

It shows widely spread "repeat motifs" that are shared among different strains. The same (but different) sequences are found repeatedly within the whole organism's genomic sequence. The function of repeat motifs is not completely understood, but may provide a shape or sequence that helps evade host immune system responses. EBV is a member of the Herpes family, remains latent in B-cells, and can reactivate periodically, just like varicella zoster (Shingles/chickenpox) and oral and genital herpes simplex.

SMT profiling with Debian and perf

Baseline profiling

To establish a baseline, we will use the Debian OS on bare metal as described earlier, and to focus on the areas where SMT affects performance, we will use six specific events: instructions, cycles, L1-icache-prefetches, L1-dcache-prefetches, pm_run_cyc_st_mode, and pm_run_cyc_smt4_mode. This should cover stalls due to cache-misses and how often the cores are in SMT1 and SMT4 mode.

In perf, the -e flag specifies the events of interest, but can be listed together if separated by commas. Specifying any events with -e means all events to be studied have to be specified. perf stat also has a very nice feature, -r/--repeat N, which allows to repeating the profiled code N times and collecting the average values with variance. The following profiling uses -r 100 and includes seven events of interest:

# perf stat -e task-clock,instructions,cycles,L1-icache-prefetches,L1-dcache-prefetches,pm_run_cyc_st_mode,pm_run_cyc_smt4_mode -r 100 ./cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1
num args: 8
-l optarg = 12
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
...
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_32_12_0_100_nuc.csv

 Performance counter stats for './cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1' (100 runs):

         21,446.21 msec task-clock              #   30.779 CPUs utilized     ( +-  0.02% )
    56,941,217,151      instructions            #    0.74  insn per cycle    ( +-  0.02% )  (49.73%)
    77,085,115,514      cycles                  #    3.610 GHz               ( +-  0.02% )  (66.55%)
           162,324      L1-icache-prefetches    #    7.603 K/sec             ( +-  0.52% )  (83.31%)
           251,264      L1-dcache-prefetches    #   11.768 K/sec             ( +-  0.32% )  (67.09%)
        55,370,664      pm_run_cyc_st_mode      #    2.593 M/sec             ( +- 13.46% )  (83.63%)
    77,209,022,034      pm_run_cyc_smt4_mode    #    3.616 G/sec             ( +-  0.04% )  (33.05%)

           0.69678 +- 0.00135 seconds time elapsed  ( +-  0.19% )

The inference is that with 100 samples, differences beyond the variances are highly likely to be real. Note that the instructions per cycle has dropped dramatically from the previous 1.34 to 0.74, while the task-clock went from 7.723 utilized CPUs to 30.779 CPUs. Also note that even though four times more threads were used in parallel, the elapsed time only dropped from 1.46s to 0.70s, a factor of 0.48, not 0.25.

The data below is from increasing the thread count on the benchmark comparison and recording the results.

Benchmarking prefetching and instructions per cycle

SMT4 turns off instruction cache prefetching. Page 338 of the POWER9 Processor User's Manual states

An instruction prefetch mechanism is used to fetch additional lines after an I-cache miss is detected. The instruction prefetcher uses the I-cache miss history to make decisions about the depth of prefetching on a per miss basis, ranging from 0 - 7 lines ahead. The mechanism is not active in SMT4 mode. The bandwidth of the instruction prefetcher is extended when the sibling core is inactive.

This can be seen in the graph below:

Thread count versus instruction cache prefetches and instructions per cycle

The instructions per cycle drop off in a beneficially non-linear response to decreases in prefetching instructions. During the benchmark run with less than eight threads being satisfied by remaining in SMT1 mode, the cores are executing on average 1.33 instructions per cycle (read on the right-hand y-axis), and approximately 1.2 million total i-cache prefetches (read on the left-hand axis). At 32 threads, instructions per cycle have dropped approximately 43%, to 0.74 instructions per cycle, while total i-cache prefetches have dropped by 86% to approximately 165,000.

Note that data cache prefetching actually increases with additional thread count, up to a maximum at 32 threads.

Below is the percent of time spent in SMT[1|2|4] mode on the 8 core bare metal machine as a function of thread count. The graph on the left shows the percent on a linear axis, while the graph on the right shows the percent on a log (base 10) axis. The log axis smooths out the drop in percent, but it is important to remember that the change in percent is non-linear. Zero is represented as "0.001" on the log scale, rather than omitted. Percentages were determined by using the total counts returned by pm_run_cyc_st_mode and pm_run_cyc_smt4_mode divided by the total cycles. In some cases >100% were calculated, but rounded down to 100%. SMT2 is inferred by 100-(% time in SMT1)-(% time in SMT4) since there is not a specific event counter for it.

thread count vs. % SMTthread count vs. % SMT

As can be seen in the three graphs, there are distinct changes in modes at 8,16, and 32, corresponding to the necessity of transitioning from SMT1 to SMT2 at demands above eight threads, and against from SMT2 to SMT4 at demands above sixteen threads. There is not perfect 100% SMT1 usage below eight threads, due to scheduling of background processes.

Turning off and on SMT4

The package powerpc-ibm-utils has several useful components:

# apt info powerpc-ibm-utils
Package: powerpc-ibm-utils
Version: 1.3.10-2
Priority: important
Section: utils
Source: powerpc-utils
Maintainer: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Installed-Size: 1,804 kB
Depends: bash (>= 3), bc, libc6 (>= 2.34), libnuma1 (>= 2.0.11), librtas2 (>= 1.3.6), librtasevent2 (>= 1.3.6), zlib1g (>= 1:1.1.4)
Suggests: ipmitool, sysfsutils
Homepage: http://powerpc-utils.ozlabs.org/
Download-Size: 303 kB
APT-Manual-Installed: yes
APT-Sources: http://deb.debian.org/debian bookworm/main ppc64el Packages
Description: utilities for maintenance of IBM PowerPC platforms
 The powerpc-ibm-utils package provides the utilities listed below
 which are intended for the maintenance of PowerPC platforms. Further
 documentation for each of the utilities is available from their
 respective man pages.
 .
  * nvram - NVRAM access utility
  * bootlist - boot configuration utility
  * ofpathname - translate logical device names to OF names
  * snap - system configuration snapshot
  * ppc64_cpu - cpu settings utility

Within ppc64_ppc is an ability to set the maximum SMT mode:

# ppc64_cpu --smt=1
# lscpu
Architecture:                ppc64le
  Byte Order:                Little Endian
CPU(s):                      32
  On-line CPU(s) list:       0,4,8,12,16,20,24,28
  Off-line CPU(s) list:      1-3,5-7,9-11,13-15,17-19,21-23,25-27,29-31
...
# ppc64_cpu --smt=4
# lscpu
Architecture:                ppc64le
  Byte Order:                Little Endian
CPU(s):                      32
  On-line CPU(s) list:       0-31
...

Note the order of CPUs that go "offline" when SMT=1. This provides a hint into the scheduling of processes within the kernel (Linux-based Debian). This functionality is read-only on virtual machines; the actual host controls the real mode.

Altering the SMT mode allows for more profiling of SMT effects. For example, after setting SMT=2, the benchmark code with 32 threads runs

# ppc64_cpu --smt=2
# lscpu
Architecture:                ppc64le
  Byte Order:                Little Endian
CPU(s):                      32
  On-line CPU(s) list:       0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29
  Off-line CPU(s) list:      2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31
...

# perf stat -e instructions,cycles,L1-icache-prefetches,L1-dcache-prefetches,pm_run_cyc_st_mode,pm_run_cyc_smt4_mode -r 100 ./cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1
...
 Performance counter stats for './cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1' (100 runs):

    57,106,151,065      instructions           #    1.04  insn per cycle           ( +-  0.02% )  (49.90%)
    55,169,308,915      cycles                                                     ( +-  0.03% )  (66.57%)
           363,832      L1-icache-prefetches                                       ( +-  0.34% )  (83.30%)
           178,460      L1-dcache-prefetches                                       ( +-  0.30% )  (66.83%)
       619,051,411      pm_run_cyc_st_mode                                         ( +-  6.42% )  (83.39%)
                 0      pm_run_cyc_smt4_mode                                       (33.39%)

           0.99490 +- 0.00202 seconds time elapsed  ( +-  0.20% )

pm_run_cyc_smt4_mode shows zero cycles in SMT4, but 619,051,411 cycles in SMT1 mode is only 1.1% of the total cycles, justifying the previous inference that 100-SMT1-SMT2=SMT2 cycles.

Effects of SMT[1|2|4] on elapsed time and instructions per cycle

The effect of each SMT mode can be measured to obtain a clear picture of the performance gains that are available with having additional threads of lesser performance. For the tests below, the upper limit of SMT mode was set with ppc64_cpu.

Threads vs. time.png Threads vs. time cropped.png

The two graphs above show the effects of increasing thread count at different SMT modes. The total elapsed time is on the left y-axis, while the instructions per cycle (across all threads) is on the right hand y-axis. The above two graphs are the same, except the bottom graph is cropped to 2 seconds total elapsed time.

The SMT modes have identical instructions per cycle, 1.33, through eight threads, which is the same as the core count. Once the thread count gets above the core count, additional overhead is incurred. In SMT1 mode, this means threads are switched in and out of a core. The instructions per cycle remains at 1.33, with a minimum elapsed time of 1.46 seconds. For SMT2 and SMT4 modes, at the expense of instruction prefetching and instructions per cycle, the symmetric multithreading capacity utilizes function units in parallel. For SMT2 mode, the minimum elapsed time is 0.97 seconds and 1.04 instructions per cycle, at 16 threads, or twice the core count. For SMT4 mode, the minimum elapsed time is 0.75 second and 0.7 second and 0.74 instructions per cycle, at 32 threads, or four times the core count. With all three modes, there are occasional improvements but in general the peak performance is at the mode times the core count.

Metric SMT1 SMT2 SMT4
Lowest elapsed time 1.46s 0.97s 0.70s
Instructions per cycle at best performance 1.33 0.97 0.7

This is almost counter-intuitive. Fewer instructions per cycle resulted in lower total elapsed time. Recall that this is not a true "multi-core," where SMT4 means four cores. It is adding additional performance by utilizing parts of the core that are empty in SMT1 mode. More threads are being run, and therefore more throughput is obtained. It is not a perfectly linear gain, of course. The lowest elapsed time in SMT1 mode was 1.46 seconds, but 0.7 seconds on SMT4 mode, not 0.365 seconds if it were a true four core boost. This represents a 52% decrease in elapsed time, not a 75% decrease.

The second graph, on the bottom, highlights the proportionate gain in performance. By coincidence, the scales of the three modes converged when examining only eight cores and more. This shows a close correlation between fewer instructions per cycle but better performance overall.

Power consumption for SMT[1|2|4]

The obvious question becomes "Is this a free ride? What is the cost in energy for the additional 50% gain in performance?"

Power consumption can be measured through the internal sensor feature in Debian and the BMC sensors on the chassis. In this case, since some of the benchmark tests are so short, the very long benchmark of comparing EBV against chromsome 1 was used. This allows for a "steady-state" power measurement over a period of time. The bare metal machine was put into each of the three modes and the "most commonly noted" power consumption was recorded. This is not an exact average or mean quantitative analysis. The benchmark comparison would be initiated, the command 'sensors | grep W' would be executed several times during the run, and the BMC sensor would be refreshed repeatedly. If the number was consistently the same, fewer data samples would be taken, but more would be taken if variance or contrary results were noticed. The benchmark would be stopped, incremented to the next thread count, and initiated, while the results of the previous run were recorded. The sampling would start over.

Idle power consumption was measured at 32 watts. Maximum power consumption was measured at 130 watts.

Threads vs. power.png

The clear statement is that the power consumption increases linearly as additional cores are utilized, up to the maximum of eight, and that going into SMT2 and SMT4 do not cost significantly additional power. The nearly 50% decrease in elapsed time comes at a cost of only a few additional watts. This can be seen more clearly with a focused view showing only thread counts equal and greater than eight:

Threads vs power cropped.png

There is a small but measurable increase in power consumption with higher SMT modes, and more bounce in the variance. Above SMT mode level (i.e. 2X threads as thread-cores), the variance could be as high as seven watts, from a maximum of 136 watts to a low of 129 watts. This variance appeared to increase with an increase in SMT mode and the thread count, wherein the peak and low power consumption readings appeared equally as often. There were frequent thread count points where the power consumption noticeably bounced more, and even declined, with higher SMT modes and thread counts. Rather than conclude this is increased efficiency, it is possible this is greater inefficiency and the cores are idle more often during thread switching or other scheduler-related functions.

Of additional note is that the performance gain going from SMT1 to SMT2 is greater than going from SMT2 to SMT4. The power consumption appears to increase with SMT2 but decrease with SMT4. This bears further investigation at a later time.


Bare metal vs. Virtual Machines

Effects of host loads on SMT in client VMs

SMT mode is set by the host running virtual machines. To show an example of how this impacts the client VM, we will launch a QEMU VM with Debian and 1 core but four threads, and then load the host with a long benchmark that uses 30 threads. That still leaves two thread-cores for responsiveness.

The QEMU VM command:

# qemu-system-ppc64 --enable-kvm -nographic -M pseries,cap-nested-hv=true,cap-ibs=workaround -cpu host -smp cores=1,threads=4,sockets=1 -m 8G -nic user,ipv6=off,model=e1000,mac=52:54:98:76:54:32,hostfwd=tcp::2222-:22 debian.raw

The QEMU VM client:

# lscpu
Architecture:                ppc64le
  Byte Order:                Little Endian
CPU(s):                      4
  On-line CPU(s) list:       0-3
Model name:                  POWER9 (architected), altivec supported
  Model:                     2.3 (pvr 004e 1203)
  Thread(s) per core:        4
  Core(s) per socket:        1
  Socket(s):                 1
# uname -a
Linux debian 6.1.0-37-powerpc64le #1 SMP Debian 6.1.140-1 (2025-05-22) ppc64le GNU/Linux

For the baseline on the client, we use a short benchmark (run 10 times):

# perf stat -e task-clock,instructions,cycles,L1-icache-prefetches,L1-dcache-prefetches,pm_run_cyc_st_mode,pm_run_cyc_smt4_mode -r 10 ./cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1
...
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_32_12_0_100_nuc.csv

 Performance counter stats for './cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1' (10 runs):

         11,317.40 msec task-clock               #    3.985 CPUs utilized       ( +-  0.09% )
    56,565,697,976      instructions             #    1.33  insn per cycle      ( +-  0.07% )  (49.81%)
    42,562,384,883      cycles                   #    3.757 GHz                 ( +-  0.09% )  (66.73%)
         1,556,239      L1-icache-prefetches     #  137.357 K/sec               ( +-  0.97% )  (83.45%)
           394,039      L1-dcache-prefetches     #   34.779 K/sec               ( +-  1.62% )  (67.16%)
    42,694,594,123      pm_run_cyc_st_mode       #    3.768 G/sec               ( +-  0.26% )  (83.61%)
           289,256      pm_run_cyc_smt4_mode     #   25.530 K/sec               ( +-207.44% )  (33.08%)

           2.84011 +- 0.00261 seconds time elapsed  ( +-  0.09% )

Notice that the 1.33 instructions per cycle is consistent with SMT1 mode, even though the number of threads consumed 800% of the client CPUs. Occasionally a core was put into SMT4 mode, but overwhelmingly the cores were in SMT1 mode. Now we load the host machine with a long benchmark:

# perf stat -e task-clock,instructions,cycles,L1-icache-prefetches,L1-dcache-prefetches,pm_run_cyc_st_mode,pm_run_cyc_smt4_mode -r 1 ./cmp_mp -l 12 -j 30 -s chromosome_NC_000001.11 EBV_NC_007605.1

and rerun the benchmark on the client:

# perf stat -e task-clock,instructions,cycles,L1-icache-prefetches,L1-dcache-prefetches,pm_run_cyc_st_mode,pm_run_cyc_smt4_mode -r 10 ./cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1
...
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_32_12_0_100_nuc.csv

 Performance counter stats for './cmp_mp -l 12 -j 32 -s CVD_OM002793.1 CVD_OM003364.1' (10 runs):

         21,689.29 msec task-clock                #    4.009 CPUs utilized      ( +-  0.11% )
    56,755,363,259      instructions              #    0.74  insn per cycle     ( +-  0.04% )  (50.25%)
    76,981,252,278      cycles                    #    3.566 GHz                ( +-  0.07% )  (66.72%)
           210,227      L1-icache-prefetches      #    9.739 K/sec              ( +-  6.18% )  (83.21%)
         1,116,568      L1-dcache-prefetches      #   51.724 K/sec              ( +-  2.17% )  (66.64%)
                 0      pm_run_cyc_st_mode        #    0.000 /sec                              (83.47%)
    77,001,577,754      pm_run_cyc_smt4_mode      #    3.567 G/sec              ( +-  0.06% )  (33.47%)

           5.41073 +- 0.00653 seconds time elapsed  ( +-  0.12% )

Still only four CPUs utilized (with greater utilization), but the 0.74 instructions per cycle are reflective of the 100% running in SMT4 mode. The 5.4 seconds to complete the benchmark is comparable to an unloaded host running only two threads. However, this is coincidental, as even when the host was running 32 threads for the long comparison, the client executed the task in 5.4 seconds.

Conclusions

Managing SMT modes efficiently is a kernel-level task, but the programmer should know the management approach in order to maximize performance. Tools exist to lock processes to specific CPUs, although these were not examined in this article. It is key that the programmer understand that "simultaneous multi-thread" is not "multi-core." There are trade-offs for additional performance. SMT is a fascinating technology, and hopefully the reader has a greater understanding of it after reading this article and experimenting on their own.

Additional Resources

POWER9 User Manual v21

POWER9 Performance Monitoring Unit User Guide v12

POWER CPU Memory Affinity 3 - Scheduling processes to SMT and Virtual Processors

https://www.ibm.com/docs/en/linux-on-systems?topic=linuxonibm/performance/tuneforsybase/smtsettings.htm

Wikipedia article on Simultaneous Multithreading

Developer access

Raptor Computing Systems supports many different development models from bare metal access (e.g. kernel hacking) through software development. If you are interested in developing with Debian and would like an account on our system, please email support@ this domain (raptorcs.com) with a description of your interests.