Difference between revisions of "SMT profiling with pmcstat and perf"

Revision as of 10:19, 15 July 2025

This article discusses profiling symmetric multithreading (SMT) on the POWER9 architecture. It uses both Big-endian FreeBSD with pmcstat and Debian with perf.

The knowledge presented here was derived from a variety of sources which can be found in the #Additional Resources section.

This is an important distinction. SMT is a technology that increases throughput of instructions through parallelization where there are under-used CPU components. While SMT4 can support four threads per core and SMT8 can support eight threads per core, this is not an additional three and seven cores, respectively. There are trade-offs and benefits. Per-thread performance declines with increasing utilization of SMT levels, but overall performance and power consumption efficiency increase. Note that IBM did not market SMT as "multi-core," while several media sites conflated SMT with increased core count.

Comparison to RISC-V HARTs

Benchmark code

The code being profiled and used for benchmarking is the genomic comparison code that the M. P. Janson Institute for Analytical Medicine uses to look for molecular mimicry between pathogens and human tissue and hormones. This code uses a variety of techniques to compare nucleotide and amino acid sequences. There are both byte-by-byte and vector versions. The raw source code can be found at TBD

Fundamentally the algorithm just compares characters in two sequences for matching. The goal is to match consecutive changing characters and record the string and location of the matches. Unlike strstr or similar functions, the string to be matched is not known in advance. The first character in one sequence is compared against the first character in the other sequence. If it matches, the second characters are compared. If they match, the third characters are compared, and so on, until the characters do not match. If the length of the matched string is below the minimum desired length, the match is discarded and the comparisons start over at the next character. This can "slide" the comparisons relative to each other, so if the first four characters match but not the fifth, the comparison starts over at the first character of one sequence and the fifth character of the other. If there is a match of sufficient length, that two-dimensional region is marked as an "exclusion region" and not searched again.

Nucleotides are denoted as A, T, C, and G. There are three nucleotides in an amino acid, and 23 amino acids in humans, which leads to "isocodon" amino acids. These have the same function but different nucleotide combinations. There are amino acids that denote the initiation and termination of a gene. The amino acid is determined by its offset from the initiation amino acid (ATG, methionine), which starts an "Open Reading Frame (ORF)." ORFs can overlap. DNA has a forward ("sense") and backward ("anti-sense") direction, so there are three possible amino acids reading forward, and three when reading backwards. Nucleotide sequences are usually published in the forward direction, and the other strand's nucleotide matching the forward direction is inferred (A matches T, and C matches G, and vice versa). Genome sequences can be obtained from the NIH National Library of Medicine gene database, as well as in other places. The particular sequence is identified by its "accession number," and sequencing of the same organism by different labs does lead to different accession numbers. Some sequences are "canonical," meaning the generally accepted sequence that is often an assembly of accessions, and others are "contributions," meaning independently obtains sequences submitted for recording.

The most common comparison method is BLAST, which uses a database of pre-identified nucleotide sequences. This is useful for rapidly finding large matches, but does not offer granularity for exploring other aspects of the genome. The above method allows matching of nucleotide and amino acid sequences on both position one and position two nucleotide. This often reveals "repeat motifs" where a sequence of nucleotides is repeated multiple times within a pathogen. The current thinking is that this is an "immuno-evasive strategy," allowing the pathogen to escape detection by the host's immune system. Additionally, by recording the location of the matches, the distance between matches can be calculated. This is important in deciding if a host antibody would identify healthy tissue and hormones as pathogenic, by having similar shapes. The above algorithm also allows for detecting "SIDs," which are "Substitutions, Insertions, and Deletions," which occur regularly in rapidly evolving pathogens such as viruses.

The output is a csv file that is run through some custom image generation code using libjpeg. An example of the processed data is below

Mp Bmiyamotoi CP006647.2 vs Bburgdorferi GCF 000171735.2 20 1 100 nuc.gif

The image shows matching sequences between two bacteria, B. burgdorferi, which causes Lyme disease, and B. miyamotoi, a bacteria in the same Borrelia family. Matches are occurring in both the forward and reverse directions. In this particular case, the B. miyamotoi sequence was assembled from fragments in both the "sense" and "antisense" direction. This is an artifact of how PCR is done, but has to be considered when doing analysis. Running only forward matching can lead to missed matches (a "false negative"). This comparison is particularly important because some people will exhibit symptoms of Lyme disease but only test weakly positive or not at all for Lyme disease. This is known in science but not practiced in medicine, leading to erroneous conclusions about the cause of the patient's symptoms (see Borrelia miyamotoi infection leads to cross-reactive antibodies to the C6 peptide in mice and men).

The B. burgdefori sequence is 1,455,375 nucleotides long, and the B. miyamotoi sequence is 907,293 nucleotides long. Each nucleotide is compared at least once, and when there are partial matches too short to place the nucleotide into an exclusion region, it may be compared more than once. In this case, more than 2.6 trillion comparisons are made (one comparison is forward against forward, and one is forward against reverse; reverse against reverse is the same as forward against forward because of the inverse nucleotide arrangement of DNA). When searching human genes, the comparisons can be on the order of quadrillions (10^15) and take hours even on fast CPUs even with multi-core threading. Optimizing the comparison code saves time and power consumption, and is the focus of the profiling done below.

pmcstat

Default setting is -w 5 for five second output with per interval incrementing. -C for cumulative counting.

time  pmcstat -w 1 -v -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -s CVD_OM002793.1 CVD_OM003364.1

No load w/ smt=1

$ time  pmcstat -w 1 -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
num args: 9
-l optarg = 12
multithreading disabled
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
# p/pm_l1_icache_miss 
                30312 
                    0 
                    0 
                    0 
                    0 
                 1459 
                    0 
                    0 
                    0 
                    0 
                    0 number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv

                 2399 
       11.18 real         0.01 user         0.00 sys

No load w/ smt=4

$ time  pmcstat -w 1 -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
num args: 9
-l optarg = 12
multithreading disabled
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
# p/pm_l1_icache_miss 
                47223 
                    0 
                90369 
                    0 
                90800 
                64188 
                    0 
                    0 
                    0 
                    0 
                    0 number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv

                 2411 
       11.18 real         0.01 user         0.00 sys

Load w/ smt=1

 
$ time  pmcstat -w 1 -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
num args: 9
-l optarg = 12
multithreading disabled
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
# p/pm_l1_icache_miss 
                30449 
                    0 
                    0 
                    0 
                    0 
                 1493 
                    0 
                    0 
                    0 
                    0 
                    0 number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv

                 2569 
       11.17 real         0.00 user         0.00 sys

Load w/ smt=4

$ time  pmcstat -w 1 -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
num args: 9
-l optarg = 12
multithreading disabled
input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
o_st.st_size: 29766 , t_st.st_size: 29766
min_pct_f = 1.00
min_len = 12
n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
# p/pm_l1_icache_miss 
                61484 
                    0 
                    0 
                    0 
                    0 
                    0 
                    0 
                    0 
                    0 
                 2348 
                    0 
                    0 
                    0 
                    0 
                    0 
                    0 
                    0 
                    0 
                    0 number of sequences: 222
longest sequence: 3270 ([19475, 19475), (22744, 22744)]
file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv

                 7596 
       19.84 real         0.00 user         0.01 sys

perf

Additional Resources

POWER9 User Manual v21

POWER9 Performance Monitoring Unit User Guide v12

POWER CPU Memory Affinity 3 - Scheduling processes to SMT and Virtual Processors

https://www.ibm.com/docs/en/linux-on-systems?topic=linuxonibm/performance/tuneforsybase/smtsettings.htm

George Neville-Neil's brief tutorial on pmcstat

Difference between revisions of "SMT profiling with pmcstat and perf"

Revision as of 10:19, 15 July 2025

Contents

Symmetric multithreading (SMT)

SMT principles

SMT is "multi-thread" not "multi-core"

Comparison to RISC-V HARTs

Benchmark code

pmcstat

perf

Additional Resources

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools

Print/export

@@ Line 1: / Line 1: @@
-This article discusses profiling symmetric multithreading (SMT) on the POWER9 architecture.  It uses both Debian with perf and Big-endian FreeBSD and pmcstat.
+This article discusses profiling symmetric multithreading (SMT) on the POWER9 architecture.  It uses both Big-endian FreeBSD with pmcstat and Debian with perf.
+The knowledge presented here was derived from a variety of sources which can be found in the [[#Additional Resources]] section.
 === Symmetric multithreading (SMT) ===
 ==== SMT principles ====
-===== SMT is not "multi-core" =====
+===== SMT is "multi-thread" not "multi-core" =====
+This is an important distinction.  SMT is a technology that increases throughput of instructions through parallelization where there are under-used CPU components.  While SMT4 can support four threads per core and SMT8 can support eight threads per core, this is not an additional three and seven cores, respectively.  There are trade-offs and benefits.  Per-thread performance declines with increasing utilization of SMT levels, but overall performance and power consumption efficiency increase.  Note that IBM did not market SMT as "multi-core," while several media sites conflated SMT with increased core count.
 ==== Comparison to RISC-V HARTs ====
 === Benchmark code ===
+The code being profiled and used for benchmarking is the genomic comparison code that the [https://mpjanson.org M. P. Janson Institute for Analytical Medicine] uses to look for molecular mimicry between pathogens and human tissue and hormones.  This code uses a variety of techniques to compare nucleotide and amino acid sequences.  There are both byte-by-byte and vector versions.  The raw source code can be found at [[TBD]]
+Fundamentally the algorithm just compares characters in two sequences for matching.  The goal is to match consecutive changing characters and record the string and location of the matches.  Unlike strstr or similar functions, the string to be matched is not known in advance.  The first character in one sequence is compared against the first character in the other sequence.  If it matches, the second characters are compared.  If they match, the third characters are compared, and so on, until the characters do not match.  If the length of the matched string is below the minimum desired length, the match is discarded and the comparisons start over at the next character.  This can "slide" the comparisons relative to each other, so if the first four characters match but not the fifth, the comparison starts over at the first character of one sequence and the fifth character of the other.  If there is a match of sufficient length, that two-dimensional region is marked as an "exclusion region" and not searched again.
+Nucleotides are denoted as A, T, C, and G.  There are three nucleotides in an amino acid, and 23 amino acids in humans, which  leads to "isocodon" amino acids.  These have the same function but different nucleotide combinations.  There are amino acids that denote the initiation and termination of a gene.  The amino acid is determined by its offset from the initiation amino acid (ATG, methionine), which starts an "Open Reading Frame (ORF)."  ORFs can overlap.  DNA has a forward ("sense") and backward ("anti-sense") direction, so there are three possible amino acids reading forward, and three when reading backwards.  Nucleotide sequences are usually published in the forward direction, and the other strand's nucleotide matching the forward direction is inferred (A matches T, and C matches G, and vice versa).  Genome sequences can be obtained from the [https://www.ncbi.nlm.nih.gov/gene/ NIH National Library of Medicine] gene database, as well as in other places.  The particular sequence is identified by its "accession number," and sequencing of the same organism by different labs does lead to different accession numbers.  Some sequences are "canonical," meaning the generally accepted sequence that is often an assembly of accessions, and others are "contributions," meaning independently obtains sequences submitted for recording.
+The most common comparison method is BLAST, which uses a database of pre-identified nucleotide sequences.  This is useful for rapidly finding large matches, but does not offer granularity for exploring other aspects of the genome.  The above method allows matching of nucleotide and amino acid sequences on both position one and position two nucleotide.  This often reveals "repeat motifs" where a sequence of nucleotides is repeated multiple times within a pathogen.  The current thinking is that this is an "immuno-evasive strategy," allowing the pathogen to escape detection by the host's immune system.  Additionally, by recording the location of the matches, the distance between matches can be calculated.  This is important in deciding if a host antibody would identify healthy tissue and hormones as pathogenic, by having similar shapes.  The above algorithm also allows for detecting "SIDs," which are "Substitutions, Insertions, and Deletions," which occur regularly in rapidly evolving pathogens such as viruses.
+The output is a csv file that is run through some custom image generation code using libjpeg.  An example of the processed data is below
+<center>[[file:Mp_Bmiyamotoi_CP006647.2_vs_Bburgdorferi_GCF_000171735.2_20_1_100_nuc.gif]]</center>
+The image shows matching sequences between two bacteria, B. burgdorferi, which causes Lyme disease, and B. miyamotoi, a bacteria in the same Borrelia family.  Matches are occurring in both the forward and reverse directions.  In this particular case, the B. miyamotoi sequence was assembled from fragments in both the "sense" and "antisense" direction.  This is an artifact of how PCR is done, but has to be considered when doing analysis.  Running only forward matching can lead to missed matches (a "false negative").  This comparison is particularly important because some people will exhibit symptoms of Lyme disease but only test weakly positive or not at all for Lyme disease.  This is known in science but not practiced in medicine, leading to erroneous conclusions about the cause of the patient's symptoms (see [https://doi.org/10.1016/j.cmi.2019.07.026 Borrelia miyamotoi infection leads to cross-reactive antibodies to the C6 peptide in mice and men]).
+The B. burgdefori sequence is 1,455,375 nucleotides long, and the B. miyamotoi sequence is 907,293 nucleotides long.  Each nucleotide is compared at least once, and when there are partial matches too short to place the nucleotide into an exclusion region, it may be compared more than once.  In this case, more than 2.6 trillion comparisons are made (one comparison is forward against forward, and one is forward against reverse; reverse against reverse is the same as forward against forward because of the inverse nucleotide arrangement of DNA).  When searching human genes, the comparisons can be on the order of ''quadrillions'' (10^15) and take hours even on fast CPUs even with multi-core threading.  Optimizing the comparison code saves time and power consumption, and is the focus of the profiling done below.
+=== pmcstat ===
+Default setting is -w 5 for five second output with per interval incrementing.  -C for cumulative counting.
+<pre>
+time  pmcstat -w 1 -v -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -s CVD_OM002793.1 CVD_OM003364.1
+</pre>
+No load w/ smt=1
+<pre>
+$ time  pmcstat -w 1 -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
+num args: 9
+-l optarg = 12
+multithreading disabled
+input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
+o_st.st_size: 29766 , t_st.st_size: 29766
+min_pct_f = 1.00
+min_len = 12
+n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
+# p/pm_l1_icache_miss
+number of sequences: 222
+longest sequence: 3270 ([19475, 19475), (22744, 22744)]
+file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
+.18 real         0.01 user         0.00 sys
+</pre>
+No load w/ smt=4
+<pre>
+$ time  pmcstat -w 1 -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
+num args: 9
+-l optarg = 12
+multithreading disabled
+input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
+o_st.st_size: 29766 , t_st.st_size: 29766
+min_pct_f = 1.00
+min_len = 12
+n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
+# p/pm_l1_icache_miss
+number of sequences: 222
+longest sequence: 3270 ([19475, 19475), (22744, 22744)]
+file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
+.18 real         0.01 user         0.00 sys
+</pre>
+Load w/ smt=1
+<pre>
+$ time  pmcstat -w 1 -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
+num args: 9
+-l optarg = 12
+multithreading disabled
+input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
+o_st.st_size: 29766 , t_st.st_size: 29766
+min_pct_f = 1.00
+min_len = 12
+n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
+# p/pm_l1_icache_miss
+number of sequences: 222
+longest sequence: 3270 ([19475, 19475), (22744, 22744)]
+file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
+.17 real         0.00 user         0.00 sys
+</pre>
+Load w/ smt=4
+<pre>
+$ time  pmcstat -w 1 -p pm_l1_icache_miss ./cmp_mp -l 12 -j 2 -T -s CVD_OM002793.1 CVD_OM003364.1
+num args: 9
+-l optarg = 12
+multithreading disabled
+input/CVD_OM002793.1.fasta, input/CVD_OM003364.1.fasta
+o_st.st_size: 29766 , t_st.st_size: 29766
+min_pct_f = 1.00
+min_len = 12
+n_jobs = 2, map_size = 29764 (0x7444), slice_mod 0, against_sz 29764
+# p/pm_l1_icache_miss
+number of sequences: 222
+longest sequence: 3270 ([19475, 19475), (22744, 22744)]
+file name: mp_CVD_OM003364.1_vs_CVD_OM002793.1_2_12_0_100_nuc.csv
+.84 real         0.00 user         0.01 sys
+</pre>
+=== perf ===
+=== Additional Resources ===
+[[:File:POWER9 um OpenPOWER v21 10OCT2019 pub.pdf|POWER9 User Manual v21]]
+[[:File:POWER9_PMU_UG_v12_28NOV2018_pub.pdf|POWER9 Performance Monitoring Unit User Guide v12]]
+[https://www.ibm.com/support/pages/power-cpu-memory-affinity-3-scheduling-processes-smt-and-virtual-processors POWER CPU Memory Affinity 3 - Scheduling processes to SMT and Virtual Processors]
+https://www.ibm.com/docs/en/linux-on-systems?topic=linuxonibm/performance/tuneforsybase/smtsettings.htm
+[https://freebsdfoundation.org/wp-content/uploads/2014/03/Understanding-Application-and-System-Performance-with-HWPMC4.pdf George Neville-Neil's brief tutorial on pmcstat]