SMT profiling with pmcstat and perf

This article discusses profiling symmetric multithreading (SMT) on the POWER9 architecture. It uses both Big-endian FreeBSD with pmcstat and Debian with perf.

The knowledge presented here was derived from a variety of sources which can be found in the #Additional Resources section.

This is an important distinction. SMT is a technology that increases throughput of instructions through parallelization where there are under-used CPU components. While SMT4 can support four threads per core and SMT8 can support eight threads per core, this is not an additional three and seven cores, respectively. There are trade-offs and benefits. Per-thread performance declines with increasing utilization of SMT levels, but overall performance and power consumption efficiency increase. Note that IBM did not market SMT as "multi-core," while several media sites conflated SMT with increased core count.

Comparison to RISC-V HARTs

Benchmark code

The code being profiled and used for benchmarking is the genomic comparison code that the M. P. Janson Institute for Analytical Medicine uses to look for molecular mimicry between pathogens and human tissue and hormones. This code uses a variety of techniques to compare nucleotide and amino acid sequences. There are both byte-by-byte and vector versions. The raw source code can be found at TBD

Fundamentally the algorithm just compares characters in two sequences for matching. The goal is to match consecutive changing characters and record the string and location of the matches. Unlike strstr or similar functions, the string to be matched is not known in advance. The first character in one sequence is compared against the first character in the other sequence. If it matches, the second characters are compared. If they match, the third character is compared, and so on, until the characters do not match. If the length of the matched string is below the minimum desired length, the match is discarded and the comparisons start over at the next character. This can "slide" the comparisons relative to each other, so if the first four characters match but not the fifth, the comparison starts over at the first character of one sequence and the fifth character of the other. If there is a match of sufficient length, that two-dimensional region is marked and not searched again.

Nucleotides are denoted as A, T, C, and G. There are three nucleotides in an amino acid, and 23 amino acids in humans, which leads to "isocodon" amino acids. These have the same function but different nucleotide combinations. There are amino acids that denote the initiation and termination of a gene. The amino acid is determined by its offset from the initiation gene (ATG, methionine). DNA has a forward ("sense") and backward ("anti-sense") direction, so there are three possible amino acids reading forward, and three when reading backwards. Nucleotide sequences are usually published in the forward direction, and the other strand's nucleotide matching the forward direction is inferred (A matches T, and C matches G). Genome sequences can be obtained from the NIH National Library of Medicine gene database, as well as in other places.

The most common comparison method is BLAST, which uses a database of pre-identified nucleotide sequences. This is useful for rapidly finding large matches, but does not offer granularity for exploring other aspects of the genome. The above method allows matching of nucleotide and amino acid sequences on both position one and position two nucleotide. This often reveals "repeat motifs" where a sequence of nucleotides is repeated multiple times within a pathogen. The current thinking is that this is an "immuno-evasive strategy," allowing the pathogen to escape detection by the host's immune system. Additionally, by recording the location of the matches, the distance between matches can be calculated. The above algorithm also allows for detecting "SIDs," which are "Substitutions, Insertions, and Deletions," which occur regularly in rapidly evolving pathogens such as viruese.

pmcstat

perf

Additional Resources

POWER9 User Manual v21

POWER9 Performance Monitoring Unit User Guide v12

POWER CPU Memory Affinity 3 - Scheduling processes to SMT and Virtual Processors

https://www.ibm.com/docs/en/linux-on-systems?topic=linuxonibm/performance/tuneforsybase/smtsettings.htm

George Neville-Neil's brief tutorial on pmcstat

SMT profiling with pmcstat and perf

Contents

Symmetric multithreading (SMT)

SMT principles

SMT is "multi-thread" not "multi-core"

Comparison to RISC-V HARTs

Benchmark code

pmcstat

perf

Additional Resources

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools

Print/export