transduce benchmarks change with additional CPUs?
Observations: Performance improves roughly linearly with more CPUs.
See also:
The benchmarks are defined here, similarly to before.
We'll test vectors increasing in length from one element to one hundred thousand elements, by powers of ten. Each element is a pre-generated random
floating-point number, and for each element we'll construct a hashmap of eighteen mathematical operations (trig functions, logarithms, etc.) that the
JVM compiler oughtn't be able to optimize away. To amplify the computational requirements, we'll apply that hashmap of operations five times per
element, giving a total of ninety computations per element of the vector. We'll use the Criterium benchmarking library to measure
the execution times of sixty repetitions of each condition. Benchmarks were run on a 64-CPU machine with 128 GiB of memory (an Amazon Elastic Compute
Cloud c6i.16xlarge instance). The taskset utility explicitly pins
the process affinity to exactly 1, 2, 4, 8, 16, 32, or 64 CPUs.
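As a rough sketch of that setup (assuming Criterium is on the classpath; the names `test-vectors` and `run-condition` are illustrative placeholders, not the actual benchmark code):

```clojure
;; A minimal sketch of the benchmark setup. `run-condition` stands in
;; for the real per-condition work described below.
(require '[criterium.core :as crit])

;; Vectors of pre-generated random doubles, one element to 100,000
;; elements, by powers of ten.
(def test-vectors
  (mapv (fn [n] (vec (repeatedly n rand)))
        [1 10 100 1000 10000 100000]))

(defn run-condition
  "Placeholder for one benchmark condition (illustrative)."
  [v]
  (reduce + (map #(Math/sin ^double %) v)))

;; Criterium handles JVM warm-up and statistical sampling; `bench`
;; reports the mean execution time and variance of the expression.
(comment
  (doseq [v test-vectors]
    (crit/bench (run-condition v))))
```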
Overall, we observe that execution times increase with vector length. When the vector is short, the single-threaded transduction is
slightly faster. Once the vector grows to a few thousand elements, adding processors (i.e., more resources to run the threads' tasks) decreases the
computation time roughly in proportion to the number of processors. Brokvolli's multi-threaded transduce and its companion multi-threaded
transduce-kv therefore offer improved performance when executing compute-heavy tasks on collections with large numbers of elements.
This test performs the following eighteen mathematical operations per element: cube root, ceiling, cosine, cubing,
decrement, natural exponentiation, exponentiation, floor, identity, increment,
logarithm, natural logarithm, multiplication by π, negation, rounding, sine, square root, and
tangent. For each element, that group of operations is performed five times on jittered values of the element (i.e., (* 18 5) =>
90 operations per vector element). The "version" labels in the chart legend correspond to the taskset --cpu-list directives:
0-3 pins the benchmark to processors zero through three (i.e., four processors), 0-31 pins the benchmark to processors zero
through thirty-one (i.e., thirty-two processors), etc.
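Concretely, the per-element workload can be sketched like this (the map below is an illustrative reconstruction of the operations listed above, not Brokvolli's actual benchmark code):

```clojure
;; Eighteen mathematical operations per element, applied to five
;; jittered copies of the element: (* 18 5) => 90 computations.
(def ops
  {:cube-root #(Math/cbrt %)     :ceiling   #(Math/ceil %)
   :cosine    #(Math/cos %)      :cube      #(* % % %)
   :decrement dec                :nat-exp   #(Math/exp %)
   :expt      #(Math/pow % 2.0)  :floor     #(Math/floor %)
   :identity  identity           :increment inc
   :log10     #(Math/log10 %)    :nat-log   #(Math/log %)
   :times-pi  #(* Math/PI %)     :negate    -
   :round     #(Math/round (double %))
   :sine      #(Math/sin %)      :sqrt      #(Math/sqrt %)
   :tangent   #(Math/tan %)})

(defn work
  "Apply all eighteen operations to five jittered copies of x."
  [x]
  (doall
    (for [_ (range 5), [_ f] ops]
      (f (+ x (* 1e-6 (rand)))))))

(count (work 0.5)) ;; => 90
```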
When the sample vector contains one hundred or fewer elements, the 1-processor benchmark (a proxy for single-threaded transduce) runs
slightly faster (blue circles), likely owing to the minimal overhead of managing the threads running on a single CPU. Above one thousand elements, the
multi-processor (i.e., >1 CPU) tests demonstrate a speedup. For vectors containing ten thousand or one hundred thousand elements, the evaluation times
decrease with increasing numbers of processors/threads. (Note that the logarithmic scale may disguise the magnitude of the speedups; select the "Show
details" button to see the measured times.)
Here's a rough summary of the speedups for one-hundred-thousand-element vectors.
| number of processors | speedup | parallel efficiency (speedup ÷ processors) |
|---|---|---|
| 1 | 1× | — |
| 2 | 1.6× | 80% |
| 4 | 3.7× | 93% |
| 8 | 6.5× | 81% |
| 16 | 11× | 69% |
| 32 | 22× | 69% |
| 64 | 24× | 38% |
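The right-hand column is simply the measured speedup divided by the ideal linear speedup (which equals the number of processors); a quick sketch of the arithmetic:

```clojure
;; Parallel efficiency: the measured speedup as a fraction of the
;; ideal linear speedup for a given number of processors.
(defn efficiency [processors speedup]
  (/ speedup processors))

(efficiency 32 22.0) ;; => 0.6875 (~69%)
(efficiency 64 24.0) ;; => 0.375  (38%)
```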
For this synthetic workload, recruiting up to thirty-two processors sped up multi-threaded transduce roughly
proportionately. Two CPUs could chew through the same partitions 1.6× as fast as one CPU; four CPUs could do the job 3.7× as fast as one CPU,
etc. In general, the speedups are roughly two-thirds to nine-tenths of the number of CPUs. In other words, 8 CPUs give us about a 6.5× speedup,
while 16 CPUs give us about an 11× speedup.
An anomalous limit emerges somewhere between thirty-two (gold circles) and sixty-four (magenta diamonds) CPUs, where performance no longer improves proportionally. Investigating this phenomenon is beyond the scope of this report.