How do the transduce benchmarks vary with different partition sizes?
Observations: Partitions must be significantly smaller than the input vector to realize a performance improvement.
See also:
The benchmarks defined here are similar to those before. To explore the effects of partition size, we'll consider only multi-threaded transduce. We'll test vectors increasing in length from one element to one hundred thousand elements, by powers of ten. For each element, a pre-generated random floating-point number, we'll construct a hashmap of eighteen mathematical operations on that number (trig functions, logarithms, etc.) that the JVM compiler oughtn't be able to optimize away. To amplify the computational requirements, we'll construct variations of that hashmap five times for each element. We'll use the Criterium benchmarking library to measure the execution times of sixty repetitions of each condition.
Benchmarks were run on three explicitly-pinned cores of my antique desktop computer.
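As a rough, self-contained sketch of how one such condition could be measured with Criterium — the `random-vector` helper and the stand-in `work` function are illustrative assumptions, not the benchmark code itself (the real per-element work is the eighteen-operation map described further below):

```clojure
(require '[criterium.core :as crit])

;; Pre-generate the input once so random-number generation isn't timed.
(defn random-vector [n]
  (vec (repeatedly n rand)))

;; Stand-in for the real per-element work (the eighteen-operation map is
;; sketched later); any computation the JIT can't fold away will do.
(defn work [x]
  {:sin (Math/sin x) :log (Math/log x) :sqrt (Math/sqrt x)})

(def input (random-vector 100000))

;; Criterium's bench collects sixty samples by default.
(comment
  (crit/bench (transduce (map work) conj [] input)))
```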
Overall, we observe that execution times increase with vector length. The results are essentially indistinguishable when the vector contains ten or fewer elements. When the partition size equals the vector's length, the transduction runs just as fast as single-threaded transduce, as expected. Once the vector length exceeds the partition size, we see performance benefits from the multi-threading. When the vector length is one hundred elements or more and the partition size results in multiple threads, the processing runs roughly twice as fast.
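The multi-threaded transduce itself is not shown here; purely as an assumption about its general shape, a minimal sketch of partitioned, multi-threaded transduction might look like the following (the name `chunked-transduce` is hypothetical, and the combine step assumes the reducing function is associative):

```clojure
;; Split the input into fixed-size chunks, transduce each chunk on its
;; own thread via pmap, then combine the partial results.
(defn chunked-transduce [xform f init part-size v]
  (->> (partition-all part-size v)
       (pmap #(transduce xform f init %))
       (reduce f init)))

;; When part-size >= (count v) there is only one chunk, so this collapses
;; to ordinary single-threaded transduce, matching the observation above
;; that a partition equal to the vector's length shows no speed-up.
(chunked-transduce (map inc) + 0 10 (vec (range 100)))   ; => 5050
```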
This test performs the following mathematical operations per element: cube root, ceiling, cosine, cubing,
decrement, natural exponentiation, exponentiation, floor, identity, increment,
logarithm, natural logarithm, multiply by π, negation, round, sine, square-root, and
tangent. For each element, that group of operations is performed five times on jittered values of the element (i.e., (* 18 5) => 90 operations per vector element). The partition sizes are the zero-padded integers following "version" in the chart legend.
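For reference, a sketch of what that eighteen-operation map could look like in Clojure; the keyword names, the jitter scheme, and the function names are illustrative assumptions rather than the benchmarked code:

```clojure
;; The eighteen operations applied to one (jittered) value x.
(defn ops-map [x]
  {:cbrt  (Math/cbrt x)    :ceil  (Math/ceil x)   :cos   (Math/cos x)
   :cube  (* x x x)        :dec   (dec x)         :exp   (Math/exp x)
   :pow   (Math/pow x 2.0) :floor (Math/floor x)  :id    x
   :inc   (inc x)          :log10 (Math/log10 x)  :ln    (Math/log x)
   :pi*   (* Math/PI x)    :neg   (- x)           :round (Math/round x)
   :sin   (Math/sin x)     :sqrt  (Math/sqrt x)   :tan   (Math/tan x)})

;; Five jittered variations per element: (* 18 5) => 90 operations.
(defn element-work [x]
  (mapv #(ops-map (+ x (* 1e-9 %))) (range 5)))
```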
For vectors of one or ten elements, none of the partition sizes are distinguishable, as expected. For the one-hundred-element vector, we see a performance benefit from the multi-threading when the vector is partitioned into chunks of about ten (blue circles). For the one-thousand-element vector, partition sizes of ten or one hundred (gold diamonds) provide performance improvements (i.e., shorter times). This pattern continues: when the vector is partitioned into roughly ten chunks, we observe a factor-of-two performance improvement.
Notice that the performance improves when the partition size is at least one decade smaller than the vector length.
| vector length | best time (s) | worst time (s) | speed-up factor | chunk size at improvement |
|---|---|---|---|---|
| 100 | 1.4E-3 | 2.7E-3 | 1.9× | 10 |
| 1000 | 1.1E-2 | 2.9E-2 | 2.6× | 10–100 |
| 10000 | 1.3E-1 | 3.2E-1 | 2.5× | 100–1000 |
| 100000 | 1.6E0 | 2.9E0 | 1.8× | 100–10000 |
Note that a partition size of one hundred thousand (red triangles) is as large as the largest tested vector. The red triangles therefore represent no chunking for any tested vector length, and their performance is what we would observe with single-threaded transduce.
Execution times (s, mean ± variation) by partition size ("version", rows) and vector length n (columns):

| version (partition size) | n = 1 | n = 10 | n = 100 | n = 1000 | n = 10000 | n = 100000 |
|---|---|---|---|---|---|---|
| 000010 | 1.6e-04±7.8e-06 | 3.7e-04±2.2e-05 | 1.3e-03±1.7e-04 | 1.1e-02±2.0e-03 | 1.2e-01±3.3e-02 | 1.4e+00±2.0e-01 |
| 000100 | 1.6e-04±7.4e-06 | 4.0e-04±5.6e-05 | 2.6e-03±2.1e-04 | 1.2e-02±2.5e-03 | 1.5e-01±4.1e-02 | 1.3e+00±9.4e-02 |
| 001000 | 1.6e-04±9.1e-06 | 4.4e-04±6.9e-05 | 2.6e-03±2.0e-04 | 2.9e-02±4.2e-03 | 1.2e-01±2.1e-02 | 1.6e+00±2.1e-01 |
| 010000 | 1.6e-04±8.3e-06 | 4.0e-04±5.4e-05 | 2.6e-03±2.0e-04 | 2.7e-02±3.1e-03 | 3.1e-01±3.0e-02 | 1.7e+00±2.4e-01 |
| 100000 | 1.8e-04±1.9e-05 | 4.0e-04±2.5e-05 | 2.7e-03±2.7e-04 | 3.0e-02±5.9e-03 | 3.0e-01±5.2e-02 | 3.1e+00±2.2e-01 |