How does merge's performance affect the tasks it's involved with?
Observation: Assembling a hashmap by merging imposes substantial overhead during multi-threaded reductions. A faster merging implementation would therefore speed them up considerably.
It's all well and good to benchmark various merging strategies in isolation, but what are the practical consequences of one merge implementation being slower than another?
merge walks its input arguments, so it is perhaps not well-suited for the combining phase of a multi-threaded fold or
transduce task. We have previously measured merge's performance in isolation, but it's difficult to say whether it
is 'good' or 'bad' without something meaningful to compare it to. Let's measure three reduction-style tasks:
1. Reduce over large hashmaps (up to one million key-value pairs) with multi-threaded fold and transduce, which require merging sub-partitions, and compare against the same task done with reduce, which does not require merging.
2. Reduce over large vectors with the same three utilities, the multi-threaded variants requiring concatenation instead of merging.
3. Reduce over large vectors, summing the results, wherein none of the utilities require concatenation.
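To make the first task concrete, here is a hedged sketch (hypothetical names: big-map stands in for the generated test data, and assoc-ing an incremented value stands in for the real per-entry work):

```clojure
(require '[clojure.core.reducers :as r])

;; Single-threaded baseline: one thread, no combining step.
(reduce-kv (fn [acc k v] (assoc acc k (inc v))) {} big-map)

;; Multi-threaded fold: partitions are reduced in parallel, then the
;; per-partition hashmaps are combined with merge.
(r/fold merge (fn [acc k v] (assoc acc k (inc v))) big-map)
```

The combining function, merge, is exactly where the overhead under investigation lives.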
Ideas:
- Multi-threaded reducing functions provide speed-ups over single-threaded reducing functions when the accumulating value is a scalar.
- Multi-threaded reducing functions provide speed-ups over single-threaded reducing functions when the accumulating value is a vector, assembled with a fast combining function like concat.
- Multi-threaded reducing functions do not provide speed-ups over single-threaded reducing functions (and are in fact slower) when the accumulating value is a hashmap, assembled with a slow combining function, merge.
Overall, we observe that the multi-threaded reducing functions provide performance benefits when the accumulating value is a scalar (see Combining with addition below) or when the accumulating value is a collection with a reasonably efficient combining function,
like concatenating vectors (see Combining with concatenation below). When the multi-threaded reducing functions must
partition the input hashmap and build up an output hashmap (see Reducing over large hashmaps), their performance
lags far behind that of single-threaded reduce.
Given the improved multi-threaded performance with concatenation, the performance degradation is likely due to partitioning the input
hashmaps (primarily with select-keys) and assembling an output hashmap (primarily with merge). Poor merging performance is
thus a key culprit in the degraded multi-threaded reducing performance, and improving merge is a ripe opportunity for speeding up
multi-threaded reductions.
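The two suspect steps can be sketched in isolation. This is a hypothetical illustration of the mechanism, not the benchmark code itself:

```clojure
;; Partition a hashmap into sub-maps, reduce each, then merge the results.
(let [m     (zipmap (range 8) (range 8))
      parts (map #(select-keys m %)          ; select-keys walks each key group
                 (partition-all 4 (keys m)))]
  ;; ...in a real fold, each part is reduced on its own thread...
  (reduce merge {} parts))                   ; merge walks its arguments
;; => a hashmap equal to m
```

Both select-keys and merge must walk their inputs entry by entry, so their cost grows with the size of the hashmap being partitioned and re-assembled.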
We'll consider three reduction functions:
- reduce-kv, the single-threaded baseline.
- fold, an off-the-shelf multi-threaded reducing function.
- transduce-kv, a third-party, multi-threaded -kv variant of Clojure's transduce.
Execution times increase with hashmap size, as we should expect. For all sizes of hashmaps, single-threaded reduce-kv (gold diamonds)
provides the fastest performance. In fact, the multi-threaded reducing functions, whose only purpose is to reduce faster, require four to ten times
as long to generate the output hashmaps.
| version | n = 1 | n = 10 | n = 100 | n = 1000 | n = 10000 | n = 100000 | n = 1000000 |
|---|---|---|---|---|---|---|---|
| fold | 1.4e-04±2.7e-06 | 1.6e-04±7.0e-06 | 2.6e-04±2.3e-05 | 1.3e-03±3.2e-05 | 9.7e-03±3.1e-04 | 1.3e-01±5.8e-03 | 2.0e+00±8.8e-02 |
| reduce-kv | 1.4e-04±1.6e-06 | 1.5e-04±5.7e-07 | 1.7e-04±6.5e-07 | 5.1e-04±1.3e-05 | 3.5e-03±6.2e-05 | 3.6e-02±4.5e-04 | 4.8e-01±9.0e-02 |
| transduce-kv | 1.4e-04±2.1e-06 | 1.5e-04±2.2e-06 | 1.8e-04±5.1e-07 | 1.2e-03±9.0e-06 | 2.0e-02±2.1e-04 | 3.3e-01±1.9e-02 | 5.3e+00±9.7e-02 |
To put the merging performance penalty into context, we'll look at two similar reductions that do not involve merge-ing
hashmaps.
The upper of the two following charts (and accompanying table) compares the evaluation times of summing the numbers within vectors. This task consumes a
collection but does not build up an output collection, merely a running total: a single number. The multi-threaded fold and
transduce aren't hobbled by having to combine the built-up collections created by the partitioning.
The task performed is something like this:

```clojure
(reduce (fn [acc x] (+ acc (lots-of-math x))) 0 [1.23 2.34 3.45 ...])
```
lots-of-math being eighteen different arithmetic operations involving logarithms, trig functions, etc., that ought to resist
optimization by the JVM.
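The exact eighteen operations aren't shown; a shorter stand-in with the same flavor (a chain of transcendental operations the JIT cannot fold away) might look like this. The name and body are hypothetical:

```clojure
(defn lots-of-math
  "Hypothetical stand-in for the benchmark's work function: chained
  logarithm, trig, and root operations that resist JVM optimization."
  [x]
  (-> x
      (+ 1.0)
      Math/log
      Math/sin
      (* 3.7)
      Math/cos
      Math/abs
      Math/sqrt
      Math/atan))
```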
For vectors containing fewer than one thousand elements, performance is nearly identical, as expected. The multi-threaded reducing functions do not
partition when the input collection contains fewer than 512 elements. In the middle regime, around one thousand elements, single-threaded
reduce performs somewhat faster because of the overhead imposed by the threading and partitioning machinery. Once the vectors contain
ten thousand elements or more, that overhead starts to pay off, and the multi-threaded reducing functions perform about three times as fast as
single-threaded reduce.
Keep in mind that this task involves partitioning the input collection, but the combine phase is mere addition (i.e., computationally cheap).
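A hedged sketch of the multi-threaded version of this summing task (big-vector is a hypothetical name for the input; lots-of-math as described above):

```clojure
(require '[clojure.core.reducers :as r])

;; + serves double duty: its zero arity supplies the identity element (0)
;; for each partition, and combining partition totals is one cheap addition.
;; 512 is r/fold's default partition size, matching the threshold noted above.
(r/fold 512 + (fn [acc x] (+ acc (lots-of-math x))) big-vector)
```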
The lower chart compares the evaluation times of computing eighteen arithmetic values per vector element, building up a new output vector with the results. Something like this:

```clojure
(reduce (fn [acc x] (conj acc (lots-of-math x))) [] [1.23 2.34 3.45 ...])
```
Notice that instead of merely summing the lots-of-math result with an accumulating number, we are conjoining onto an ever-growing output collection, a vector. In principle, building an output collection should be a more costly operation.
In contrast to the previous subsection that involved basic summation, building up that output vector requires more involved machinery. The
multi-threaded reducing variants, fold and transduce, must partition the input vector, manage a thread pool, distribute and
collect the results from the partitions, then finally re-assemble the sub-collections into a single final collection.
Compared to the upper chart, which showed a three-times speedup over single-threaded reduce, the multi-threaded variants here show only
about a two-times speedup. We might attribute this decrease to the need to assemble an output collection. In this case, with vectors,
we are assembling with concat.
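A hedged sketch of that vector-building fold, with concatenation as the combine step (hypothetical names again; the benchmark's actual combining function may differ in detail):

```clojure
(require '[clojure.core.reducers :as r])

;; The combining function needs a zero arity to seed each partition's
;; accumulator; the two-arity joins finished partition results.
(r/fold (fn ([] [])
            ([left right] (concat left right)))
        (fn [acc x] (conj acc (lots-of-math x)))
        big-vector)
```

Note that concat returns a lazy seq rather than a vector; pouring the right side into the left with into is a common eager alternative.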
| version | n = 1 | n = 10 | n = 100 | n = 1000 | n = 10000 | n = 100000 | n = 1000000 |
|---|---|---|---|---|---|---|---|
| pfold | 1.4e-04±4.2e-07 | 1.5e-04±9.6e-07 | 1.6e-04±7.2e-07 | 2.9e-04±6.2e-06 | 8.4e-04±7.6e-06 | 6.6e-03±2.7e-04 | 6.2e-02±1.3e-03 |
| preduce | 1.5e-04±2.2e-06 | 1.5e-04±8.8e-07 | 1.7e-04±1.8e-06 | 3.3e-04±4.9e-06 | 1.9e-03±5.5e-05 | 1.7e-02±1.2e-04 | 1.7e-01±1.9e-03 |
| ptransduce | 1.4e-04±1.6e-06 | 1.5e-04±2.4e-06 | 1.6e-04±3.1e-06 | 2.8e-04±6.5e-06 | 9.2e-04±1.5e-04 | 6.3e-03±1.1e-04 | 5.9e-02±6.9e-04 |
| version | n = 1 | n = 10 | n = 100 | n = 1000 | n = 10000 | n = 100000 | n = 1000000 |
|---|---|---|---|---|---|---|---|
| vfold | 1.3e-04±1.6e-06 | 1.4e-04±2.1e-06 | 1.5e-04±7.4e-07 | 3.1e-04±1.2e-05 | 1.2e-03±8.8e-05 | 1.0e-02±1.5e-04 | 1.0e-01±8.2e-04 |
| vreduce | 1.4e-04±3.9e-06 | 1.4e-04±2.2e-06 | 1.6e-04±2.1e-06 | 3.3e-04±8.6e-07 | 2.1e-03±4.3e-05 | 2.0e-02±1.3e-04 | 2.0e-01±1.6e-03 |
| vtransduce | 1.3e-04±4.4e-07 | 1.3e-04±6.4e-07 | 1.5e-04±1.3e-06 | 2.9e-04±1.2e-05 | 1.2e-03±1.2e-05 | 1.1e-02±2.2e-04 | 1.1e-01±4.1e-04 |