Transduction benchmarks

How do Brokvolli's `transduce` variants perform compared to Clojure's `reduce`, et al.?

Observations: The single-threaded variants perform similarly, while the multi-threaded variants are faster for large vectors and heavier computations.

See also:

Two of Brokvolli's goals are

Bring transduce's performance benefits to reduce-kv-style processing.
Extend fold's multi-threading performance benefits to both transduce and transduce-kv.

So let's run some benchmarks to see if we can objectively measure any performance improvements.

We'll define our benchmarks here. For each function, we'll test vectors increasing in length from one element to one‑million elements, by powers of ten. We'll arrange four different tests:

Increment each element, a pre-generated, random floating point number.
Increment, filter, and remove, 1× (three total operations).
Increment, filter, and remove, 2× (six total operations).
Increment, filter, and remove, 4× (twelve total operations).

For each test, we'll use the Criterium benchmarking library to measure the execution times of sixty repetitions of each condition. Benchmarks were run on three explicitly-pinned cores of my rusty desktop computer.

The functions under examination are as follows:

clojure.core/reduce
clojure.core/reduce-kv
clojure.core/transduce
brokvolli.single/transduce-kv
brokvolli.multi/transduce
brokvolli.multi/transduce-kv
a naive sequence implementation

Overall, we observe that the execution times increase with increasing vector lengths. The results are indistinguishable when the vector contains one‑hundred or fewer elements. When the vectors grow longer and the processing stack is small (i.e., one or three transforms), the single-threaded functions tend to perform slightly faster. However, when the vector lengths approach one‑million elements and the processing stack involves more transforms, the multi-threaded variants (fold, multi/transduce, and multi/transduce-kv), offer improvements that scale roughly with the number of processors.

Brokvolli's multi-threaded transduce and transduce-kv may offer performance benefits over their single-threaded counterparts for large collection sizes and deep transformer stacks. For smaller collections or shallow transformer stacks, the single-threaded variants perform better, and present a simpler interface.

01 transform: increment integer

This test performs one operation per element: incrementing the number. Since we're only doing one operation, we can straightforwardly compare to reduce and reduce-kv as well.

Execution times increase with vector length, with the single-threaded functions performing best for all vector lengths. With only one operation per element, the overhead of multi-threading machinery results in slower times.

The lines aren't perfectly straight because Clojure requires some non-zero time to reduce over an empty or nearly-empty vector. On this machine, that non-zero time is about 120 microseconds.

(fn [n] ((tactics-1 (project-version-lein)) (vecs n)))

times in seconds, *mean±std*
version	1	10	100	1000	10000	100000	1000000
	arg, n
core-transduce	1.2e-04±2.1e-06	1.2e-04±6.9e-07	1.3e-04±1.2e-06	1.8e-04±1.6e-06	6.6e-04±1.0e-05	5.8e-03±1.2e-04	6.1e-02±3.7e-03
fold	1.3e-04±2.2e-06	1.3e-04±1.1e-06	1.3e-04±7.1e-07	2.1e-04±1.9e-06	6.8e-04±1.8e-05	5.8e-03±1.3e-04	6.4e-02±5.2e-03
multi-transduce	1.3e-04±2.4e-06	1.3e-04±2.7e-06	1.3e-04±2.6e-06	2.2e-04±7.8e-06	7.2e-04±6.0e-05	5.9e-03±1.6e-04	7.2e-02±7.0e-03
multi-transduce-kv	1.3e-04±3.4e-06	1.3e-04±2.1e-06	1.3e-04±2.0e-06	2.2e-04±9.1e-06	7.3e-04±3.3e-05	6.2e-03±2.1e-04	7.0e-02±6.6e-03
reduce	1.2e-04±2.1e-06	1.2e-04±1.2e-06	1.3e-04±1.2e-06	1.7e-04±2.7e-06	5.5e-04±1.5e-05	4.8e-03±1.2e-04	5.0e-02±3.0e-03
reduce-kv	1.2e-04±2.6e-06	1.2e-04±7.6e-07	1.3e-04±1.1e-06	1.6e-04±2.9e-06	5.0e-04±8.1e-06	4.3e-03±1.4e-04	4.5e-02±2.9e-03
single-transduce-kv	1.2e-04±2.4e-06	1.2e-04±2.0e-06	1.3e-04±9.8e-07	1.7e-04±2.5e-06	5.4e-04±8.5e-06	4.6e-03±1.6e-04	4.9e-02±3.1e-03

03 transforms: Transduce with `map`+`filter`+`remove`, 1X

This test involves three operations per element: incrementing the number, filtering (retaining) if the result is less than a specified cutoff, then removing if the result is greater than a different specified cutoff.

With a transform stack of this size, all functions performed roughly similarly.

(fn [n] ((tactics-2 (project-version-lein)) (vecs n)))

times in seconds, *mean±std*
version	1	10	100	1000	10000	100000	1000000
	arg, n
core-transduce	1.3e-04±2.4e-06	1.3e-04±1.6e-06	1.3e-04±1.5e-06	2.1e-04±3.3e-06	9.7e-04±1.0e-05	9.0e-03±1.4e-04	9.4e-02±4.4e-03
fold	1.3e-04±2.2e-06	1.3e-04±9.9e-07	1.4e-04±7.0e-07	2.4e-04±3.4e-06	8.1e-04±2.6e-05	6.9e-03±2.6e-04	8.4e-02±6.1e-03
multi-transduce	1.3e-04±5.0e-06	1.3e-04±2.8e-06	1.4e-04±2.6e-06	2.4e-04±5.5e-06	8.0e-04±3.1e-05	6.8e-03±2.8e-04	7.8e-02±8.8e-03
multi-transduce-kv	1.3e-04±2.8e-06	1.3e-04±2.4e-06	1.3e-04±2.4e-06	2.4e-04±5.4e-06	8.3e-04±3.2e-05	7.0e-03±1.8e-04	8.4e-02±8.0e-03
naive-seq	1.3e-04±1.7e-06	1.3e-04±1.9e-06	1.4e-04±2.0e-06	2.4e-04±2.4e-06	1.3e-03±4.8e-05	1.2e-02±2.4e-04	1.3e-01±8.0e-03
single-transduce-kv	1.3e-04±2.2e-06	1.2e-04±9.1e-07	1.3e-04±1.2e-06	2.0e-04±2.6e-06	8.6e-04±1.6e-05	8.0e-03±1.5e-04	8.3e-02±5.0e-03

06 transforms: Transduce with `map`+`filter`+`remove`, 2X

This test is similar to the previous, except the three operations were performed twice (with different filter/remove cutoffs), for a total of six operations per element. With this deeper transform stack, the multi-threaded variants begin to offer faster performance (i.e., lower time) on vectors longer than one thousand elements.

(fn [n] ((tactics-3 (project-version-lein)) (vecs n)))

times in seconds, *mean±std*
version	1	10	100	1000	10000	100000	1000000
	arg, n
core-transduce	1.3e-04±2.8e-06	1.3e-04±5.3e-07	1.4e-04±1.9e-06	2.9e-04±1.9e-06	1.8e-03±2.3e-05	1.7e-02±2.5e-04	1.8e-01±5.4e-03
fold	1.3e-04±4.1e-06	1.3e-04±2.4e-06	1.5e-04±2.8e-06	3.1e-04±3.0e-05	1.1e-03±1.3e-04	9.3e-03±4.8e-04	1.1e-01±8.4e-03
multi-transduce	1.3e-04±1.8e-06	1.3e-04±3.3e-06	1.4e-04±2.5e-06	3.1e-04±2.5e-05	1.2e-03±1.6e-04	1.0e-02±8.7e-04	1.1e-01±1.6e-02
multi-transduce-kv	1.3e-04±3.5e-06	1.3e-04±3.1e-06	1.4e-04±2.9e-06	2.7e-04±3.6e-06	1.0e-03±8.1e-05	9.1e-03±7.8e-04	9.6e-02±1.1e-02
single-transduce-kv	1.3e-04±2.0e-06	1.3e-04±1.0e-06	1.4e-04±1.3e-06	2.4e-04±4.3e-06	1.3e-03±2.0e-05	1.2e-02±2.1e-04	1.3e-01±5.7e-03

12 transforms: Transduce with `map`+`filter`+`remove`, 4X

Analogous to the previous examples, this test performs twelve operations per element (four cycles of increment, filter, and remove). The multi-threaded variants (fold, multi/transduce, and multi/transduce-kv), demonstrate noticeably improved performance (i.e., lower time) for vectors containing more than one thousand elements.

Note the log-log scale, which may visually obscure the performance deltas. Click 'Show details' to see a numeric chart of the measurements. For example, the multi-threaded variants handle the one-million element vectors about twice as fast as the single-threaded variants.

(fn [n] ((tactics-4 (project-version-lein)) (vecs n)))

times in seconds, *mean±std*
version	1	10	100	1000	10000	100000	1000000
	arg, n
core-transduce	1.3e-04±2.8e-06	1.3e-04±8.7e-07	1.5e-04±1.2e-06	3.7e-04±1.9e-06	2.6e-03±2.8e-05	2.6e-02±3.4e-04	2.7e-01±1.0e-02
fold	1.3e-04±2.8e-06	1.3e-04±1.2e-06	1.5e-04±3.1e-06	3.5e-04±4.7e-06	1.4e-03±1.4e-04	1.2e-02±8.1e-04	1.4e-01±1.5e-02
multi-transduce	1.3e-04±2.5e-06	1.3e-04±1.3e-06	1.5e-04±2.2e-06	3.6e-04±5.6e-06	1.3e-03±3.7e-05	1.1e-02±6.5e-04	1.3e-01±1.4e-02
multi-transduce-kv	1.3e-04±2.9e-06	1.3e-04±2.7e-06	1.5e-04±3.8e-06	3.3e-04±2.0e-05	1.3e-03±1.1e-04	1.1e-02±7.0e-04	1.2e-01±1.3e-02
single-transduce-kv	1.3e-04±2.7e-06	1.3e-04±9.8e-07	1.5e-04±1.1e-06	3.4e-04±2.5e-06	2.2e-03±1.9e-05	2.2e-02±2.7e-04	2.3e-01±8.0e-03