Benchee 1.1.0 released + why did it take so long

Benchee 1.1.0 has finally hit After, well, almost 3 years. So, in this blog post we’ll dive into:

  1. What are the changes
  2. Why did it take so long, with some (significant) musings on Open Source and bugs as well as my approach to it

What does Benchee 1.1.0 Bring to the table

The star of the show certainly are the two new major features: reduction measurements and profiling! Then there is also a nasty bug that was squashed. Check out the Changelog for all.

Reduction Counting

Reductions joins execution time and memory consumption as the third measure Benchee can take. This one was kicked off way back when someone asked in our #benchee channel about adding this feature. What reductions are, is hard to explain. In short, it’s not very well defined but a “unit of work”. The BEAM uses them to keep track of how long a process has run. As the Beam Book puts it as follows:

BEAM solves this by keeping track of how long a process has been running. This is done by counting reductions. The term originally comes from the mathematical term beta-reduction used in lambda calculus.

The definition of a reduction in BEAM is not very specific, but we can see it as a small piece of work, which shouldn’t take too long. Each function call is counted as a reduction. BEAM does a test upon entry to each function to check whether the process has used up all its reductions or not. If there are reductions left the function is executed otherwise the process is suspended.

Beam Book, Chapter 5.3

This can help you, as it’s not affected by system load so you could make assumptions in your CI about performance. It’s not 1:1 but it helps. Of course, check out Benchee’s docs about it. Biggest shout out goes to Devon for implementing it.

You can simply specify reduction_time and there you go:

list = Enum.to_list(1..10_000)
map_fun = fn i -> [i, i * i] end
"flat_map" => fn -> Enum.flat_map(list, map_fun) end,
"map.flatten" => fn -> list |> |> List.flatten() end
reduction_time: 2
view raw bench.exs hosted with ❤ by GitHub
Operating System: Linux
CPU Information: AMD Ryzen 9 5900X 12-Core Processor
Number of Available Cores: 24
Available memory: 31.27 GB
Elixir 1.13.3
Erlang 24.2.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 2 s
parallel: 1
inputs: none specified
Estimated total run time: 18 s
Benchmarking flat_map …
Benchmarking map.flatten …
Name ips average deviation median 99th %
flat_map 3.52 K 283.95 μs ±10.98% 279.09 μs 500.28 μs
map.flatten 2.26 K 441.58 μs ±20.43% 410.51 μs 680.60 μs
flat_map 3.52 K
map.flatten 2.26 K – 1.56x slower +157.64 μs
Reduction count statistics:
Name Reduction count
flat_map 65.01 K
map.flatten 124.52 K – 1.92x reduction count +59.51 K
**All measurements for reduction count were the same**
view raw output hosted with ❤ by GitHub

It’s worth noting that reduction counts will differ between different elixir and erlang versions – as we often noticed in our own CI setup.

Profile after benchmarking

Another feature that I’d never imagined having in Benchee, but thanks to community suggestions (and implementation!) it came to be. This one in particular was even suggested by José Valim himself – chatting with him he asked if there were plans to include something like this as his workflow would often be:

1. benchmark to see results

2. profile to find improvement opportunities

3. improve code

4. Start again at 1.

Makes perfect sense, I just never thought of it. So, you can now say profile_after: true or even specify a specific profiler + options.

list = Enum.to_list(1..10_000)
map_fun = fn i -> [i, i * i] end
"flat_map" => fn -> Enum.flat_map(list, map_fun) end,
"map.flatten" => fn -> list |> |> List.flatten() end
profile_after: true
view raw benchmark.exs hosted with ❤ by GitHub
Operating System: Linux
CPU Information: AMD Ryzen 9 5900X 12-Core Processor
Number of Available Cores: 24
Available memory: 31.27 GB
Elixir 1.13.3
Erlang 24.2.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking flat_map …
Benchmarking map.flatten …
Name ips average deviation median 99th %
flat_map 3.51 K 284.55 μs ±13.79% 277.29 μs 557.14 μs
map.flatten 2.09 K 477.46 μs ±30.97% 410.71 μs 871.02 μs
flat_map 3.51 K
map.flatten 2.09 K – 1.68x slower +192.91 μs
Profiling flat_map with eprof…
Profile results of #PID<0.237.0>
Total 30004 100.0 6864 0.23
Enum.flat_map/2 1 0.00 0 0.00
anonymous fn/2 in :elixir_compiler_1.__FILE__/1 1 0.00 0 0.00
:erlang.apply/2 1 0.03 2 2.00
:erlang.++/2 10000 17.35 1191 0.12
anonymous fn/1 in :elixir_compiler_1.__FILE__/1 10000 30.29 2079 0.21
Enum.flat_map_list/2 10001 52.33 3592 0.36
Profile done over 6 matching functions
Profiling map.flatten with eprof…
Profile results of #PID<0.239.0>
Total 60007 100.0 9204 0.15 1 0.00 0 0.00
:lists.flatten/1 1 0.00 0 0.00
anonymous fn/2 in :elixir_compiler_1.__FILE__/1 1 0.01 1 1.00
List.flatten/1 1 0.01 1 1.00
:erlang.apply/2 1 0.02 2 2.00
anonymous fn/1 in :elixir_compiler_1.__FILE__/1 10000 16.17 1488 0.15
Enum."-map/2-lists^map/1-0-"/2 10001 26.81 2468 0.25
:lists.do_flatten/2 40001 56.98 5244 0.13
Profile done over 8 matching functions
view raw output hosted with ❤ by GitHub

We didn’t implement the profiling ourselves, but instead we rely on the builtin profiling tasks like this one. To make the feature fully compatible with hooks, I also had to send a small patch to elixir and so after_each hooks won’t work with profiling until it’s released. But, nobody uses hooks anyhow so, who cares? 😛

This feature made it in thanks to Pablo Costas, and his great work. I’m happy to highlight that not only did this contribution give us all a great Benchee feature, but also a friendship to boot. Oh, the wonders of Open Source. 💚

Measurement accuracy on Mac

Now to the least fun part about this release. There is a bugfix, a quite important one at that. Basically on Mac OS previous Benchee versions might report inaccurate results for very fast benchmarks (< 10 microseconds). There are many more musings in this issue, but basically we relied on the operating system clock returning times in a value that it can accurately measure in. Alas, OSX reports in nanoseconds but only has microsecond accuracy (leading to measurements being multiples of 1000). However, even the operating system clock reported nanosecond accuracy – so I even reported a bug on erlang/otp that was thankfully fixed in 22.2.

Fixing this was hard and stressful, which leads nicely into the next major section…

Why it took so long, perfectionism and open source

So, why did it take so long? I blogged earlier today about some of the things that held me back the past 1.5 years in “The Silence Between”. However, you can see that a lot of these features already landed in early 2020, so what gives?

The short answer is the bug above was hard to fix and I needed to fix it. The long answer is… well, long.

I think I could describe myself as a pragmatic perfectionist. I’m happy to implement an MVP, I constantly ask “Do we really need this?” or “Can we make this simpler and deliver it faster?”, but what I end up shipping I want to… well, almost need to be great for what we decided to ship. I don’t want to release with bugs, constant error notifications or barely anything tested. I can make lots of tradeoffs, as long as I decide on them like: Ok we’ll duplicate this code now, as we have no idea what a good abstraction might be and we don’t wanna lock ourselves in. But something misbehaving that I thought was sublime? Oh, the pain.

Why am I highlighting this? Well, Benchee reporting wrong results is frightening to me. Benchee has one core promise, and that promise is to measure your functions as accurately as possible. Also, in my opinion fixing critical bugs such as this one should have the highest priority. I can’t, for myself, justify working on Benchee while not working on that bug. I know, it’s not a great attitude and I should have released the features on main and just released the bug fix later. I do. But I felt like, all energy had to be spent on fixing that bug.

And working on that bug was hard. It’s a Mac only bug and I famously do not own or want to own a Mac. My partner owns one, but when I’m doing Open Source chances are she’s at her computer as well. And then, to investigate something like this, I need a couple of hours of interrupted time with no distractions on my mind as well. I might as well not even start otherwise. It certainly didn’t help that the bug randomly disappeared, when trying to look at it.

The problem that I did not have a Mac to fix this was finally solved when I started a new job, but then first the stress was too high and then my arms were injured (as mentioned in the other blog post). My arms finally got better and I had a good 4h+ to set aside to fix this bug. It can be kind of hard, to get that dedicated time but it’s absolutely needed for an intricate bug such as this one.

So, that’s the major reason it took so long. I mean, it involved finding a bug in Erlang itself. And, me working around that bug which is some code that well… was almost harder to write than the actual fix.

I would be amiss not to mention something else: It’s perfectly fine for Open Source project not to update! Sometimes, they are just done. Or the maintainers have more important things to do. I certainly consider Benchee “done” since 1.0 as it has all features I really wanted it to have. You see, reduction counting and profiler after are great features, but they are hardly essential.

Still, Benchee having a rather important bug for so long really made me feel guilty and bad. Even worse, because I didn’t fix the bug those great contributions from Devon and Pablo were never released. That’s another thing, that’s very important to me: Whoever takes the time to contribute should have a great experience and their contribution should be valued. The ultimate show of appreciation is releasing the feature they worked on is getting it released into people’s hands.

At times those negative feelings (“Oh no there is a bug” & “Oh no these great features lie around unreleased”) paradoxically lead me to stay away from Benchee even more since I felt bad about this state. Yes, it was only on mac and only affected benchmarks where individual function invocations took less than 10 microseconds. But still, that’s the perfectionist in me. This should be fixed within weeks, not 2.5 years. Most certainly, ready to ship features shouldn’t just chill on main for years. Release early, release often.

Anyhow, thanks for reading my musings on Open Source, responsibility, pragmatism and perfectionism. The bug is fixed now, the features are released and I’m happy. Who knows what’s next for Benchee.

Happy benchmarking!

Released: benchee 0.99, 1.0 & friends

It’s finally here – benchee 1.0! 🎉🎉🎉

The first benchee release was almost 3 years ago – it started a mission to improve benchmarking tooling in the elixir eco system. And now we’re not at the goal – after all it’s never done and we’re not short of ideas of what to do.

What’s in a 1.0?

Also called “Why did you take so long to call it 1.0?” – 1.0 for me means a good level of stability. A level where not every second new benchee version all formatters would need updates because they would break otherwise. And in recent releases we have still shuffled major data structures around A LOT (just check all the Breaking Changes (Plugins)). Benchee was mostly stable from a user perspective – but this means it’s less of a risk factor to go ahead and write your own plugins, something that benchee always encouraged/was built to empower. I don’t have any plans for 2.0 right now – all features that I know of can easily be added to the existing structure.

It also means I’m happy with the features. What benchee offers is great, we have:

  •  nano second precise run time measurements
  •  memory measurements
  • rich statistics
  • show information such as CPU, elixir and erlang versions about the system running the benchmarks
  •  support for multiple inputs
  •  hooks to support even unconventional scenarios
  • you can access it all via your CLI, CSV, JSON or HTML (including nice graphs!)
  • and actually a lot more 😉

Benchee might have started out as “I want benchmark-ips in elixir” but it has surpassed it in many ways so that I’d actually want to have benchee in Ruby but that’s another topic. However, that makes me proud of what we accomplished.

With that amount of polish I can also easily sit back and not work on benchee for some time because I know it’s good – it is “done” in the sense that it can do everything I wanted it to do when I started the project (and even more!).

As for what is actually in it mostly removing deprecations. You can check out the Changelog.

What’s 0.99?

I found it nice how rspec did their 2.99 –> 3.0 switch – get it to run on 2.99 without deprecation warnings and then you can safely use 3.0. That was a great user experience. Ember.js handles their major versions similarly. Now, benchee is nowhere near as complex as those 2 but we thought providing that nicety would still be great.


As mentioned before 0.99/1.0 don’t actually include many features – the previous 0.14.0 release from about a month ago was very feature packed. These releases are a lot about polish. Redoing the documenation, updating names, fixing typespecs, being more careful about what is and isn’t exposed in the public interface.

A small but important feature made it in though – displaying the absolute difference between measurements:

flat_map           2.34 K
map.flatten        1.22 K - 1.92x slower +393.09 μs

See that little+393.09 μs? It’s how much slower it was on average in absolute terms. With these comparisons people often focus too much on “OMG it’s almost 2 times as slow!!!” but this number helps put it into context: It’s not even half a millisecond. If you only do this once in a web request the difference likely doesn’t matter. It’s a calculation I always did in my head, I’m happy to make it easily accessible for everyone.

Along with this patch those values were added to our Statistics struct – including the “x-times slower” values, which means formatters no longer have to implement this themselves! Hooray!

We’re an org now!

An astute observer might have seen that all my benchee repos have been moved to the github organization bencheeorg. What’s that all about? It’s mostly a tribute to benchee not being a personal project but a community project. Many people have contributed massively to benchee, most notably Devon and Eric. Without Devon we probably still wouldn’t have memory measurements and without Eric our unit scaling wouldn’t be as great as it is. Others such as Michał and OvermindDL1 have also contributed a lot through ideas, testing and help (especially with memory measurements :)). Feels wrong to keep the repositories attached to a single person.

Also, should anything happen to me (which I hope won’t happen), the others could still add people to the organization and carry on.

It also helps with another problem I’ve had: I want to extract small useful libraries from benchee: Statistics (introduced by me), System Information gathering (introduced by Devon) and unit scaling (introduced by Eric) – where do I put these repos? All under their own name space? All under my name space? Nah, I put them in the benchee organization where we share ownership – that’s where they belong.

The future of benchee

As I said benchee isn’t done – there is an open PR to add reference jobs which didn’t make it into the release. We’d like to add more types of memory measurements, as well as measuring reductions, incorporating profiling right after benchmarking to drill down on those bottle necks sounds great, more compact console output and also include the benchmarking code itself in the suite so that formatters could display it. Finally, now might finally be the time to brush up on meta programming and write that DSL wrapper that people apparently want.

Help with all of those is very welcome. Personally, I’m really itching to extract these libraries I mentioned – let’s see about that. Also to showcase benchee with some nice benchmarks – after all what good is a great benchmarking tool if you rarely use it?

Benchee 0.14.0 – Micro Benchmarks? Pah, how about Nano Benchmarks!

Long time since the last benchee release, heh? Well, this one really packs a punch to compensate! It brings you a higher precision while measuring run times as well as a better way to specify formatter options. Let’s dive into the most notable changes here, the full list of changes can be found in the Changelog.

Of course, all formatters are also released in compatible versions.

Nanosecond precision measurements

Or in other words making measurements 1000 times more precise 💥

This new version gives you much more precision which matters especially if you benchmark very fast functions. It even enables you to see when the compiler might completely optimize an operation away. Let’s take a look at this in action:

range = 1..10
integer1 = :rand.uniform(100)
integer2 = :rand.uniform(100)
"Integer addition (wrong)" => fn -> 1 + 1 end,
"Integer addition" => fn -> integer1 + integer2 end,
"String concatention (wrong)" => fn -> "1" <> "1" end,
"adding a head to an array (wrong)" => fn -> [1 | [1]] end,
"++ array concat (wrong)" => fn -> [1] ++ [1] end,
"noop" => fn -> 0 end,
"" => fn ->, fn i -> i end) end
time: 1,
warmup: 1,
memory_time: 1,
formatters: [{Benchee.Formatters.Console, extended_statistics: true}]
view raw fast.exs hosted with ❤ by GitHub
tobi@speedy:~/github/benchee$ mix run samples/fast_functions.exs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Number of Available Cores: 8
Available memory: 15.61 GB
Elixir 1.8.0
Erlang 21.2.2
Benchmark suite executing with the following configuration:
warmup: 1 s
time: 1 s
memory time: 1 s
parallel: 1
inputs: none specified
Estimated total run time: 21 s
Benchmarking ++ array concat (wrong)…
Benchmarking Integer addition…
Benchmarking Integer addition (wrong)…
Benchmarking String concatention (wrong)…
Benchmarking adding a head to an array (wrong)…
Benchmarking noop…
Name ips average deviation median 99th %
String concatention (wrong) 814.22 M 1.23 ns ±2868.77% 0 ns 21 ns
++ array concat (wrong) 749.67 M 1.33 ns ±2705.83% 0 ns 23 ns
noop 639.06 M 1.56 ns ±2388.12% 0 ns 35 ns
adding a head to an array (wrong) 553.47 M 1.81 ns ±2228.78% 0 ns 29 ns
Integer addition (wrong) 544.93 M 1.84 ns ±2803.80% 0 ns 31 ns
Integer addition 179.88 M 5.56 ns ±737.19% 4 ns 39 ns 2.30 M 435.06 ns ±2872.78% 356 ns 667 ns
String concatention (wrong) 814.22 M
++ array concat (wrong) 749.67 M – 1.09x slower
noop 639.06 M – 1.27x slower
adding a head to an array (wrong) 553.47 M – 1.47x slower
Integer addition (wrong) 544.93 M – 1.49x slower
Integer addition 179.88 M – 4.53x slower 2.30 M – 354.23x slower
Extended statistics:
Name minimum maximum sample size mode
String concatention (wrong) 0 ns 9219 ns 1.54 M 0 ns
++ array concat (wrong) 0 ns 17501 ns 1.54 M 0 ns
noop 0 ns 9220 ns 1.53 M 0 ns
adding a head to an array (wrong) 0 ns 23216 ns 1.54 M 0 ns
Integer addition (wrong) 0 ns 16040 ns 1.52 M 0 ns
Integer addition 0 ns 9818 ns 1.52 M 4 ns 335 ns 7385903 ns 952.30 K 354 ns
Memory usage statistics:
Name Memory usage
String concatention (wrong) 0 B
++ array concat (wrong) 0 B
noop 0 B
adding a head to an array (wrong) 0 B
Integer addition (wrong) 0 B
Integer addition 0 B 424 B
**All measurements for memory usage were the same**
view raw output hosted with ❤ by GitHub

You can see that the averages aren’t 0 ns because sometimes the measured run time is very high – garbage collection and such. That’s also why the standard deviation is huge (big difference from 0 to 23000 or so). However, if you look at the median (basically if you sort all measured values, it’s the value is in the middle) and the mode (the most common value) you see that both of them are 0. Even the accompanying memory measurements are 0. Seems like there isn’t much happening there.

So why is that? The compiler optimizes these “benchmarks” away, because they evaluate to one static value that can be determined at compile time. If you write 1 + 1 – the compiler knows you probably mean 2. Smart compilers. To avoid these, we have to trick the compiler by randomizing the values, so that they’re not clear at compile time (see the “right” integer addition).

That’s the one thing we see thanks to our more accurate measurements, the other is that we can now measure how long a map over a range with 10 elements takes (which is around 355 ns for me (I trust the mode and median more her than the average).

How did we accomplish this? Well it all started looking into why measurements on Windows seemed to be weird. We noticed that the implementation of had hard coded the values to be measured in micro seconds:

tc(F) ->
T1 = erlang:monotonic_time(),
Val = F(),
T2 = erlang:monotonic_time(),
Time = erlang:convert_time_unit(T2 T1, native, microsecond),
{Time, Val}.
view raw timer.erl hosted with ❤ by GitHub

But, in fact nanoseconds are supported! So we now have our own simple time measuring code. This is operating system dependent though, as the BEAM knows about native time units. To the best of our knowledge nanosecond precision is available on Linux and MacOS – not on Windows.

It wasn’t just enough to switch to nano second precision though. See, once you get down to nanoseconds the overhead of simply invoking an anonymous function (which benchee needs to do a lot) becomes noticeable. On my system this overhead is 78 nanoseconds. To compensate, benchee now measures the function call overhead and deducts it from the measured times. That’s how we can achieve measurements of 0ns above – all the code does is return a constant as the compiler optimized it away as the value can be determined at compile time.

A nice side effect is that the overhead heavy function repetition is practically not used anymore on Linux and macOS as no function is faster than nanoseconds. Hence, no more imprecise measurements due to function repetition to make it measurable at all (on Windows we still repeat the function call for instance 100 times and then divide the measured time by this).

Formatter Configuration

This is best shown with an example, up until now if you wanted to pass options to any of the formatters you had to do it like this:
"function" => fn -> something end
formatters: [
formatter_options: [
html: [file: "output/my.html", auto_open: false, inline_assets: true]
view raw benchee.exs hosted with ❤ by GitHub

This always felt awkward to me, but it really hit hard when I watched a benchee video tutorial. There the presenter said “…here we configure the formatter to be used and then down here we configure where it should be saved to…” – why would that be in 2 different places? They could be far apart in the code. There is no immediate visible connection between Benchee.Formatters.HTML and the html: down in the formatter_options:.  Makes no sense.

That API was never really well thought out, sadly.
So, what can we do instead? Well of course, bring the options closer together:
"function" => fn -> something end
formatters: [
{Benchee.Formatters.HTML, file: "output/my.html", auto_open: false, inline_assets: true},

So, if you want to pass along options instead of just specifying the module, you specify a tuple of module and options. Easy as pie. You know exactly what formatter the options belong to.

Road to 1.0?

Honestly, 1.0 should have happened many versions ago. Right now the plan is for this to be the last release with user facing features. We’ll mingle the data structure a bit more (see the PR if interested), then put in deprecation warnings for functionality we’ll remove and call it 0.99. Then, remove deprecated functionality and call it 1.0. So, this time indeed – it should be soon ™. I have a track record of sneaking in just one more thing before 1.0 though 😅. You can track our 1.0 progress here.

Why did this take so long?

Looking at this release it’s pretty packed. It should have been 2 releases (one for every major feature described above) that should have happened much sooner.

It’s definitely sad, I double checked: measuring with best available precision landed 21st of May and function call overhead measurement was basically done 27th of June. And the formatter options landed 10th of August. Keeping those out of your hands for so long really saddens me 😖.

Basically, these required updating the formatters, which isn’t particularly fun, but necessary as I want all formatters to be ready to release along a new benchee version. In addition, we put in even more work (specifically Devon in big parts) and added support for memory measurements to all the formatters.

Beyond this? Well, I think life. Life happened. I moved apartments, which is a bunch  of work. Then a lot of things happened at work leading to me eventually quitting my job. Some times there’s just no time or head space for open source. I’m happy though that I’m as confident as one can be in that benchee is robust and bug free software, so that I don’t have to worry about it breaking all the time. I can already see this statement haunting me if this release features numerous weird bugs 😉

In that vain, hope you enjoy the new benchee version – happy to hear feedback, bugs or feature ideas!

And because you made it so far, you deserve an adorable bunny picture:


Released: benchee 0.10, HTML, CSV and JSON plugins

It’s been a little time since the last benchee release, have we been lazy? Au contraire mes ami! We’ve been hard at work, greatly improving the internals, adding a full system for hooks (before_scenarion, before_each, after_each, after_scenario) and some other great improvements thanks to many contributions. The releases are benchee 0.10.0 (CHANGELOG), benchee_csv 0.7.0 (CHANGELOG), benchee_html 0.4.0 (CHANGELOG) and benchee_json 0.4.0 (CHANGELOG).

Sooo… what’s up? Why did it take so long?


Before we take a look at the exciting new features, here’s a small summary of major things that happened in previous releases that I didn’t manage to blog about due to lack of time:

0.7.0 added mainly convenience features, but benchee_html 0.2.0 split up the HTML reports which made it easier to find what you’re looking for but also alleviated problems with rendering huge data sets (the graphing library was reaching its limits with that many graphs and input values)

0.8.0 added type specs for the major public functions, configuration is now a struct so errors out on unrecognized options

0.9.0 is one of my favorite releases as it now gathers and shows system data like number of cores, operating system, memory and cpu speed. I love this, because normally when I benchmark I and write about it I need to write it up in the blog post. Now with benchee I can just copy & paste the output and I get all the information that I need! This version also facilitates calling benchee from Erlang, so benchee:run is in the cards.

Now ahead, to the truly new stuff:


In benchee each processing step used to have its own main key in the main data structure (suite): run_times, statistics, jobs etc. Philosophically, that was great. However, it got more cumbersome in the formatters especially after the introduction of inputs as access now required an additional level of indirection (namely, the input). As a result, to get all the data for a combination of job and input you want to format you have got to merge the data of multiple different sources. Not exactly ideal. To make matters worse, we want to add memory measurements in the future… even more to merge.

Long story short, Devon and I sat down in person for 2 hours to discuss how to best deal with this, how to name it and all accompanying fields. We decided to keep all the data together from now on – for every entry of the result. That means each combination of a job you defined and an input. The data structure now keeps that along with its raw run times, statistics etc. After some research we settled on calling it a scenario.

defmodule Benchee.Benchmark.Scenario do
@moduledoc """
A Scenario in Benchee is a particular case of a whole benchmarking suite. That
is the combination of a particular function to benchmark (`job_name` and
`function`) in combination with a specific input (`input_name` and `input`).
It then gathers all data measured for this particular combination during
`Benchee.Benchmark.measure/3` (`run_times` and `memory_usages`),
which are then used later in the process by `Benchee.Statistics` to compute
the relevant statistics (`run_time_statistics` and `memory_usage_statistics`).
@type t :: %__MODULE__{
job_name: binary,
function: fun,
input_name: binary | nil,
input: any | nil,
run_times: [float] | [],
run_time_statistics: Benchee.Statistics.t | nil,
memory_usages: [non_neg_integer] | [],
memory_usage_statistics: Benchee.Statistics.t | nil,
before_each: fun | nil,
after_each: fun | nil,
before_scenario: fun | nil,
after_scenario: fun | nil

This was a huge refactoring but we really like the improvements it yielded. Devon wrote about the refactoring process in more detail.

It took a long time, but it didn’t add any new features – so no reason for a release yet. Plus, of course all formatters also needed to get updated.


Another huge chunk of work went into a hooks system that is pretty fully featured. It allows you to execute code before and after invoking the benchmark as well as setup code before a scenario starts running and teardown code for after a scenario stopped running.

That seems weird, as most of the time you won’t need hooks. We could have released with part of the system ready, but I didn’t want to (potentially) break API again and so soon if we added arguments or found that it wasn’t quite working to our liking. So, we took some time to get everything in.

So what did we want to enable you to do?

  • Load a record from the database in before_each and pass it to the benchmarking function, to perform an operation with it without counting the time for loading the record towards the benchmarking results
  • Start up a process/service in before_scenario that you need for your scenario to run, and then…
  • …shut it down again in after_scenario, or bust a cache
  • Or if you want your benchmarks to run without a cache all the time, you can also bust it in before_each or after_each
  • after_each is also passed the return value of the benchmarking function so you can run assertions on it – for instance for all the jobs to see if they are truly doing the same thing
  • before_each could also be used to randomize the input a bit to benchmark a more diverse set of inputs without the randomizing counting towards the measured times

All of these hooks can be configured either globally so that they run for all the benchmarking jobs or they can be configured on a per job basis. The documentation for hooks over at the repo is a little blog post by itself and I won’t repeat it here 😉

As a little example, here is me benchmarking hound:

# ATTENTION: gotta start phantomjs via `phantomjs –wd` first..
{:ok, server} = SimpleServer.start
Application.put_env(:hound, :app_host, "http://localhost&quot;)
Application.put_env(:hound, :app_port, SimpleServer.port(server))
use Hound.Helpers{
"fill_in text_field" => fn ->
fill_field({:name, "user[name]"}, "Chris")
"visit forms" => fn ->
"find by css #id" => fn ->
find_element(:id, "button-no-type-id")
time: 18,
formatters: [
html: [file: "benchmarks/html/hound.html"],
before_scenario: fn(input) ->
after_scenario: fn(_return) ->

Hound needs to start before we can benchmark it. Howeer, hound seems to remember the started process by the pid of self() at that time. That’s a problem because each benchee scenario runs in its own process, so you couldn’t just start it before invoking I found no way to make the benchmark work with good old benchee 0.9.0, which is also what finally brought me to implement this feature. Now in benchee 0.10.0 with before_scenario and after_scenario it is perfectly feasible!

Why no 1.0?

With all the major improvements one could easily call this a 1.0. Or 0.6.0 could have been a 1.0 then we’d be at 2.0 now – wow that sounds mature!

Well, I see 1.0 as a promise – a promise for plugin developers and others that compatibility won’t be broken easily and not soon. Can’t promise this when we just broke plugin compatibility in a major way. That said, I really feel good about the new structure, partly because we put so much time and thought into figuring it out, but also because it has greatly simplified some implementations and thinking about some future features it also makes them a lot easier to implement.

Of course, we didn’t break compatibility for users. That has been stable since 0.6.0 and to a (quite big) extent beyond that.

So, 1.0 will of course be coming some time. We might get some more bigger features in that could break compatibility (although I don’t think they will, it will just be new fields):

  • Measuring memory consumption
  • recording and loading benchmarking results
  • … ?

Also before a 1.0 release I probably want to extract more not directly benchmarking related functionality from benchee and provide as general purpose libraries. We have some sub systems that we build for us and would provide value to other applications:

  • Unit: convert units (durations, counts, memory etc.), scale them to a “best fit” unit, format them accordingly, find a best fit unit for a collection of values
  • Statistics: All the statistics we provide including not so easy/standard ones like nth percentile and mode
  • System: gather system data like elixir/erlang version, CPU, Operating System, memory, number of cores

Thanks to the design of benchee these are all already fairly separate so extracting them is more a matter of when, not how. Meaning, that we have all the functionality in those libraries that we need so that we don’t have to make a coordinated release for new features across n libraries.



Especially due to many great community contributions (maybe because of Hacktoberfest?) there’s a number of stellar improvements!

  • System information is now also available and you can toggle it with the link in the top right
  • unit scaling from benchee “core” is now also used so it’s not all in micro seconds as before but rather an appropriate unit
  • reports are automatically opened in your browser after the formatter is done (can of course be deactivated)
  • there is a default file name now so you don’t HAVE to supply it

What’s next?

Well this release took long – hope the next one won’t take as long. There’s a couple of improvements that didn’t quite make it into the release so there might be a smaller new release relatively soon. Other than that, work on either serializing or the often requested “measure memory consumption” will probably start some time. But first, we rest a bit 😉

Hope you enjoy benchmarking and if you are missing a feature or getting hit by a bug, please open an issue



Benchee 0.4.0 released – adjust what is printed

Today I made a little 0.4.0 release of my elixir benchmarking library benchee. As always the Changelog has all the details.

This release mainly focusses on making all non essential output that benchee produces optional. This is mostly rooted in user feedback of people who wanted to disable the fast execution warnings or the comparison report. I decided to go full circle and also make it configurable if benchee prints out which job it is currently benchmarking or if the general configuration information is printed. I like this sort of verbose information and progress feedback – but clearly it’s not to everyone’s taste and that’s just fine 🙂

So what’s next for benchee? As a keen github observer might have noticed I’ve taken a few stabs at rendering charts in HTML + JS for benchee and in the process created benchee_json. I’m a bit dissatisfied as of now, as I’d really want to have graphs showing error bars and that seems to be harder to come by than I thought. After D3 and chart.js I’ll probably give highcharts a stab now. However, just reading the non-commercial terms again I’m not too sure if it’s good in all sense (e.g. what happens if someone in a commercial corporation uses and generates the HTML?). Oh, but the wonders of the Internet in a new search I found plotly which seems to have some great error bars support.

Other future plans include benchmarking with multiple input sizes to see how different approaches perform or the good old topic of lessening the impact of garbage collection 🙂


Benchee 0.3.0 released – formatters, parallel benchmarking & more

Yesterday I released benchee 0.3.0! Benchee is a tool for (micro) benchmarking in elixir focussing on being simple, extensible and to provide you with good statistics. You can refer to the Changelog for detailed information about the changes. This post will look at the bigger changes and also give a bit of the why for the new features and changes.

Multiple formatters

Arguably the biggest feature in Benchee 0.3.0 is that it is now easy and built-in to configure multiple formatters for a benchmarking suite. This means that first the benchmark is run, and then multiple formatters are run on the benchmarking results. This way you can get both the console output and the corresponding csv file using BencheeCSV. This was a pain point for me before, as you could either get one or the other or you needed to use the more verbose API.

list = Enum.to_list(1..10_000)
map_fun = fn(i) -> [i, i * i] end
formatters: [
csv: %{file: "my.csv"}
"flat_map" => fn -> Enum.flat_map(list, map_fun) end,
"map.flatten" => fn -> list |> |> List.flatten end

You can also see the new output/1 methods at work, as opposed to format/1 they also really do the output themselves. BencheeCSV uses a custom configuration options to know which file to write to. This is also new, as now formatters have access to the full benchmarking suite, including configuration, raw run times and function definitions. This way they can be configured using configuration options they define themselves, or a plugin could graph all run times if it wanted to.

Of course, formatters default to just the built-in console formatter.

Parallel benchmarking

Another big addition is parallel benchmarking. In Elixir, this just feels natural to have. You can specify a parallel key in the configuration and that tells Benchee how many tasks should execute any given benchmarking job in parallel.

list = Enum.to_list(1..10_000)
map_fun = fn(i) -> [i, i * i] end{time: 3, parallel: 2}, %{
"flat_map" => fn -> Enum.flat_map(list, map_fun) end,
"map.flatten" => fn -> list |> |> List.flatten end

Of course, if you want to see how a system behaves under load – overloading might be exactly what you want to stress test the system. And this was exactly the reason why Leon contributed this change back to Benchee:

I needed to benchmark integration tests for a telephony system we wrote – with this system the tests actually interfere with each other (they’re using an Ecto repo) and I wanted to see how far I could push the system as a whole. Making this small change to Benchee worked perfectly for what I needed 🙂

(Of course it makes me extremely happy that people found adjusting Benchee for their use case simple, that’s one of the main goals of Benchee. Even better that it was contributed back ❤ )

If you want to see more information and detail about “to benchmark in parallel or not” you can check the Benchee wiki. Spoiler alert: The more parallel benchmarks run, the slower they get to an acceptable degree until the system is overloaded (more tasks execute in parallel than there are CPU cores to take care of them). Also deviation skyrockets.

While the effect seems not to be very significant for parallel: 2 on my system, the default in Benchee remains parallel: 1 for the mentioned reasons.

Print configuration information

Partly also due to the parallel change, Benchee wil now print a brief summary of the benchmarking suite before executing it.

tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs

Benchmark suite executing with the following configuration:
warmup: 2.0s
time: 3.0s
parallel: 2
Estimated total run time: 10.0s

Benchmarking flat_map...
Benchmarking map.flatten...

Name&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ips&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; average&nbsp;&nbsp;&nbsp; deviation&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; median
map.flatten&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1268.15&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 788.55μs&nbsp;&nbsp;&nbsp; (±13.94%)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 759.00μs
flat_map&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 706.35&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1415.72μs&nbsp;&nbsp;&nbsp;&nbsp; (±8.56%)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1419.00μs

map.flatten&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1268.15
flat_map&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 706.35 - 1.80x slower

This was done so that when people share their benchmarks online one can easily see the configuration they ran it with. E.g. was there any warmup time? Was the amount of parallel tasks too high and therefore the results are that bad?

It also prints an estimated total run time (number of jobs * (warmup + time)), so you know if there’s enough time to go and get a coffee before a benchmark finishes.

Map instead of a list of tuples

What is also marked as a “breaking” change in the Changelog is actually not THAT breaking. The main data structure handed to was changed to a map instead of a list of tuples and all corresponding data structures changed as well (important for plugins to know).

It used to be a list of tuples because of the possibility that benchmarks with the same name would override each other. However, having benchmarks with the same name is nonsensical as you can’t discern their results in the output any way. So, this now feels like a much more fitting data structure.

The old main data structure of a list of tuples still works and while I might remove it, I don’t expect me to right now as all that is required to maintain it is 4 lines of code. This makes duplicated names no longer working the only real deprecation, although one might even call it a feature 😉

Last, but not least, this release is the first one that got some community contributions in, which makes me extremely happy. So, thanks Alvin and Leon! 😀

Shoes4 preview – the more personal release notes

As you might have noticed a big project I’ve been working on for almost 2 years got its first release on the 10th of May a bit more than 2 weeks ago. You can go read the official release announcement. This one is a bit more personal 😉


It’s been a long road from the 24th of May 2012 when the shoes community got together to concentrate efforts on the complete rewrite we call shoes4 now. That is a long time. It’s a time in which we made more than 2000 commits and closed nearly 600 issues.

Why did it take us so long? Well rewrites are difficult… there is an older piece of software in place which kind of does what you want but not really. Still if you release your new version has to be somewhat good enough. With this preview release we feel that all the major building blocks of the Shoes DSL itself are in place. Finally.

This release is really important to me. You know, it’s super hard to work on a project for so long without any release. Without any users. Other than your co-contributors there really aren’t many playing with what you build. Not many people to report bugs. Not many people to build something awesome. Not many people telling you that what you do actually matters. Sometimes this makes it a bit hard to motivate yourself. Therefore I want to thank everyone who during that time encouraged anyone of us, wrote an email saying that shoes is awesome or even grabbed a release straight from github and tried it out. Every time that happened it put a big smile on my face and motivated me to put in a couple of hours of extra work. I you!

Hackers gonna hack

Of course a release doesn’t mean that people are really using what you build. I tried to make a little effort writing the announcement blog post and sending out some announcement emails. So far we have a bit over 260 gem downloads, which is good I guess 🙂 There will hopefully be a wider coverage if we get a release out.

Of course, this is just  preview release. Nothing stable. We’re hard at work and already got some nice new bug fixes and improvements lined up. Stay tuned for the preview2 release and subsequent releases until we hit a release candidate!


Last but not least I want to thank everyone who ever contributed to shoes4 – not only code but reporting issues, just trying stuff out. Thank you really, your support means a lot! Also to whoever funds me or funded me on gittip – thank you!

More personalized thanks go out to Eric Watson and Jason R. Clark! Eric has probably been the most steady contributor to shoes4. He’s been there from the start, still is, still going strong. And hopefully will for a long time 🙂 Recently he’s been hard at work converting our specs to the rspec3 syntax. Jason on the other hand is a more recent addition to the team – his first commits date back to around September 2013. Nevertheless he’s been pretty hard at work solving vital issues and hard to crack bugs. Lately he’s been hunting down operating system handles we didn’t free up!

What impresses me most about the two of them is that they both have a family and a job but still find the time to work on open source. I hope I’ll be the same once I’m at that point in my life! I mean Eric even has 5 children! FIVE! I can’t even imagine what that’s like 🙂 . But that’s super fun too. I’ll never forget remote pair programming with him with one of his kids running around in the background and waiving. Or Jason attributing the first open source contribution of his daughter on shoes. All super fun memories.

So everybody, shoes is coming. Try the preview release. Or wait for the next one. Or the release candidate. No matter what. Shoes is coming!

Shoes on!


after_do 0.3.0 released

I just released version 0.3.0 of my little aspect oriented programming/adding callbacks to methods library after_do! You can find an introduction here.

So what is in 0.3.0? Basically 2 things:

  • after_do now works properly with modules, meaning you can attach callbacks to the methods of a module and objects of classes including those methods will call them!
  • Fixed bugs around inheritance where it could happen that a block might get called too often or not at all

At the same time this release was able to delete code and remove complexity while improving functionality. How is that possible? Well thanks to block scoping it is!

One problem I always had is to figure out callbacks of which class to execute in combination with inheritance. You know – self is always the current object and callbacks might be defined in super classes. I wanted to have a way to know which class the currently called method is defined in, not what the class of the current object is. Luckily there is one point where I know that – the moment when I add the callbacks (since they are added on that exact class/module). So we just need to save it:

callback_klazz = self
define_method method do |*args|
  callback_klazz.send(:_after_do_execute_callbacks, :before, method, self, *args)
  return_value = send(alias_name, *args)
  callback_klazz.send(:_after_do_execute_callbacks, :after, method, self, *args)

Simple yet powerful.

Enjoy the 0.3 release of after_do 🙂