Benchee 0.5.0 released – unit scaling for your convenience!

Another month and another release of my Elixir (micro-) benchmarking library benchee. This particular release focusses on unit scaling. E.g. instead of displaying that you had 12_345_678 iterations per second it will now say that there were 12.35 M iterations per second. The same goes for the time as well. There are four different strategies to choose from determining how units should be scaled. My friend and old Shoes (Ruby tooklkit/DSL for building GUIs) companion Eric Watson aka @wasnotrice did the bulk of the work. Thanks!

As usual, the nitty-gritty details are in the Changelog.

Why unit scaling?

The units employed so far were not ideal. Who really works with microseconds all the time and like to read full numbers over a million while only the first couple of places really have an impact? I think it’s easier to work with units closer to what a number really is. If something takes 5_632 microseconds to execute I’m much better off knowing that it takes about 5.63 milliseconds.

So from now on benchee will use one of its four strategies (one of which is none, if you don’t like this behaviour at all) to determine what the best unit to represent the benchmarking results in might be. For the canonical flat_map vs. map.flatten example the result might look like this:

See how the units were automatically scaled to thousands/milliseconds respectively? Now, you might not like that because you always want there to at least be a “1” before the dot. No problem, just use another scaling strategy: smallest!

This is now (in this case) pretty much like the output you’d get in previous benchee versions. Still, smallest is different from none in that if both averages were at least a millisecond they would still be displayed in milliseconds.

Under the hood this is all nicely handled by units (Count, Duration) implementing the Scale and Format behaviours while relying on a Unit struct.

What’s next for benchee?

The next bigger topic that I’ve put quite some time and experiments in is an HTML formatter with fancy graphs and an image export. Want a sneak-peak? Ok, since you asked nicely:

IPS comparisonBoxplot

Video: What did AlphaGo do to beat the strongest human Go player?

The publishing/video partner of Full Stack Fest was amazingly fast in publishing the video. Kudos to them! So after publishing the slides here goes the video!

If you want to have the slides, here they are ( or via links PDF, Speakerdeck, Slideshare):

In case you want to see it live, the talk will be up again at Codemotion Berlin.


This year AlphaGo shocked the  world by decisively beating the strongest human Go player, Lee Sedol. An accomplishment that wasn’t expected for years to come. How did AlphaGo do this? What algorithms did it use? What advances in AI made it possible? This talk will briefly introduce the game of Go, followed by the techniques and algorithms used by AlphaGo to answer these questions.

PS: Yes, Lee Sedol probably wasn’t THE STRONGEST human player – more like Top 3 or Top 5 at the time of the game (most people would probably call Ke Jie the strongest player at the moment). Lee Sedol is the dominant of the last decade though, and when the match was announced nobody on the computer-go mailing list complained about the opponent, so I just assumed he was the strongest or among the strongest but only found out after submitting the talk 🙂 . Plus, sadly “How did AlphaGo beat one of the top 5 Go Players” isn’t as catchy as a title.

Benchee 0.4.0 released – adjust what is printed

Today I made a little 0.4.0 release of my elixir benchmarking library benchee. As always the Changelog has all the details.

This release mainly focusses on making all non essential output that benchee produces optional. This is mostly rooted in user feedback of people who wanted to disable the fast execution warnings or the comparison report. I decided to go full circle and also make it configurable if benchee prints out which job it is currently benchmarking or if the general configuration information is printed. I like this sort of verbose information and progress feedback – but clearly it’s not to everyone’s taste and that’s just fine 🙂

So what’s next for benchee? As a keen github observer might have noticed I’ve taken a few stabs at rendering charts in HTML + JS for benchee and in the process created benchee_json. I’m a bit dissatisfied as of now, as I’d really want to have graphs showing error bars and that seems to be harder to come by than I thought. After D3 and chart.js I’ll probably give highcharts a stab now. However, just reading the non-commercial terms again I’m not too sure if it’s good in all sense (e.g. what happens if someone in a commercial corporation uses and generates the HTML?). Oh, but the wonders of the Internet in a new search I found plotly which seems to have some great error bars support.

Other future plans include benchmarking with multiple input sizes to see how different approaches perform or the good old topic of lessening the impact of garbage collection 🙂


Slides: What did AlphaGo do to beat the strongest human Go player?

I gave this talk at Full Stack Fest (achievement unlocked!) and I practised it before at Strange Group. It’s designed to not really require any previous knowledge (Go, Monte Carlo Tree Search and Neural Networks are all introduced). It was a lot of fun putting together and so far the feedback has also been great.

Additional shout out to the strange group folks who helped cut some content so that I landed perfectly on the 40 minutes mark 🙂

In case you want to see it live, the talk will be up again at Codemotion Berlin.


This year AlphaGo shocked the  world by decisively beating the strongest human Go player, Lee Sedol. An accomplishment that wasn’t expected for years to come. How did AlphaGo do this? What algorithms did it use? What advances in AI made it possible? This talk will briefly introduce the game of Go, followed by the techniques and algorithms used by AlphaGo to answer these questions.


Benchee 0.3.0 released – formatters, parallel benchmarking & more

Yesterday I released benchee 0.3.0! Benchee is a tool for (micro) benchmarking in elixir focussing on being simple, extensible and to provide you with good statistics. You can refer to the Changelog for detailed information about the changes. This post will look at the bigger changes and also give a bit of the why for the new features and changes.

Multiple formatters

Arguably the biggest feature in Benchee 0.3.0 is that it is now easy and built-in to configure multiple formatters for a benchmarking suite. This means that first the benchmark is run, and then multiple formatters are run on the benchmarking results. This way you can get both the console output and the corresponding csv file using BencheeCSV. This was a pain point for me before, as you could either get one or the other or you needed to use the more verbose API.

You can also see the new output/1 methods at work, as opposed to format/1 they also really do the output themselves. BencheeCSV uses a custom configuration options to know which file to write to. This is also new, as now formatters have access to the full benchmarking suite, including configuration, raw run times and function definitions. This way they can be configured using configuration options they define themselves, or a plugin could graph all run times if it wanted to.

Of course, formatters default to just the built-in console formatter.

Parallel benchmarking

Another big addition is parallel benchmarking. In Elixir, this just feels natural to have. You can specify a parallel key in the configuration and that tells Benchee how many tasks should execute any given benchmarking job in parallel.

Of course, if you want to see how a system behaves under load – overloading might be exactly what you want to stress test the system. And this was exactly the reason why Leon contributed this change back to Benchee:

I needed to benchmark integration tests for a telephony system we wrote – with this system the tests actually interfere with each other (they’re using an Ecto repo) and I wanted to see how far I could push the system as a whole. Making this small change to Benchee worked perfectly for what I needed 🙂

(Of course it makes me extremely happy that people found adjusting Benchee for their use case simple, that’s one of the main goals of Benchee. Even better that it was contributed back ❤ )

If you want to see more information and detail about “to benchmark in parallel or not” you can check the Benchee wiki. Spoiler alert: The more parallel benchmarks run, the slower they get to an acceptable degree until the system is overloaded (more tasks execute in parallel than there are CPU cores to take care of them). Also deviation skyrockets.

While the effect seems not to be very significant for parallel: 2 on my system, the default in Benchee remains parallel: 1 for the mentioned reasons.

Print configuration information

Partly also due to the parallel change, Benchee wil now print a brief summary of the benchmarking suite before executing it.

tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs

Benchmark suite executing with the following configuration:
warmup: 2.0s
time: 3.0s
parallel: 2
Estimated total run time: 10.0s

Benchmarking flat_map...
Benchmarking map.flatten...

Name                  ips        average    deviation         median
map.flatten       1268.15       788.55μs    (±13.94%)       759.00μs
flat_map           706.35      1415.72μs     (±8.56%)      1419.00μs

map.flatten       1268.15
flat_map           706.35 - 1.80x slower

This was done so that when people share their benchmarks online one can easily see the configuration they ran it with. E.g. was there any warmup time? Was the amount of parallel tasks too high and therefore the results are that bad?

It also prints an estimated total run time (number of jobs * (warmup + time)), so you know if there’s enough time to go and get a coffee before a benchmark finishes.

Map instead of a list of tuples

What is also marked as a “breaking” change in the Changelog is actually not THAT breaking. The main data structure handed to was changed to a map instead of a list of tuples and all corresponding data structures changed as well (important for plugins to know).

It used to be a list of tuples because of the possibility that benchmarks with the same name would override each other. However, having benchmarks with the same name is nonsensical as you can’t discern their results in the output any way. So, this now feels like a much more fitting data structure.

The old main data structure of a list of tuples still works and while I might remove it, I don’t expect me to right now as all that is required to maintain it is 4 lines of code. This makes duplicated names no longer working the only real deprecation, although one might even call it a feature 😉

Last, but not least, this release is the first one that got some community contributions in, which makes me extremely happy. So, thanks Alvin and Leon! 😀

Tail Call Optimization in Elixir & Erlang – not as efficient and important as you probably think

(Automatic) Tail Call Optimization (TCO) is that great feature of Elixir and Erlang that everyone tells you about. It’s super fast, super cool and you should definitely always aim to make every recursive function tail-recursive. What if I told you that body-recursive functions can be faster and more memory efficient than their especially optimized tail-recursive counter parts?

Seems unlikely, doesn’t it? After all every beginners book will mention TCO, tell you how efficient it is and that you should definitely use it. Plus, maybe you’ve tried body-recusion before in language X and your call stack blew up or it was horrendously slow. I did and I thought tail-recursive functions were always better than body-recursive. Until one day, by accident, I wrote a none tail-recursive function (so TCO didn’t apply to it). Someone told me and eagerly I replaced it with its tail-recursive counterpart. Then, I stopped for a second and benchmarked it – the results were surprising to say the least.

Before we do a deep dive into the topic, let’s take a refresher on tail call optimization from Dave Thomas in Programming Elixir 1.2 (great book!):

(…) it ends up calling itself. In many languages, that adds a new frame to the stack. After a large number of messages, you might run out of memory.

This doesn’t happen in Elixir, as it implements tail-call optimization. If the last thing a function does is call itself, there’s no need to make the call. Instead, the runtime simply jumps back to the start of the function. If the recursive call has arguments, then these replace the original parameters.

Well, let’s get into this 🙂

Writing a map implementation

So let’s write an implementation of the map function. One will be body-recursive, one will be tail-recursive. I’ll add another tail-recursive implementation using ++ but no reverse and one that just does not reverse the list in the end. The one that doesn’t reverse the list of course isn’t functionally equivalent to the others as the elements are not in order, but if you wrote your own function and don’t care about ordering this might be for you. In an update here I also added a version where the argument order is different, for more on this see the results and edit6.

map_body here is the function I originally wrote. It is not tail-recursive as the last operation in this method is the list append operation, not the call to map_body. Comparing it to all the other implementations, I’d also argue it’s the easiest and most readable as we don’t have to care about accumulators or reversing the list.

Now that we have the code, let us benchmark the functions with benchee! Benchmark run on Elixir 1.3 with Erlang 19 on an i7-4790 on Linux Mint 17.3. Let’s just map over a large list and add one to each element of the list. We’ll also throw in the standard library implementation of map as a comparison baseline:

(The benchmarking was actually done by this a little more verbose but equivalent script so it generates the CSV ouput and console output on the same run using benchee’s slightly more verbose interface. Feature to make that possible with the nicer interface is planned 😉 )

For the more visual here is also a graph showcasing the results (visualized using benchee_csv):

Graphing iterations per second (higher is better)

So what do we see? The body-recursive function seems to be as fast as the version from standard library. The reported values are faster, but well within the margin of error. Plus the median of the two is the same while standard deviation is higher for the standard library version. This hints at the possibility that the worse average may be through some outliers (resulting f.ex. from Garbage Collection). The tail-recursive version with ++ is VERY SLOW but that’s because appending with ++ so frequently is a bad idea as it needs to go to the end of the linked list every time around (O(n)). But that’s not the main point.

The main point is that the tail-recursive version is about 14% slower! Even the tail-recursive version that doesn’t reverse the list is slower than the body-recursive implementation!

Here seems like a good point to interject and mention that it was brought up in the comments (see edit11) that for significantly larger lists tail-recursive implementations get faster again. You can check out the results from a 10 Million item list.

What is highly irritating and surprising to me is that the tail-recursive function with a slightly different argument order is significantly faster than my original implementation, almost 10%. And this is not a one off – it is consistently faster across a number of runs. You can see more about that implementation in edit6 below. Thankfully José Valim chimed in about the argument order adding the following:

The order of arguments will likely matter when we generate the branching code. The order of arguments will specially matter if performing binary matching. The order of function clauses matter too although I am not sure if it is measurable (if the empty clause comes first or last).

Now, maybe there is a better tail-recursive version (please tell me!) but this result is rather staggering but repeatable and consistent. So, what happened here?

An apparently common misconception

That tail-recursive functions are always faster seems to be a common misconception – common enough that it made the list of Erlang Performance Myths as “Myth: Tail-Recursive Functions are Much Faster Than Recursive Functions”! (Note: this section is currently being reworked so the name might change/link might not lead to it directly any more in the near-ish future)

To quote that:

A body-recursive list function and a tail-recursive function that calls lists:reverse/1 at the end will use the same amount of memory. lists:map/2, lists:filter/2, list comprehensions, and many other recursive functions now use the same amount of space as their tail-recursive equivalents.

So, which is faster? It depends. On Solaris/Sparc, the body-recursive function seems to be slightly faster, even for lists with a lot of elements. On the x86 architecture, tail-recursion was up to about 30% faster

The topic also recently came up on the erlang-questions mailing list again while talking about the rework of the aforementioned Erlang performance myths site (which is really worth the read!). In it Fred Hebert remarks (emphasis added by me):

In cases where all your function does is build a new list (or any other accumulator whose size is equivalent to the number of iterations and hence the stack) such as map/2 over nearly any data structure or say zip/2 over lists, body recursion may not only be simpler, but also faster and save memory over time.

He also has his own blog post on the topic.

But won’t it explode?

I had the same question. From my experience with the clojure koans I expected the body-recursive function to blow up the call stack given a large enough input. But, I didn’t manage to – no matter what I tried.

Seems it is impossible as the BEAM VM, that Erlang and Elixir run in, differs in its implementation from other VMs, the body recursion is limited by RAM:

Erlang has no recursion limit. It is tail call optimised. If the recursive call is not a tail call it is limited by available RAM

Memory consumption

So what about memory consumption? Let’s create a list with one hundred million elements (100_000_000) and map over it measuring the memory consumption. When this is done the tail-recursive version takes almost 13 Gigabytes of memory while the body-recursive version takes a bit more than 11.5 Gigabytes. Details can be found in this gist.

Why is that? Well most likely here with the large list it is because the tail recursive version needs to create a new reversed version of the accumulator to return a correct result.

Body-recursive functions all the time now?

So let’s recap, the body-recursive version of map is:

  • faster
  • consumes less memory
  • easier to read and maintain

So why shouldn’t we do this every time? Well there are other examples of course. Let’s take a look at a very dumb function deciding whether a number is even (implemented as a homage to this clojure kaons exercise that showed how the call stack blows up in Clojure without recur):

The tail-recursive version here is still 10% slower. But what about memory? Running the function with one hundred million as input takes 41 Megabyte for the tail-recursive version (mind you, this is the whole elixir process) but almost 6.7 Gigabyte for the body-recursive version. Also, for that huge input the tail-recursive version took 1.3 seconds, while the body-recursive function took 3.86 seconds. So for larger inputs, it is faster.

Stark contrast, isn’t it? That’s most likely because this time around there is no huge list to be carried around or accumulated – just a boolean and a number. Here the effect that the body-recursive function needs to save its call stack in the RAM has a much more damaging effect, as it needs to call itself one hundred million times.

Rerunning the map implementation with a significantly larger list, the tail-recursive versions also become faster than the standard library version and the body-recursive function. See edit11.

So, what now?

Tail-recursive functions still should be faster and more efficient for many or most use cases. Or that’s what I believe through years of being taught that tail call optimization leads to the fastest recursive functions ;). This post isn’t to say that TCO is bad or slow. It is here to say and highlight that there are cases where body-recursive functions are faster and more efficient than tail-recursive functions. I’m also still unsure why the tail-recursive function that does not reverse the list is still slower than the body-recursive version – it might be because it has to carry the accumulator around.

Maybe we should also take a step back in education and teaching and be more careful not to overemphasize tail call optimization and with it tail-recursive functions. Body-recursive functions can be a viable, or even superior, alternative and they should be presented as such.

There are, of course, cases where writing tail-recursive functions is absolutely vital, as Robert Virding, creator of Erlang, rightfully highlights:

No, the main case where is TCO is critical is in process top-loops. These functions never return (unless the process dies) so they will build up stack never to release it. Here you have to get it right. There are no alternatives. The same applies if you top-loop is actually composed of a set of mutually calling functions. There there are no alternatives. Sorry for pushing this again, and again, but it is critical. :slight_smile:

But what does this teach us in the end? Don’t take your assumptions stemming from other programming environments for granted. Also, don’t assume – always proof. So let’s finish with the closing words of the Erlang performance myths section on this:

So, the choice is now mostly a matter of taste. If you really do need the utmost speed, you must measure. You can no longer be sure that the tail-recursive list function always is the fastest.

edit1: there previously was a bug in is_even_tco? with a missing not, not caught by my tests as they were calling the wrong function 😦 Thanks to Puella for notifying me.

edit2/addendum: (see next edit/update) It was pointed out at that running it in an erlang session the body-recursive function was significantly slower than the tco version. Running what I believe to be equivalent code in Elixir and Erlang it seems that the Erlang map_body version is significantly slower than in Elixir (2 to 3 times by the looks of it). I’d need to run it with an Erlang benchmarking tool to confirm this, though.

edit3/addendum2: The small tries mentioned in the first addendum were run in the shell which is not a great idea, using my little erlang knowledge I made something that compiled that “benchmark” and map_body is as fast/faster again thread. Benchmarking can be fickle and wrong if not done right, so would still look forward to run this in a proper Erlang benchmarking tool or use Benchee from Erlang. But no time right now 😦

edit4: Added comment from Robert Virding regarding process top loops and how critical TCO is there. Thanks for reading, I’m honoured and surprised that one of the creators of Erlang read this 🙂 His full post is of course worth a read.

edit5: Following the rightful nitpick I don’t write “tail call optimized” functions any more but rather “tail-recursive” as tail call optimization is more of a feature of the compiler and not directly an attribute of the function

edit6: Included another version in the benchmark that swaps the argument order so that the list stays the first argument and the accumulator is the last argument. Surprisingly (yet again) this version is constantly faster than the other tail-recursive implementation but still slower than body recursive. I want to thank Paweł for pointing his version out in the comments. The reversed argument order was the only distinguishing factor I could make out in his version, not the assignment of the new accumulator. I benchmarked all the different variants multiple times. It is consistently faster, although I could never reproduce it being the fastest. For the memory consumption example it seemed to consume about 300MB less than the original tail-recursive function and was a bit faster. Also since I reran it either way I ran it with the freshly released Elixir 1.3 and Erlang 19. I also increased the runtime of the benchmark as well as the warmup (to 10 and 10) to get more consistent results overall. And I wrote a new benchmarking script so that the results shown in the graph are from the same as the console output.

edit7: Added a little TCO intro as it might be helpful for some 🙂

edit8: In case you are interested in what the bytecode looks like, I haven’t done it myself (not my area of expertise – yet) there is a gist from Sasa Juric showing how you can get the bytecode from an elixir function

edit9: Added José’s comment about the argument order, thanks for reading and commenting! 🙂

edit10: Added CPU and OS information 🙂

edit11: It was pointed out in the comments that the performance characteristics might differ for larger lists. And it does, taking 1000 times as many elements (10_000_000) all tail-recursive versions are significantly faster than the body-recursive version (and even the stdlib version).

edit12: Updated to reflect newer Benchee API as the old API with lists of tuples doesn’t work anymore 🙂


Benchee 0.2.0 – warmup & nicer console output

Less than a week after the initial release of my benchmarking library Benchee there is a new version – 0.2.0! The details are in the Changelog. That’s the what, but what about the why?


Arguably the biggest change is introduction of a warmup phase to the benchmarks. That is the benchmark jobs are first run for some time without taking measurements to simulate a “warm” already running system. I didn’t think it’d be that important as the BEAM VM isn’t JITed (as opposed to the JVM) for all hat I know. It is important once benchmarks get to be “macro” – for instance databases usually respond faster once they got used to some queries and our webservers serve most of their time “hot”.

However, even in my micro benchmarks I noticed that it could have an effect when a benchmark was moved around (being run first versus being run last). So I don’t know to what effect, but at least to a small effect there is warmup now. If you don’t want warmup – just set warmup: 0.

Nicer console output

Name                                    ips        average    deviation         median
bodyrecusrive map                  40047.87        24.97μs    (±32.55%)        25.00μs
stdlib map                         39724.07        25.17μs    (±61.41%)        25.00μs
map tco no reverse                 36388.50        27.48μs    (±23.22%)        27.00μs
map with TCO and reverse           33309.43        30.02μs    (±45.39%)        29.00μs
map with TCO and ++                  465.25      2149.40μs     (±4.84%)      2138.00μs

bodyrecusrive map                  40047.87
stdlib map                         39724.07 - 1.01x slower
map tco no reverse                 36388.50 - 1.10x slower
map with TCO and reverse           33309.43 - 1.20x slower
map with TCO and ++                  465.25 - 86.08x slower

The ouput of numbers is now aligned right, which makes them easier to read and compare, as you can see orders of magnitude differences much more easily. Also the ugly empty line at the end of the output has been removed 🙂


This is the API incompatible change. It felt weird to me in version 0.1.0 that Benchee.benchmark would already run the function given to it. Now the jobs are defined through Benchee.benchmark and kept in a datastructure (similar to the one uses). Benchee.measure then runs the jobs and measures the outcome and provides them under the new run_times key instead of overriding the jobs key. This feels much nicer overall, of course the high level is unaffected by this.

These additions already nicely improve what Benchee can do and got a couple of items off my “I want to do this in benchee” bucket list. There’s still more to come 🙂

Introducing Benchee: simple and extensible benchmarking for Elixir

If you look around this blog it becomes pretty clear that I really love (micro) benchmarking. Naturally, while working more and more with Elixir (and loving it!) I wanted to benchmark something. Sadly, the existing options I found didn’t quite satisfy me. Be it for a different focus, missing statistics, lacking documentation or other things. So I decided to roll my own, it’s not like it’d be the first time.

Of course I tried extending existing solutions but very long functions, very scarce test coverage, lots of dead and outcommented code and a rotting PR later I decided it was time to create something new. So without further ado, please meet Benchee (of course available on hex)!

What’s great about Benchee?

Benchee is easy to use, well documented and can be extended (more on that in the following paragraphs). Benchee will run each benchmarking function you give it for a given amount of time and then compute statistics from it. Statistics is where it shines in my opinion. Benchee provides you with:

  • average run time (ok – yawn)
  • iterations per second, which is great for graphs etc. as higher is better here (as opposed to average run time)
  • standard deviation, an important value in my opinion as it gives you a feeling for how certain you can be about your measurements and how much they vary. Sadly, none of the elixir benchmarking tools I looked at supplied this value.
  • median, it’s basically the middle value of your distribution and is often cited as a value that reflects the “common” outcome better than average as it cuts out outliers. I never used a (micro) benchmarking tool that provided this value, but was often asked to provide it in my benchmarks. So here it is!

Also it gives a rather nice output on the console with headers so you know what is what. An example is further down but for now let’s talk design…

Designing a Benchmarking library

The design is influenced by my favourite ruby benchmarking library: benchmark-ips. Of course I wanted it to be more of an elixirish spin and offer more options.

A lot of elixir solutions used macros. I wanted something that works purely with functions, no tricks. When I started to learn more about functional programming one of the things that stuck with me the most was that functional programming is about a series of transformations. So what do these transformations look like for benchmarking?

  1. Create a basic benchmarking configuration with things like how long should the benchmark run, should GC be enabled etc.
  2. Run individual benchmarks and record their raw execution times
  3. Compute statistics based on these raw run times per benchmark
  4. Format the statistics to be suitable for output
  5. Put out the formatted statistics to the console, a file or whatever

So what do you now, that’s exactly what the API of Benchee looks like!

list = Enum.to_list(1..10_000)
map_fun = fn(i) -> [i, i * i] end

Benchee.init(%{time: 3})
|> Benchee.benchmark("flat_map", fn -> Enum.flat_map(list, map_fun) end)
|> Benchee.benchmark("map.flatten",
                     fn -> list |> |> List.flatten end)
|> Benchee.statistics
|> Benchee.Formatters.Console.format
|> IO.puts

What’s great about this? Well it’s super flexible and flows nicely with the beloved elixir pipe operator.

Why is this flexible and extensible? Well, don’t like how Benchee runs the benchmarks? Sub in your own benchmarking function! Want more/different statistics? Go use your own function and compute your own! Want results to be displayed in a different format? Roll you own formatter! Or you just want to write the results to a file? Well, go ahead!

This is more than just cosmetics. It’d be easy to write a plugin that converts the results to some JSON format and then post them to a web service to gather benchmarking results or let it generate fancy graphs for you.

Of course, not everybody needs that flexibility. Some people might be scared away by the verboseness above. So there’s also a higher level interface that uses all the options you see above and condenses them down to one function call to efficiently define your benchmarks:

list = Enum.to_list(1..10_000)
map_fun = fn(i) -> [i, i * i] end{time: 3},
             [{"flat_map", fn -> Enum.flat_map(list, map_fun) end},
              fn -> list |> |> List.flatten end}])

Let’s see some results!

You’ve seen two different ways to run the same benchmark with Benchee now, so what’s the result and what does it look like? Well here you go:

tobi@happy ~/github/benchee $ mix run samples/run.exs
Benchmarking flat_map...
Benchmarking map.flatten...

Name                          ips            average        deviation      median
map.flatten                   1311.84        762.29μs       (±13.77%)      747.0μs
flat_map                      896.17         1115.86μs      (±9.54%)       1136.0μs

map.flatten                   1311.84
flat_map                      896.17          - 1.46x slower

So what do you know, much to my own surprise calling map first and then flattening the result is significantly faster than a one pass flat_map. Which is unlike ruby, where flat_map is over two times fast in the same scenario. So what does that tell us? Well, what we think about performance from other programming languages might not hold true. Also, that there might be a bug in flat_map – it should be faster for all that I know. Need some time to investigate 🙂

All that aside, wouldn’t a graph be nice? That’s a feature I envy benchfella for. But wait, we got this whole extensible architecture right? Generating the whole graph myself with error margins etc. might be a bit tough, though. But I got LibreOffice on my machine. A way to quickly feed my results into it would be great.

Meet BencheeCSV (the first and so far only Benchee plugin)! With it we can substitute the formatting and output steps to generate a CSV file to be consumed by a spreadsheet tool of our choice:

file =!("test.csv", [:write])
list = Enum.to_list(1..10_000)
map_fun = fn(i) -> [i, i * i] end

|> Benchee.benchmark("flat_map", fn -> Enum.flat_map(list, map_fun) end)
|> Benchee.benchmark("map.flatten",
                     fn -> list |> |> List.flatten end)
|> Benchee.statistics
|> Benchee.Formatters.CSV.format
|> Enum.each(fn(row) -> IO.write(file, row) end)

And a couple of clicks later there is a graph including error margins:


How do I get it?

Well, just add benchee or benchee_csv to the deps of your mix.exs!

def deps do
  [{:benchee, "~> 0.1.0", only: :dev}]

Then run mix deps.get, create a benchmarking folder and create your new my_benchmark.exs! More information can be found in the online documentation or at the github repository.

Anything else?

Well Benchee tries to help you, that’s why when you try to micro benchmark an extremely fast function you might happen upon this beauty of a warning:

Warning: The function you are trying to benchmark is super fast, making time measures unreliable!
Benchee won’t measure individual runs but rather run it a couple of times and report the average back. Measures will still be correct, but the overhead of running it n times goes into the measurement. Also statistical results aren’t as good, as they are based on averages now. If possible, increase the input size so that an individual run takes more than 10μs

The reason why I put it there is pretty well explained. The measurements would simply be unreliable as randomness and the measuring itself have too huge of an impact. Plus, measurements are in micro seconds – so it’s not that accurate either. I tried nano seconds but quickly discarded them as that seemed to add even more overhead.

Benchee tries to run your benchmark n times then and measure that, while it improves the situation somewhat it adds the overhead of my repeat_n function to the benchmark.

So if you can, please benchmark with higher values 🙂

Ideas for the future?

Benchee is just version 0.1.0, but a lot of work, features and thought has already gone into it. Here are features that I thought about but decided they are not necessary for a first release:

  • Turning off/reducing garbage collection: Especially micro benchmarking can be affected by garbage collection as single runs will be much slower than the others leading to a sky rocketing standard deviation and unreliable measures. Sadly,  to the best of my knowledge, one can’t turn off GC on the BEAM. But people have shown me options where I could just set a very high memory space to reduce the chance of GC. Need to play with it.
  • Auto scaling units: It’d be nice to for instance show the average time in milliseconds if a benchmark is slower or write something to the effect of “80.9 Million” iterations per second for the console output for a fast benchmark.
  • Better alignment for console output. Right now it’s left aligned, I think right alignment looks better and helps compare results.
  • Making sure Benchee is also usable for more macro benchmarks, e.g. functions that run in the matter of seconds or even minutes
  • Correlating to that, also provide the option to specify a warmup time. Elixir/Erlang isn’t JITed so it should have no impact there, but for macro benchmarks on phoenix or so with the database it should have an impact.
  • Give measuring memory consumption a shot
  • More statistics: Anything you are missing, wishing for?
  • Graph generation: A plugin to generate and share a graph right away would be nice
  • Configurable steps in Right now if you want to use a plugin you have to use the more “verbose” API of Benchee. If Benchee gains traction and plugins really become a thing it’d be nice to configure them in the high level API like %{formatter: MyFormatModule} or %{formatter: MyFormatModule.format/1}.

So that’s it – have anything you’d like to see in Benchee? Please get in touch and let me know! In any case, give Benchee a try and happy benchmarking!


Elixir 1.3’s mix xref working its magic in a real world example

The upcoming elixir 1.3, available as a release candidate, brings along many cool new features such as Calendar Types and (finally!) test groups for ExUnit.  My favorite new feature hasn’t been discussed a lot though. It’s also missing in the “What’s coming in Elixir 1.3” blog post. So, I’d like to show it off: mix xref

So what is mix xref? In the words of the changelog:

Mix v1.3 includes a new task called xref that performs cross reference checks in your code. One of such checks is the ability to find calls to modules and functions that do not exist

Which is a great addition, finding these bugs where you have a typo or forgot an alias/import at compile time! Frankly I thought/hoped elixir already did this, and it did but only sometimes (e.g. you couldn’t import a module that isn’t there).

The even better information is that mix xref is executed by default when you code (and its dependencies) are compiled – so check the warnings after upgrading!

Your tests should generally capture these bugs, but a compile time warning is even earlier in the process and not everything is well tested. We experienced that first hand last week where it took us some time to trace a test failure back to a bug in timex_ecto and opened a PR improving test coverage with failing test cases for the bug. Curios mind that I am I wanted to see if mix xref would have found that call to an undefined module and let’s see:

tobi@airship ~/github/timex_ecto $ elixir -v
Erlang/OTP 18 [erts-7.3]  [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false]

Elixir 1.3.0-rc.0 (7881123)
tobi@airship ~/github/timex_ecto $ mix clean
tobi@airship ~/github/timex_ecto $ mix compile
Compiling 7 files (.ex)
warning: undefined protocol function dump/1 (for protocol Ecto.DataType)

warning: undefined protocol function dump/1 (for protocol Ecto.DataType)

warning: function Ecto.Type.blank?/1 is undefined or private

warning: function Ecto.Type.blank?/1 is undefined or private

warning: function Ecto.Type.blank?/1 is undefined or private

warning: function Ecto.DateTimeWithTimezone.cast/1 is undefined (module Ecto.DateTimeWithTimezone is not available)

warning: function Ecto.Type.blank?/1 is undefined or private

Generated timex_ecto app

It found not only the bug we were hitting (Ecto.DateTimeWithTimeZone is undefined) but more sources of potential bugs. Which is great, as my hope is that those would have never made it into a hex package release with mix xref.

This is why I believe that mix xref might have the biggest impact on the Elixir eco system of all changes coming in 1.3. Don’t get me wrong, I love test groups, calendar types and all the other updates. But this should directly improve the quality of both released libraries and the  development experience. Sure, these should all be caught by tests but (sadly) not everybody tests that rigorously.

There are over 30 of such problems reported by xref in the dependencies of our relatively small phoenix app (standard phoenix + edeliver + maybe 5 other dependencies and their dependencies). That’s potential crashes, bugs and dead code looming that I’d rather avoid. Hence, I’m REALLY excited for Elixir 1.3 and mix xref and you should be too 😉


Slides: Ruby to Elixir – what’s great and what you might miss

This is a talk I gave at the Polygot Tech Meetup in Berlin in April. It has parts of my previous elixir talk while adding a new perspective more directly comparing Ruby and Elixir as well as being shorter as it was used as an intro to a fun panel discussion.


Elixir and Phoenix are known for their speed, but that’s far from their only benefit. Elixir isn’t just a fast Ruby and Phoenix isn’t just Rails for Elixir. Through pattern matching, immutable data structures and new idioms your programs can not only become faster but more understandable and maintainable. While we look at the upsides we’ll also have a look at what you might be missing and could be improved.