Released: Statistex 1.0, an Elixir statistics calculation library

I just released statistex – a library to calculate statistics based on a given data sample. It’s an extraction from my benchmarking library benchee and part of what I’ll henceforth call the “benchee library family”. As it’s been running in benchee basically since its inception I dubbed it 1.0.

The extraction is good because it helps the benchee code base focus on benchmarking and not things around it. It removes about 800 lines from the repo and it makes that code reusable for everyone else. It also makes it easier to introduce new statistics as it’s clearer that we’ll first introduce the logic inside statistex and later on just display it in benchee and friends. It’s good design.

Do we really need another statistics library?

I struggled with this question. I don’t want to split the ecosystem unnecessarily. To be honest, I thought there was no other library out there. There are at least statistics and Numerix, though. Not sure how I missed them when I started incorporating more advanced statistics.

Both include more functions than just for statistics: general math and more (drawing of random values for instance). They also support more statistical values than statistex at the time of this writing. So if you’re looking for something that statistex doesn’t provide (yet), these are some of the first places I’d look.

Why would you still want to use statistex? Well there are some things it offers that the others don’t (to the best of my knowledge):

statistics/2 – Give me everything!

As mentioned before, statistex was extracted and the way it was used in benchee is just “give me all the statistics so that formatters can display them!” – so there’s a function that does exactly this:
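A minimal sketch of such a call, assuming statistex is already a dependency of the project you run it in (for instance inside iex -S mix). The sample values are made up for illustration, and only a few fields of the resulting %Statistex{} struct are read:

```elixir
# Assumption: statistex ~> 1.0 is available in the current project.
samples = [1, 3.0, 2.35, 11.0, 1.37, 35, 5.5, 10, 0, 2.35]

# One call computes all statistics and returns them in a %Statistex{} struct.
stats = Statistex.statistics(samples)

# Read the values you are interested in from the struct.
IO.inspect(stats.sample_size, label: "sample_size")
IO.inspect(stats.average, label: "average")
IO.inspect(stats.median, label: "median")
IO.inspect(stats.standard_deviation, label: "standard_deviation")
```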

What’s cool about this? Well, little effort, big result! This is quite nice for exploring data sets in iex, for instance. Behind the scenes statistex reuses previously calculated values so that no value is calculated twice. For instance, first you get the sample_size and the total, and both are then used to calculate the average. The average and sample_size are then passed on to calculate the variance and so forth. This way statistex is fast by not duplicating work if you want a bunch of statistical values (and benchee wants most of them).
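The reuse chain can be sketched with nothing but the standard library. This is not statistex’s actual code: ReuseSketch is a hypothetical module, and the use of the sample variance (dividing by n - 1) is an assumption made for the illustration:

```elixir
defmodule ReuseSketch do
  # Hypothetical illustration: every value is computed once and handed on
  # to the calculations that depend on it.
  def summarize(samples) when is_list(samples) and samples != [] do
    sample_size = length(samples)
    total = Enum.sum(samples)
    # The average reuses total and sample_size.
    average = total / sample_size
    # The variance reuses average and sample_size.
    variance = variance(samples, average, sample_size)

    %{
      sample_size: sample_size,
      total: total,
      average: average,
      variance: variance,
      standard_deviation: :math.sqrt(variance)
    }
  end

  # A single sample has no spread.
  defp variance(_samples, _average, 1), do: 0.0

  # Sample variance (assumption: divide by n - 1).
  defp variance(samples, average, sample_size) do
    sum_of_squares =
      samples
      |> Enum.map(fn sample -> (sample - average) * (sample - average) end)
      |> Enum.sum()

    sum_of_squares / (sample_size - 1)
  end
end

result = ReuseSketch.summarize([4, 9, 11, 12, 17, 5, 8, 12, 12])
IO.inspect(result.average, label: "average")
IO.inspect(result.variance, label: "variance")
```

The sample list is traversed for length and sum only once each, no matter how many of the derived values are requested afterwards.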

But you don’t want all of these values but would still like to reuse previously calculated values? Got you covered!

Manually reuse previously calculated values

Say you want to calculate the variance but have already calculated the average and sample_size. Easy:
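A small sketch of what that can look like, again assuming statistex is a dependency of the current project. The sample values are made up; the :sample_size and :average options are the ones the post refers to:

```elixir
# Assumption: statistex ~> 1.0 is available in the current project.
samples = [4, 9, 11, 12, 17, 5, 8, 12, 12]

# Values that were calculated earlier, for whatever reason.
sample_size = Statistex.sample_size(samples)
average = Statistex.average(samples, sample_size: sample_size)

# Hand them in so that variance/2 does not calculate them again.
variance = Statistex.variance(samples, average: average, sample_size: sample_size)

IO.inspect(variance, label: "variance")
```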

Like variance/2, a lot of functions take an optional keyword list as an argument where you can provide previously calculated values (options are of course documented).

Raising on empty list input

Thanks to a discussion on the Elixir Forum I made the decision to raise an ArgumentError when statistex is handed an empty list in most functions. Why? The statistics don’t make any sense at this point and it’s likely some kind of error either way, or at least you probably want to display something other than a bunch of nils.

I know many won’t consider this a feature, but I quite like the direction it pushes client code towards.
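One way client code can deal with this is sketched below, assuming statistex is a dependency of the current project. The safe_average function and the {:error, :no_samples} return value are hypothetical names chosen for the example:

```elixir
# Assumption: statistex ~> 1.0 is available in the current project.
# Hypothetical wrapper that turns the raised error into a tagged tuple.
safe_average = fn samples ->
  try do
    {:ok, Statistex.average(samples)}
  rescue
    # Raised by statistex when it is handed an empty list.
    ArgumentError -> {:error, :no_samples}
  end
end

IO.inspect(safe_average.([1, 2, 3]))
IO.inspect(safe_average.([]))
```

Checking for an empty list before calling statistex at all is of course just as valid, and arguably the direction the error is meant to push you in.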

Is this enough for a new library?

I think so. 🙂 Especially getting all values at once and reusing previously calculated values are incredibly valuable properties. I also like that statistex is solely focussed on statistics. And the statistics it can’t compute yet? We’ll catch up on that over time. Moreover, it’s not like I spent some huge amount of work writing it as it was a simple extraction.

I’d be happy if you gave statistex a trial run and left some feedback.

 

Released: benchee 0.10, HTML, CSV and JSON plugins

It’s been a little while since the last benchee release. Have we been lazy? Au contraire, mes amis! We’ve been hard at work, greatly improving the internals, adding a full system for hooks (before_scenario, before_each, after_each, after_scenario) and some other great improvements thanks to many contributions. The releases are benchee 0.10.0 (CHANGELOG), benchee_csv 0.7.0 (CHANGELOG), benchee_html 0.4.0 (CHANGELOG) and benchee_json 0.4.0 (CHANGELOG).

Sooo… what’s up? Why did it take so long?

benchee

Before we take a look at the exciting new features, here’s a small summary of major things that happened in previous releases that I didn’t manage to blog about due to lack of time:

0.7.0 added mainly convenience features, but benchee_html 0.2.0 split up the HTML reports which made it easier to find what you’re looking for but also alleviated problems with rendering huge data sets (the graphing library was reaching its limits with that many graphs and input values)

0.8.0 added type specs for the major public functions, and configuration is now a struct, so it errors out on unrecognized options

0.9.0 is one of my favorite releases as it now gathers and shows system data like number of cores, operating system, memory and CPU speed. I love this, because normally when I benchmark and write about it I need to write all of that up in the blog post. Now with benchee I can just copy & paste the output and I get all the information that I need! This version also facilitates calling benchee from Erlang, so benchee:run is in the cards.

Now ahead, to the truly new stuff:

Scenarios

In benchee each processing step used to have its own main key in the main data structure (suite): run_times, statistics, jobs etc. Philosophically, that was great. However, it got more cumbersome in the formatters especially after the introduction of inputs as access now required an additional level of indirection (namely, the input). As a result, to get all the data for a combination of job and input you want to format you have got to merge the data of multiple different sources. Not exactly ideal. To make matters worse, we want to add memory measurements in the future… even more to merge.

Long story short, Devon and I sat down in person for 2 hours to discuss how to best deal with this, how to name it and all accompanying fields. We decided to keep all the data together from now on – for every entry of the result. That means each combination of a job you defined and an input. The data structure now keeps that along with its raw run times, statistics etc. After some research we settled on calling it a scenario.
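To make the difference concrete, here is a simplified sketch using plain maps. The exact shapes and field names (job_name, input_name, run_times, run_time_statistics) are assumptions for illustration and not a copy of benchee’s real structs:

```elixir
# Hypothetical, simplified shape of the old suite: one top-level key per
# processing step, each nested by input name and then by job name.
old_suite = %{
  run_times: %{"small list" => %{"flat_map" => [120, 130, 125]}},
  statistics: %{"small list" => %{"flat_map" => %{average: 125.0}}}
}

# The merging every formatter used to do: collect the data of all steps
# into one entry per combination of job and input.
scenarios =
  for {input_name, jobs} <- old_suite.run_times,
      {job_name, run_times} <- jobs do
    %{
      job_name: job_name,
      input_name: input_name,
      run_times: run_times,
      run_time_statistics: old_suite.statistics[input_name][job_name]
    }
  end

IO.inspect(scenarios)
```

With scenarios as the main data structure this merged form is what formatters receive directly, and a new kind of measurement becomes one more field instead of one more source to merge.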

This was a huge refactoring but we really like the improvements it yielded. Devon wrote about the refactoring process in more detail.

It took a long time, but it didn’t add any new features – so no reason for a release yet. Plus, of course all formatters also needed to get updated.

Hooks

Another huge chunk of work went into a hooks system that is pretty fully featured. It allows you to execute code before and after invoking the benchmark as well as setup code before a scenario starts running and teardown code for after a scenario stopped running.

Holding back a release for this may seem weird, as most of the time you won’t need hooks. We could have released with part of the system ready, but I didn’t want to (potentially) break API again and so soon if we added arguments or found that it wasn’t quite working to our liking. So, we took some time to get everything in.

So what did we want to enable you to do?

  • Load a record from the database in before_each and pass it to the benchmarking function, to perform an operation with it without counting the time for loading the record towards the benchmarking results
  • Start up a process/service in before_scenario that you need for your scenario to run, and then…
  • …shut it down again in after_scenario, or bust a cache
  • Or if you want your benchmarks to run without a cache all the time, you can also bust it in before_each or after_each
  • after_each is also passed the return value of the benchmarking function so you can run assertions on it – for instance for all the jobs to see if they are truly doing the same thing
  • before_each could also be used to randomize the input a bit to benchmark a more diverse set of inputs without the randomizing counting towards the measured times

All of these hooks can be configured either globally so that they run for all the benchmarking jobs or they can be configured on a per job basis. The documentation for hooks over at the repo is a little blog post by itself and I won’t repeat it here 😉
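Still, a small sketch of globally configured hooks might help. It assumes benchee 0.10 or later is a dependency of the current project. The jobs, the input and the :hook_demo_counter process are made up for the example, and the hook arguments follow my understanding of the hooks documentation:

```elixir
# Assumption: benchee 0.10 or later is available in the current project.
sorted = Enum.to_list(1..1_000)

Benchee.run(
  %{
    "Enum.sort/1" => fn list -> Enum.sort(list) end,
    "Enum.sort_by/2" => fn list -> Enum.sort_by(list, & &1) end
  },
  inputs: %{"1k integers" => sorted},
  time: 2,
  warmup: 1,
  # Once per scenario: start a (hypothetical) helper process it needs.
  before_scenario: fn input ->
    {:ok, _pid} = Agent.start(fn -> 0 end, name: :hook_demo_counter)
    input
  end,
  # Before every invocation: randomize the input, the shuffling is not measured.
  before_each: fn list -> Enum.shuffle(list) end,
  # After every invocation: check the return value of the benchmarking function.
  after_each: fn result ->
    true = result == sorted
    Agent.update(:hook_demo_counter, &(&1 + 1))
  end,
  # Once per scenario: shut the helper process down again.
  after_scenario: fn _input ->
    Agent.stop(:hook_demo_counter)
  end
)
```

Explicit inputs are used here so that every hook and every benchmarking function clearly receives an argument.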

As a little example, here is me benchmarking hound:

Hound needs to start before we can benchmark it. However, hound seems to remember the started process by the pid of self() at that time. That’s a problem because each benchee scenario runs in its own process, so you couldn’t just start it before invoking Benchee.run. I found no way to make the benchmark work with good old benchee 0.9.0, which is also what finally brought me to implement this feature. Now in benchee 0.10.0 with before_scenario and after_scenario it is perfectly feasible!

Why no 1.0?

With all the major improvements one could easily call this a 1.0. Or 0.6.0 could have been a 1.0 then we’d be at 2.0 now – wow that sounds mature!

Well, I see 1.0 as a promise – a promise for plugin developers and others that compatibility won’t be broken easily and not soon. Can’t promise this when we just broke plugin compatibility in a major way. That said, I really feel good about the new structure, partly because we put so much time and thought into figuring it out, but also because it has greatly simplified some implementations and, thinking about some future features, it makes them a lot easier to implement as well.

Of course, we didn’t break compatibility for users. That has been stable since 0.6.0 and to a (quite big) extent beyond that.

So, 1.0 will of course be coming some time. We might get some more big features in that could break compatibility (although I don’t think they will, it will just be new fields):

  • Measuring memory consumption
  • recording and loading benchmarking results
  • … ?

Also, before a 1.0 release I probably want to extract more functionality that isn’t directly related to benchmarking from benchee and provide it as general purpose libraries. We have some subsystems that we built for ourselves that would also provide value to other applications:

  • Unit: convert units (durations, counts, memory etc.), scale them to a “best fit” unit, format them accordingly, find a best fit unit for a collection of values
  • Statistics: All the statistics we provide including not so easy/standard ones like nth percentile and mode
  • System: gather system data like elixir/erlang version, CPU, Operating System, memory, number of cores

Thanks to the design of benchee these are all already fairly separate so extracting them is more a matter of when, not how. Meaning that we want to have all the functionality we need in those libraries first, so that we don’t have to make a coordinated release for new features across n libraries.

benchee_html

Especially due to many great community contributions (maybe because of Hacktoberfest?) there are a number of stellar improvements!

  • System information is now also available and you can toggle it with the link in the top right
  • unit scaling from benchee “core” is now also used so it’s not all in microseconds as before but rather an appropriate unit
  • reports are automatically opened in your browser after the formatter is done (can of course be deactivated)
  • there is a default file name now so you don’t HAVE to supply it

What’s next?

Well, this release took a long time – hope the next one won’t take as long. There are a couple of improvements that didn’t quite make it into the release so there might be a smaller new release relatively soon. Other than that, work on either serializing or the often requested “measure memory consumption” will probably start some time. But first, we rest a bit 😉

Hope you enjoy benchmarking, and if you are missing a feature or getting hit by a bug, please open an issue.

 

 

Benchee 0.3.0 released – formatters, parallel benchmarking & more

Yesterday I released benchee 0.3.0! Benchee is a tool for (micro) benchmarking in Elixir focussing on being simple, extensible and providing you with good statistics. You can refer to the Changelog for detailed information about the changes. This post will look at the bigger changes and also give a bit of the why for the new features and changes.

Multiple formatters

Arguably the biggest feature in Benchee 0.3.0 is that it is now easy and built-in to configure multiple formatters for a benchmarking suite. This means that first the benchmark is run, and then multiple formatters are run on the benchmarking results. This way you can get both the console output and the corresponding csv file using BencheeCSV. This was a pain point for me before, as you could either get one or the other or you needed to use the more verbose API.
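A sketch of such a configuration in the style of the 0.3.0-era API, where the configuration map is the first argument. It assumes benchee 0.3.0 and a matching BencheeCSV release are dependencies of the project. The module names and the csv option are written from my understanding of those versions, and later releases changed both the argument order and how formatters are configured:

```elixir
# Assumption: benchee 0.3.0 and a compatible BencheeCSV are available.
list = Enum.to_list(1..10_000)
map_fun = fn i -> [i, i * i] end

Benchee.run(
  %{
    # Every formatter is run on the results of the same benchmark run.
    formatters: [
      &Benchee.Formatters.CSV.output/1,
      &Benchee.Formatters.Console.output/1
    ],
    # Custom option read by the CSV formatter to know where to write.
    csv: %{file: "my.csv"}
  },
  %{
    "flat_map" => fn -> Enum.flat_map(list, map_fun) end,
    "map.flatten" => fn -> list |> Enum.map(map_fun) |> List.flatten() end
  }
)
```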

You can also see the new output/1 functions at work; as opposed to format/1, they also really do the output themselves. BencheeCSV uses a custom configuration option to know which file to write to. This is also new, as now formatters have access to the full benchmarking suite, including configuration, raw run times and function definitions. This way they can be configured using configuration options they define themselves, or a plugin could graph all run times if it wanted to.

Of course, formatters default to just the built-in console formatter.

Parallel benchmarking

Another big addition is parallel benchmarking. In Elixir, this just feels natural to have. You can specify a parallel key in the configuration and that tells Benchee how many tasks should execute any given benchmarking job in parallel.
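The underlying idea can be sketched with the standard library alone. This is not Benchee’s implementation: ParallelSketch is a hypothetical module that runs the same function once in each of a given number of tasks and collects the run times:

```elixir
defmodule ParallelSketch do
  # Hypothetical helper: executes `fun` once in each of `parallel` tasks at
  # the same time and returns the measured run times in microseconds.
  def measure(fun, parallel) when parallel >= 1 do
    1..parallel
    |> Enum.map(fn _ ->
      # :timer.tc/1 returns {microseconds, return_value}.
      Task.async(fn -> fun |> :timer.tc() |> elem(0) end)
    end)
    |> Enum.map(&Task.await(&1, :infinity))
  end
end

job = fn -> 1..10_000 |> Enum.shuffle() |> Enum.sort() end
run_times = ParallelSketch.measure(job, 2)
IO.inspect(run_times, label: "run times in microseconds")
```

Every task measures its own run time, so the tasks compete for the same CPU cores while they are being measured.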

Of course, if you want to see how a system behaves under load – overloading might be exactly what you want to stress test the system. And this was exactly the reason why Leon contributed this change back to Benchee:

I needed to benchmark integration tests for a telephony system we wrote – with this system the tests actually interfere with each other (they’re using an Ecto repo) and I wanted to see how far I could push the system as a whole. Making this small change to Benchee worked perfectly for what I needed 🙂

(Of course it makes me extremely happy that people found adjusting Benchee for their use case simple, that’s one of the main goals of Benchee. Even better that it was contributed back ❤ )

If you want to see more information and detail about “to benchmark in parallel or not” you can check the Benchee wiki. Spoiler alert: The more benchmarks run in parallel, the slower they get. That happens to an acceptable degree until the system is overloaded (more tasks execute in parallel than there are CPU cores to take care of them). Also, deviation skyrockets.

While the effect seems not to be very significant for parallel: 2 on my system, the default in Benchee remains parallel: 1 for the mentioned reasons.

Print configuration information

Partly also due to the parallel change, Benchee will now print a brief summary of the benchmarking suite before executing it.


tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs

Benchmark suite executing with the following configuration:
warmup: 2.0s
time: 3.0s
parallel: 2
Estimated total run time: 10.0s

Benchmarking flat_map...
Benchmarking map.flatten...

Name                  ips        average    deviation         median
map.flatten       1268.15       788.55μs    (±13.94%)       759.00μs
flat_map           706.35      1415.72μs     (±8.56%)      1419.00μs

Comparison:
map.flatten       1268.15
flat_map           706.35 - 1.80x slower

This was done so that when people share their benchmarks online one can easily see the configuration they ran it with. E.g. was there any warmup time? Was the number of parallel tasks too high and are the results therefore that bad?

It also prints an estimated total run time (number of jobs * (warmup + time)), so you know if there’s enough time to go and get a coffee before a benchmark finishes.
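As a quick check of that formula against the output above, here is a tiny sketch. RunTimeEstimate is a hypothetical module name and not part of Benchee:

```elixir
defmodule RunTimeEstimate do
  # number of jobs * (warmup + time), all times in seconds
  def seconds(job_count, warmup, time), do: job_count * (warmup + time)
end

# Two jobs (flat_map and map.flatten) with warmup: 2.0s and time: 3.0s
estimate = RunTimeEstimate.seconds(2, 2.0, 3.0)
IO.puts("Estimated total run time: #{estimate}s")
```

For the sample run this gives 2 * (2.0 + 3.0) = 10.0 seconds, matching the printed estimate.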

Map instead of a list of tuples

What is also marked as a “breaking” change in the Changelog is actually not THAT breaking. The main data structure handed to Benchee.run was changed to a map instead of a list of tuples and all corresponding data structures changed as well (important for plugins to know).

It used to be a list of tuples because of the possibility that benchmarks with the same name would override each other. However, having benchmarks with the same name is nonsensical as you can’t discern their results in the output any way. So, this now feels like a much more fitting data structure.
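The difference between the two data structures is easy to see with plain Elixir. The job name and the functions below are made up for the illustration:

```elixir
# A list of tuples happily holds two jobs with the same name...
jobs_as_list = [
  {"sort", fn -> :first_definition end},
  {"sort", fn -> :second_definition end}
]

# ...while in a map the later entry overrides the earlier one.
jobs_as_map = Map.new(jobs_as_list)

IO.inspect(length(jobs_as_list), label: "jobs in the list")
IO.inspect(map_size(jobs_as_map), label: "jobs in the map")
IO.inspect(jobs_as_map["sort"].(), label: "surviving job returns")
```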

The old main data structure of a list of tuples still works and while I might remove it, I don’t expect to right now as all that is required to maintain it is 4 lines of code. This makes the fact that duplicated names no longer work the only real deprecation, although one might even call it a feature 😉

Last, but not least, this release is the first one that got some community contributions in, which makes me extremely happy. So, thanks Alvin and Leon! 😀