Released: Statistex 1.0, an Elixir statistics calculation library

I just released statistex – a library to calculate statistics based on a given data sample. It’s an extraction from my benchmarking library benchee and part of what I’ll henceforth call the “benchee library family”. As it’s been running in benchee basically since its inception, I dubbed it 1.0.

The extraction is good because it helps the benchee code base focus on benchmarking and not the things around it. It removes about 800 lines from the repo and makes that code reusable for everyone else. It also makes it easier to introduce new statistics, as it’s now clearer that we first implement the logic inside statistex and later just display it in benchee and friends. It’s good design.

Do we really need another statistics library?

I struggled with this question. I don’t want to split the ecosystem unnecessarily. To be honest, I thought there was no other library out there. There are at least statistics and Numerix, though. I’m not sure how I missed them when I started incorporating more advanced statistics.

Both include more than just statistics functions: general math and more (drawing random values, for instance). They also support more statistical values than statistex does at the time of this writing. So if you’re looking for something that statistex doesn’t provide (yet), these are some of the first places I’d look.

Why would you still want to use statistex? Well there are some things it offers that the others don’t (to the best of my knowledge):

statistics/2 – Give me everything!

As mentioned before, statistex was extracted from benchee, where the way it was used amounted to “give me all the statistics so that formatters can display them!” – so there’s a function that does exactly this:


iex> samples = [1, 3.0, 2.35, 11.0, 1.37, 35, 5.5, 10, 0, 2.35]
iex> Statistex.statistics(samples)
%Statistex{
  average: 7.156999999999999,
  frequency_distribution: %{
    0 => 1,
    1 => 1,
    10 => 1,
    35 => 1,
    1.37 => 1,
    2.35 => 2,
    3.0 => 1,
    5.5 => 1,
    11.0 => 1
  },
  maximum: 35,
  median: 2.675,
  minimum: 0,
  mode: 2.35,
  percentiles: %{50 => 2.675},
  sample_size: 10,
  standard_deviation: 10.47189577445799,
  standard_deviation_ratio: 1.46316833512058,
  total: 71.57,
  variance: 109.6606011111111
}


What’s cool about this? Well, little effort, big result! It’s quite nice for exploring data sets in iex, for instance. Behind the scenes statistex reuses previously calculated values so that no value is calculated twice. For instance, first you get the sample_size and the total; both are then used to calculate the average. The average and sample_size are then passed on to calculate the variance, and so forth. This way statistex stays fast by not duplicating work when you want a bunch of statistical values (and benchee wants most of them).
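
To make that dependency chain concrete, here it is spelled out by hand (plain Elixir for illustration – this is not Statistex’s internal code):

samples = [1, 3.0, 2.35, 11.0, 1.37, 35, 5.5, 10, 0, 2.35]

# sample_size and total come first...
sample_size = length(samples)
total = Enum.sum(samples)

# ...both feed into the average...
average = total / sample_size

# ...which in turn (together with sample_size) feeds the sample variance
variance =
  samples
  |> Enum.map(fn sample -> :math.pow(sample - average, 2) end)
  |> Enum.sum()
  |> Kernel./(sample_size - 1)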

But what if you don’t want all of these values, yet would still like to reuse previously calculated ones? Got you covered!

Manually reuse previously calculated values

Say you want to calculate the variance but have already calculated the average and sample_size. Easy:


iex> Statistex.variance([4, 9, 11, 12, 17, 5, 8, 12, 12], sample_size: 9, average: 10.0)
16.0


Like variance/2, a lot of functions take an optional keyword list as an argument where you can provide previously calculated values (the options are of course documented).
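
The same pattern should apply elsewhere – for instance, assuming standard_deviation/2 accepts a precomputed :variance (option names differ per function, so check the docs). As √16.0 = 4.0, the variance isn’t recalculated:

iex> Statistex.standard_deviation([4, 9, 11, 12, 17, 5, 8, 12, 12], variance: 16.0)
4.0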

Raising on empty list input

Thanks to a discussion on the Elixir forum I made the decision to raise an ArgumentError when statistex is handed an empty list in most functions. Why? The statistics don’t make any sense at that point and it’s likely some kind of error anyway – you probably want to display something other than a bunch of nils.
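
A quick demonstration (the exact error message is elided here):

iex> Statistex.statistics([])
** (ArgumentError) ...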

I know many won’t consider this a feature, but I quite like the direction it pushes client code in.

Is this enough for a new library?

I think so. 🙂 Especially getting all values at once and reusing previously calculated values are incredibly valuable properties. I also like that statistex is solely focussed on statistics. And the statistics it can’t compute yet? We’ll catch up on those over time. Moreover, it’s not like it took a huge amount of work, as it was a simple extraction.

I’d be happy if you gave statistex a trial run and left some feedback.

 

Finding Bugs with Property Based Testing in a Statistics Calculation

I’ve always loved the idea of property based testing; it was the most mind-blowing idea I encountered while learning Clojure many years back. However, I always found it hard to apply in practice. I barely ever encountered the “easy” case where an operation is reversible (if I encode a term and then decode it again, the result should equal the original term, for instance). And the properties I came up with always seemed too loose to catch any bugs.

So I thought, let’s give it another shot. I have all these statistics functions in benchee/statistex – data should be easy to generate for them. We’ll surely not find a bug… or will we?

The first property based tests

If nothing else, I wanted to make sure that no matter what numbers you throw at my statistics module, it won’t blow up. To implement it I used stream_data:


check all samples <- list_of(float(), min_length: 1) do
  stats = statistics(samples)

  assert stats.sample_size >= 1

  assert stats.minimum <= stats.maximum
  assert stats.minimum <= stats.average
  assert stats.average <= stats.maximum
  assert stats.minimum <= stats.median
  assert stats.median <= stats.maximum
  assert stats.median == stats.percentiles[50]

  assert stats.standard_deviation >= 0
  assert stats.standard_deviation_ratio >= 0

  # property that mode occurs in the sample omitted for brevity
end
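
For completeness: to actually run, such a check needs a bit of scaffolding around it – a minimal sketch (the module and property names here are mine, not benchee’s actual test file):

defmodule StatistexPropertyTest do
  use ExUnit.Case, async: true
  # brings in the property and check all macros plus the stream_data generators
  use ExUnitProperties

  import Statistex

  property "statistics/1 holds basic invariants for any non-empty sample" do
    check all samples <- list_of(float(), min_length: 1) do
      stats = statistics(samples)
      assert stats.minimum <= stats.average
      assert stats.average <= stats.maximum
    end
  end
end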

That’s the property I came up with – our samples are any non-empty list of floats, and there’s a bunch of checks making sure the values sit between minimum and maximum or are at least 0. No way the tests are failing…

Wait, the tests are failing?!

Failed with generated values (after 2 successful runs):

    * Clause:    samples <- list_of(float(), min_length: 1)
      Generated: [-9.0, -1.0]

Assertion with >= failed
code:  assert stats.standard_deviation_ratio >= 0
left:  -1.131370849898476
right: 0

Honestly, I was shocked. On closer inspection, the standard deviation ratio was negative although I had asserted it should always be positive. As the generated sample contains only negative numbers, the average is negative as well. And since the ratio is calculated by dividing the standard deviation by the average, it came out negative. I usually only work with positive samples, hence this had never surfaced before. The ratio should always be positive regardless, so an abs/1 call fixed it.
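
Conceptually the fix is a one-liner (a sketch, not the actual diff):

# before: a negative average made the ratio negative
standard_deviation / average

# after: the ratio is a magnitude, so take the absolute value
abs(standard_deviation / average)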

Another property

Thinking about it some more, I came up with another property:


check all samples <- list_of(float(), min_length: 1) do
  percies = percentiles(samples, [25, 50, 75, 90, 99])

  assert percies[25] <= percies[50]
  assert percies[50] <= percies[75]
  assert percies[75] <= percies[90]
  assert percies[90] <= percies[99]
end

It’s much like the first property, just making sure the percentile values are in the order they should be. There is absolutely no possibility that this will fail… absolutely none… well-tested code… no chance it will fail…

IT FAILED AGAIN?!?!?!

Failed with generated values (after 4 successful runs):

    * Clause:    samples <- list_of(float(), min_length: 1)
      Generated: [1.0, 33.0]

Assertion with <= failed
code:  assert percies[25] <= percies[50]
left:  25.0
right: 17.0

Wait, the 25th percentile is bigger than the 50th percentile? No way that’s ok.

A lot of digging, googling and re-reading our original source for the percentile interpolation later, I figured out the problem. Interpolation for small sample sizes is hard, and also uncommon. We had missed a clause stated in the source, which points out that for too small a percentile relative to the sample size, the value is simply set to the minimum:

Note that any p ≤ 1/(N+1) will simply be set to the minimum value.

Our p was 0.25 (the 25th percentile) and 0.25 ≤ 1/3. Implementing this clause (through guard clauses) fixed the test.
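
To sketch the rule (an illustration only – this is not Statistex’s actual implementation, and the module and function names are made up for this post):

defmodule PercentileSketch do
  # Rank-based linear interpolation with rank = p/100 * (n + 1).
  # Assumes 0 < p < 100 and a non-empty sample list.
  def percentile(samples, p) when is_list(samples) and samples != [] do
    sorted = Enum.sort(samples)
    n = length(sorted)

    if p / 100 <= 1 / (n + 1) do
      # the clause we had missed: too small a p for this n -> minimum
      hd(sorted)
    else
      interpolate(sorted, p, n)
    end
  end

  defp interpolate(sorted, p, n) do
    rank = p / 100 * (n + 1)
    lower = trunc(rank)
    fraction = rank - lower
    lower_value = Enum.at(sorted, lower - 1)
    upper_value = Enum.at(sorted, min(lower, n - 1))
    lower_value + fraction * (upper_value - lower_value)
  end
end

# PercentileSketch.percentile([1.0, 33.0], 25) #=> 1.0 (the minimum)
# PercentileSketch.percentile([1.0, 33.0], 50) #=> 17.0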

You can check out the full implementation and bug fixes in the PR.

Learnings

The generation part was super easy in the cases shown. However, what’s impressive to me is that although the properties were very loosely defined, they still uncovered two bugs. And that’s in code that both I and many of you have been running in benchee for quite a while. Sure, these are very specific edge cases, but that’s what property based testing is good at: finding edge cases!

If you have other ideas for properties to check, I’m happy to listen and learn. And give property based testing a shot yourselves even with very loose properties – you might be surprised what you find.

 

Introducing Benchee: simple and extensible benchmarking for Elixir

If you look around this blog it becomes pretty clear that I really love (micro) benchmarking. Naturally, while working more and more with Elixir (and loving it!) I wanted to benchmark something. Sadly, the existing options I found didn’t quite satisfy me – be it due to a different focus, missing statistics, lacking documentation or other things. So I decided to roll my own; it’s not like it’d be the first time.

Of course I tried extending existing solutions first, but very long functions, very scarce test coverage, lots of dead and commented-out code and a rotting PR later, I decided it was time to create something new. So without further ado, please meet Benchee (of course available on hex)!

What’s great about Benchee?

Benchee is easy to use, well documented and can be extended (more on that in the following paragraphs). Benchee will run each benchmarking function you give it for a given amount of time and then compute statistics from the results. Statistics are where it shines, in my opinion. Benchee provides you with:

  • average run time (ok – yawn)
  • iterations per second, which is great for graphs etc. as higher is better here (as opposed to average run time) – see the quick sketch after this list
  • standard deviation, an important value in my opinion as it gives you a feeling for how certain you can be about your measurements and how much they vary. Sadly, none of the Elixir benchmarking tools I looked at supplied this value.
  • median, which is basically the middle value of your distribution and is often cited as reflecting the “common” outcome better than the average, as it cuts out outliers. I never used a (micro) benchmarking tool that provided this value, but was often asked to provide it in my benchmarks. So here it is!
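
Here’s that quick sketch of how iterations per second falls out of the average run time (plain Elixir with made-up numbers; run times are in microseconds, as that’s what Benchee measures in):

run_times = [762.29, 747.0, 801.5] # made-up run times in microseconds

average = Enum.sum(run_times) / length(run_times)

# one second is 1_000_000 microseconds, so:
iterations_per_second = 1_000_000 / average
# a lower average means a higher ips – hence “higher is better” for ips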

It also produces rather nice output on the console, with headers so you know what is what. An example is further down, but for now let’s talk design…

Designing a Benchmarking library

The design is influenced by my favourite Ruby benchmarking library: benchmark-ips. Of course I wanted to give it more of an Elixir-ish spin and offer more options.

A lot of Elixir solutions used macros. I wanted something that works purely with functions, no tricks. When I started learning more about functional programming, one of the things that stuck with me the most was that functional programming is about a series of transformations. So what do these transformations look like for benchmarking?

  1. Create a basic benchmarking configuration with things like how long should the benchmark run, should GC be enabled etc.
  2. Run individual benchmarks and record their raw execution times
  3. Compute statistics based on these raw run times per benchmark
  4. Format the statistics to be suitable for output
  5. Put out the formatted statistics to the console, a file or whatever

So what do you know, that’s exactly what the API of Benchee looks like!


list = Enum.to_list(1..10_000)
map_fun = fn(i) -> [i, i * i] end

Benchee.init(%{time: 3})
|> Benchee.benchmark("flat_map", fn -> Enum.flat_map(list, map_fun) end)
|> Benchee.benchmark("map.flatten",
                     fn -> list |> Enum.map(map_fun) |> List.flatten end)
|> Benchee.statistics
|> Benchee.Formatters.Console.format
|> IO.puts

What’s great about this? Well, it’s super flexible and flows nicely with the beloved Elixir pipe operator.

Why is this flexible and extensible? Well, don’t like how Benchee runs the benchmarks? Sub in your own benchmarking function! Want more/different statistics? Go compute your own! Want results displayed in a different format? Roll your own formatter! Or you just want to write the results to a file? Well, go ahead!

This is more than just cosmetics. It’d be easy to write a plugin that converts the results to some JSON format and then post them to a web service to gather benchmarking results or let it generate fancy graphs for you.
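
To give a flavour of what such a plugin could look like, here is a toy sketch (not a real plugin – the exact structure Benchee hands to formatters and the JSON encoder used are assumptions on my part):

defmodule MyJSONFormatter do
  # Toy example: turn the statistics step’s output into JSON.
  # Assumes the computed statistics live under a :statistics key and
  # that a JSON encoder (Poison here) is available as a dependency.
  def format(%{statistics: statistics}) do
    Poison.encode!(statistics)
  end
end

# Substituted into the pipeline in place of the console formatter:
# ... |> Benchee.statistics |> MyJSONFormatter.format |> IO.puts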

Of course, not everybody needs that flexibility, and some people might be scared away by the verbosity above. So there’s also a higher level interface that takes all the options you saw above and condenses them down to one function call to efficiently define your benchmarks:

list = Enum.to_list(1..10_000)
map_fun = fn(i) -> [i, i * i] end

Benchee.run(%{time: 3},
             [{"flat_map", fn -> Enum.flat_map(list, map_fun) end},
              {"map.flatten",
              fn -> list |> Enum.map(map_fun) |> List.flatten end}])

Let’s see some results!

You’ve seen two different ways to run the same benchmark with Benchee now, so what’s the result and what does it look like? Well here you go:

tobi@happy ~/github/benchee $ mix run samples/run.exs
Benchmarking flat_map...
Benchmarking map.flatten...

Name                          ips            average        deviation      median
map.flatten                   1311.84        762.29μs       (±13.77%)      747.0μs
flat_map                      896.17         1115.86μs      (±9.54%)       1136.0μs

Comparison:
map.flatten                   1311.84
flat_map                      896.17          - 1.46x slower

So what do you know, much to my own surprise calling map first and then flattening the result is significantly faster than a one-pass flat_map. That’s unlike Ruby, where flat_map is over twice as fast in the same scenario. So what does that tell us? Well, what we think we know about performance from other programming languages might not hold true. Also, there might be a bug in flat_map – it should be faster, for all I know. I need some time to investigate 🙂

All that aside, wouldn’t a graph be nice? That’s a feature I envy benchfella for. But wait, we’ve got this whole extensible architecture, right? Generating a whole graph myself, error margins and all, might be a bit tough, though. But I have LibreOffice on my machine, so a way to quickly feed my results into it would be great.

Meet BencheeCSV (the first and so far only Benchee plugin)! With it we can substitute the formatting and output steps to generate a CSV file to be consumed by a spreadsheet tool of our choice:

file = File.open!("test.csv", [:write])
list = Enum.to_list(1..10_000)
map_fun = fn(i) -> [i, i * i] end

Benchee.init
|> Benchee.benchmark("flat_map", fn -> Enum.flat_map(list, map_fun) end)
|> Benchee.benchmark("map.flatten",
                     fn -> list |> Enum.map(map_fun) |> List.flatten end)
|> Benchee.statistics
|> Benchee.Formatters.CSV.format
|> Enum.each(fn(row) -> IO.write(file, row) end)

And a couple of clicks later there is a graph including error margins:

[Graph charted from the CSV in LibreOffice: average run times of flat_map vs. map.flatten, including error margins]

How do I get it?

Well, just add benchee or benchee_csv to the deps of your mix.exs!

def deps do
  [{:benchee, "~> 0.1.0", only: :dev}]
end
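
If you want the CSV plugin as well, it’s just one more entry (the benchee_csv version requirement below is illustrative – check hex for the current one):

def deps do
  [{:benchee, "~> 0.1.0", only: :dev},
   # version requirement illustrative – check hex for the current release
   {:benchee_csv, "~> 0.1.0", only: :dev}]
end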

Then run mix deps.get, create a benchmarking folder and write your new my_benchmark.exs! More information can be found in the online documentation or at the GitHub repository.

Anything else?

Well, Benchee tries to help you – that’s why, when you try to micro benchmark an extremely fast function, you might happen upon this beauty of a warning:

Warning: The function you are trying to benchmark is super fast, making time measures unreliable!
Benchee won’t measure individual runs but rather run it a couple of times and report the average back. Measures will still be correct, but the overhead of running it n times goes into the measurement. Also statistical results aren’t as good, as they are based on averages now. If possible, increase the input size so that an individual run takes more than 10μs

The reason why I put it there is pretty well explained in the warning itself: the measurements would simply be unreliable, as randomness and the measuring itself have too big an impact. Plus, measurements are in microseconds – so they’re not that accurate either. I tried nanoseconds but quickly discarded them, as that seemed to add even more overhead.

Benchee then tries to run your benchmark n times and measures that; while this improves the situation somewhat, it adds the overhead of my repeat_n function to the measurement.
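
Conceptually, repeat_n is just a recursive loop (a sketch of the idea rather than the exact source):

defmodule Repeater do
  # call the function n times purely for its side effects/timing
  def repeat_n(_function, 0), do: :ok

  def repeat_n(function, count) when count > 0 do
    function.()
    repeat_n(function, count - 1)
  end
end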

So if you can, please benchmark with higher values 🙂

Ideas for the future?

Benchee is just at version 0.1.0, but a lot of work, features and thought have already gone into it. Here are features that I thought about but decided weren’t necessary for a first release:

  • Turning off/reducing garbage collection: Especially micro benchmarking can be affected by garbage collection, as single runs will be much slower than the others, leading to a skyrocketing standard deviation and unreliable measures. Sadly, to the best of my knowledge, one can’t turn off GC on the BEAM. But people have shown me options where I could set a very high memory space to reduce the chance of GC. I need to play with it.
  • Auto scaling units: It’d be nice to for instance show the average time in milliseconds if a benchmark is slower or write something to the effect of “80.9 Million” iterations per second for the console output for a fast benchmark.
  • Better alignment for console output. Right now it’s left aligned, I think right alignment looks better and helps compare results.
  • Making sure Benchee is also usable for more macro benchmarks, e.g. functions that run in the matter of seconds or even minutes
  • Related to that, also provide the option to specify a warmup time. Elixir/Erlang isn’t JITed, so it should have no impact there, but for macro benchmarks against Phoenix or so, with a database involved, it should have an impact.
  • Give measuring memory consumption a shot
  • More statistics: Anything you are missing, wishing for?
  • Graph generation: A plugin to generate and share a graph right away would be nice
  • Configurable steps in Benchee.run: Right now if you want to use a plugin you have to use the more “verbose” API of Benchee. If Benchee gains traction and plugins really become a thing it’d be nice to configure them in the high level API like %{formatter: MyFormatModule} or %{formatter: MyFormatModule.format/1}.

So that’s it – have anything you’d like to see in Benchee? Please get in touch and let me know! In any case, give Benchee a try and happy benchmarking!