What’s the Fastest Data Structure to Implement a Game Board in Elixir?

Ever wanted to implement something board-game-like in Elixir? Chess? Go? Islands? Well, then you’re gonna need a board!

But what data structure would be the most efficient one to use in Elixir? Conventional wisdom in a lot of programming languages is to use some sort of array. However, most programming languages with immutable data structures don’t have a “real” array data structure (we’ll talk about Erlang’s array later; it’s not really like the arrays in non-functional languages). Elixir is one of those languages.

As I like board games this was one of the first questions I ever asked the community. It’s also an interesting and relatable example to see and understand the performance trade-offs of different data structures.

Complete sources can be found in my elixir_boards_benchmark repo.

Benchmark Design

For this benchmark I didn’t have a very specific board game in mind, so I settled for a board size of 9×9. It’s a bit bigger than a normal chess board (8×8), it’s exactly the size of the smallest “normal” Go board, and it’s one smaller than the board used in Islands as implemented in Functional Web Development with Elixir, OTP and Phoenix, so it seemed like a good compromise. Different sizes are likely to show different performance characteristics.

Without a concrete usage scenario in mind I settled on a couple of different benchmarks:

  • getting a value at the coordinates (0, 0), (4, 4) and (8, 8). This is a fairly nano/micro benchmark for data access and provides a good balance of values at the beginning/middle/end when thinking in list terms.
  • setting a value at the coordinates (0, 0), (4, 4) and (8, 8).
  • a still nano/micro benchmark that combines the two previous benchmarks by getting and setting all three mentioned values. I call this “mixed bag”.
  • Why stop there? The last benchmark just sets and gets every possible coordinate once (first it sets (0, 0) then gets it, then it sets (0, 1), then gets it, and so forth). This also simulates the board filling up, which can be important for some data structures. Completely filling a board is unrealistic for most board games, however, as most games finish before this stage. This one is called “getting and setting full board”.
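The “getting and setting full board” benchmark can be sketched roughly like this. This is my own simplified reconstruction, not the repo’s actual benchmarking code (that uses benchee); the inline `StandInBoard` module is just a stand-in so the sketch runs on its own:

```elixir
# Minimal stand-in board so the sketch is self-contained:
# a coordinates tuple as the map key, like the MapTuple contender below.
defmodule StandInBoard do
  def new, do: %{}
  def get(board, x, y), do: board[{x, y}]
  def set(board, x, y, value), do: Map.put(board, {x, y}, value)
end

defmodule FullBoardSketch do
  # Set (x, y), then immediately read it back, for every coordinate.
  def run(board_module) do
    Enum.reduce(0..8, board_module.new(), fn x, board ->
      Enum.reduce(0..8, board, fn y, board ->
        board = board_module.set(board, x, y, {x, y})
        # the pin match asserts the read-back value is what we just set
        {^x, ^y} = board_module.get(board, x, y)
        board
      end)
    end)
  end
end
```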

Something that is notably not benchmarked is the creation of boards. For (almost) all of the board implementations creation could resolve to a constant value, so the time it takes should be similar across implementations. I wasn’t overly interested in that property, and I didn’t want to make the code less readable by inlining the constant after creation when I didn’t need to.

Also noteworthy is that these benchmarks mostly treat reading and writing equally, while in my experience most AIs/bots are much more read-heavy than write-heavy.

Take all these caveats of the benchmark design into consideration when looking at the results, and if in doubt it’s of course best to write your own benchmark taking into account the concrete usage patterns of your domain.

Without further ado, let’s look at the different implementations I have benchmarked so far:

Contenders

All boards need to implement a simple Board behaviour:


defmodule Board do
  # can't be more specific with types as each implementation has its own representation
  @type board :: any
  @type field :: any

  @callback new() :: board
  @callback get(board, non_neg_integer, non_neg_integer) :: field
  @callback set(board, non_neg_integer, non_neg_integer, field) :: board
end


All boards are built so that accessing a previously unset field will return nil. No assumptions about the data stored in the board have been made, which rules out String as an implementation type. In the benchmarks atoms are used as values.

In the descriptions of the data types below (x, y) is used to mark where what value is stored.

  • List2D: A 2 dimensional list representing rows and columns: [[(0, 0), (0, 1), (0, 2), ...], [(1, 0), (1, 1), ..], ..., [..., (8, 8)]]
  • List1D: Using the knowledge of a constant board size you can encode it into a one-dimensional list resolving the index as dimension * x + y: [(0, 0), (0, 1), (0, 2), ..., (1, 0), (1, 1), ..., (8, 8)]
  • Tuple2D: Basically like List2D but with tuples instead of lists: {{(0, 0), (0, 1), (0, 2), ...}, {(1, 0), (1, 1), ..}, ..., {..., (8, 8)}}
  • Tuple1D: Basically like List1D but with a tuple instead of a list: {(0, 0), (0, 1), (0, 2), ..., (1, 0), (1, 1),... (8, 8)}
  • Array2D: Erlang arrays aren’t exactly a common sight; even Learn You Some Erlang basically skims over them and says to be cautious when using them. I even forgot about them for the first version of this post 😅. They internally map to tuple usage in an interesting way that will be discussed/illustrated further below. With that out of the way, conceptually this is much like Tuple2D.
  • Array1D: see above for the data structure in general, otherwise conceptually like Tuple1D.
  • MapTuple: A map that takes the tuple of the coordinates {x, y} as the key, with the value being whatever is on the board: %{{0, 0} => (0, 0), {0, 1} => (0, 1), ..., {8, 8} => (8, 8)}. It’s a bit unfair compared to the others shown so far, as it can start with an empty map, which is not only a much smaller data structure but usually also faster to retrieve values from. As the benchmarks start with an empty board that’s a massive advantage, so I also included a full map in the benchmark, see next.
  • MapTupleFull: Basically the same as above but initialized to already hold all key value pairs initialized as nil. Serves not only the purpose to see how this performs but also to see how MapTuple performs once it has “filled up”.
  • MapTupleHalfFull: Only looking at completely full and completely empty performance didn’t seem good either, so I added another one initialized from 0 to 4 on all columns (a bit more than half the board, totalling 45 key/value pairs).
  • MapTupleQuarterFull: Another one of these, this time with 27 key/value pairs. Why? Because there is an interesting performance characteristic, read on to find out 🙂
  • Map2D: Akin to List2D etc. a map of maps: %{0 => %{0 => (0, 0), 1 => (0, 1), ...}, 1 => %{0 => (1, 0), ...}, ..., 8 => %{..., 8 => (8, 8)}}
  • ETSSet: erlang ETS storage with table type set. Storage layout wise it’s basically the same as MapTuple, with a tuple of coordinates pointing at the stored value.
  • ETSOrderedSet: Same as above but with table type ordered_set.
  • ProcessDictionary: On a special request for Michał 😉 This is probably not a great default variant, as you’re practically creating (process-)global state, which means you can’t have two boards within the same process without causing mayhem. It might also accidentally conflict with other code using the process dictionary. Still, it might be worth considering if you always run a board in its own process.
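To make a couple of the contenders concrete, here are sketches of Tuple2D and List1D. These are my own reconstructions of the ideas described above (the repo’s actual modules may differ in details); Tuple2D works with elem/put_elem, List1D with the `dimension * x + y` index arithmetic:

```elixir
defmodule Tuple2DSketch do
  @dimension 9

  # a tuple of 9 row tuples, everything initialized to nil
  def new, do: Tuple.duplicate(Tuple.duplicate(nil, @dimension), @dimension)

  def get(board, x, y), do: board |> elem(x) |> elem(y)

  # update the inner row tuple, then put the new row into the outer tuple
  def set(board, x, y, value) do
    row = board |> elem(x) |> put_elem(y, value)
    put_elem(board, x, row)
  end
end

defmodule List1DSketch do
  @dimension 9

  # one flat list of 81 nils; coordinates resolve to dimension * x + y
  def new, do: List.duplicate(nil, @dimension * @dimension)

  def get(board, x, y), do: Enum.at(board, @dimension * x + y)

  def set(board, x, y, value),
    do: List.replace_at(board, @dimension * x + y, value)
end
```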

It’s important to point out that all mentioned data types except for ETS and the process dictionary are immutable. This means that, especially for those two, a new board is created in a before_each hook (which does not count towards the measured time) to avoid “contamination”.

Another notable absence (besides String, for the aforementioned reasons) is Record. Records are internally represented as tuples but give you the key/value access of maps; however, in Elixir it is more common to use structs (which are backed by maps). As both maps and tuples are already present in the benchmark, including records likely wouldn’t lead to new insights.

System Setup

Operating System Linux
CPU Information Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Number of Available Cores 8
Available Memory 15.61 GB
Elixir Version 1.8.2
Erlang Version 22.0

Benchmarking Results

Benchmarks of course were run with benchee and the benchmarking script is here (nothing too fancy).

You can check them out in the repo as markdown (thanks to benchee_markdown) or HTML reports (benchee_html). Careful though if you’re on mobile: some of the HTML reports contain the raw measurements and hence go up to 9MB in size, and they can take a while to load, also due to the JS drawing the graphs!

The results of getting and setting full board:

[runtime_final.png – getting and setting full board, iterations per second (higher is better)]

It’s a tight race at the top when it comes to run time! Tuple1D, Tuple2D and MapTuple are all within striking range of each other, and then there’s a sharp fall-off.

Also there is a fair bit of variance involved as shown by the black “whiskers” (this is usual for benchmarks that finish in nanoseconds or microseconds because of garbage collection, interference etc.). Which one of these is best? To get a better picture let’s look at the whole table of results:

Name IPS Average Deviation Median Mode Minimum Maximum
Tuple1D 133.95 K 7.47 μs ±23.29% 6.93 μs 6.88 μs 6.72 μs 492.37 μs
Tuple2D 132.16 K 7.57 μs ±29.17% 7.21 μs 7.16 μs 7.03 μs 683.60 μs
MapTuple 126.54 K 7.90 μs ±25.69% 7.59 μs 7.56 μs 7.43 μs 537.56 μs
ProcessDictionary 64.68 K 15.46 μs ±14.61% 15.12 μs 15.05 μs 14.89 μs 382.73 μs
ETSSet 60.35 K 16.57 μs ±9.17% 16.04 μs 15.95 μs 15.79 μs 161.51 μs
Array2D 56.76 K 17.62 μs ±17.45% 17.15 μs 17.04 μs 16.54 μs 743.46 μs
MapTupleFull 55.44 K 18.04 μs ±11.00% 16.92 μs 16.59 μs 16.43 μs 141.19 μs
MapTupleHalfFull 53.70 K 18.62 μs ±8.36% 17.96 μs 17.87 μs 17.67 μs 160.86 μs
Array1D 50.74 K 19.71 μs ±10.60% 19.29 μs 18.99 μs 18.81 μs 469.97 μs
ETSOrderedSet 39.53 K 25.30 μs ±10.51% 24.82 μs 24.57 μs 24.34 μs 390.32 μs
Map2D 36.24 K 27.59 μs ±8.32% 27.71 μs 25.90 μs 25.12 μs 179.98 μs
List2D 29.65 K 33.73 μs ±4.12% 33.31 μs 33.04 μs 31.66 μs 218.55 μs
MapTupleQuarterFull 28.23 K 35.42 μs ±3.86% 34.96 μs 34.61 μs 34.39 μs 189.68 μs
List1D 15.41 K 64.90 μs ±2.84% 64.91 μs 64.14 μs 62.41 μs 175.26 μs

Median and mode are good values to look at when unsure what is usually fastest. They are the “middle value” and the most common value respectively; as such they are much less likely to be impacted by outliers (garbage collection and such). These seem to reinforce that Tuple1D is really the fastest, if by a negligible margin.

MapTuple is very fast, but its sibling MapTupleFull, which already starts “full”, is more than 2 times slower. Whether this is significant for you depends on whether you start with a truly empty board (Go starts with an empty board, for instance, while chess doesn’t).

Somewhat expectedly, List1D does worst: to get values towards the end of the list it has to traverse almost the entire list, which is incredibly slow.

As an aside, it’s easy to see in the box plot that the high deviation is mainly caused by some very big outliers:

[boxplot_final.png – Boxplot of getting and setting full board, dots are outliers]

The dots denote outliers, and they are so big (but few) that the rest of the chart is almost unreadable: all that remains of the actual box is practically a thick line.

What about memory consumption?

[memory_final.png – getting and setting full board, memory usage (lower is better)]

Here we can see the immediate drawback of Tuple1D: its memory consumption is many times worse than that of the others. My (educated) guess is that this is because it has to replace/copy the whole tuple with its 9 * 9 = 81 values for every update operation. Tuple2D is much more economical here: to the best of my understanding it only needs to update the tuple holding the columns and the one holding the specific column we’re updating (2 * 9 = 18 values).

Big Tuples like this are relatively uncommon in “the real world” in my experience though as their fixed size nature makes them inapplicable for a lot of cases. Luckily, our case isn’t one of them.

MapTuple does amazingly well overall. It’s probably the structure quite some people would intuitively have reached for, for good constant-time access speed, and its memory consumption is also impressively low.

ProcessDictionary is very memory efficient and also constantly in the top 4 when it comes to run time. However, at least run time wise, there’s quite a margin (~15 μs vs. ~7 μs), which doesn’t seem to make the risks worth it overall.

Other Observations

Let’s take a look at some other things that seem noteworthy:

ETS isn’t the winner

This surprised me a bit (however, I haven’t used ETS much). ETS was always tagged as the go-to option for performance in my mind. Looking at the docs and use cases, it makes sense though: we’re likely to see benefits for much larger data sets, while ours is relatively small:

These (ETS) provide the ability to store very large quantities of data in an Erlang runtime system, and to have constant access time to the data.

81 values hardly qualifies as “very large”.

Don’t blindly follow conventional “wisdom” – always benchmark! 💪

get(0,0) vs. get(8,8)

Let’s have a look at the time it takes to retrieve a value, usually a much more common operation than writing:

get(0,0)

Name IPS Average Deviation Median Mode Minimum Maximum
Tuple1D 44.12 M 22.66 ns ±842.77% 20 ns 20 ns 9 ns 35101 ns
Tuple2D 42.46 M 23.55 ns ±846.67% 20 ns 19 ns 7 ns 36475 ns
Array1D 30.38 M 32.92 ns ±84.61% 32 ns 32 ns 20 ns 8945 ns
MapTuple 29.09 M 34.38 ns ±111.15% 32 ns 31 ns 19 ns 10100 ns
MapTupleQuarterFull 18.86 M 53.03 ns ±37.27% 50 ns 49 ns 38 ns 2579 ns
Array2D 18.62 M 53.70 ns ±67.02% 50 ns 49 ns 34 ns 10278 ns
List1D 18.26 M 54.75 ns ±56.06% 53 ns 52 ns 42 ns 8358 ns
ProcessDictionary 17.19 M 58.18 ns ±1393.09% 52 ns 51 ns 39 ns 403837 ns
Map2D 15.79 M 63.34 ns ±25.86% 60 ns 54 ns 41 ns 388 ns
MapTupleHalfFull 10.54 M 94.87 ns ±27.72% 91 ns 89 ns 76 ns 2088 ns
MapTupleFull 10.29 M 97.16 ns ±18.01% 93 ns 89 ns 70 ns 448 ns
ETSSet 9.74 M 102.63 ns ±26.57% 100 ns 99 ns 78 ns 2629 ns
List2D 9.04 M 110.57 ns ±69.64% 105 ns 109 ns 82 ns 4597 ns
ETSOrderedSet 6.47 M 154.65 ns ±19.27% 152 ns 149 ns 118 ns 1159 ns

get(8, 8)

Name IPS Average Deviation Median Mode Minimum Maximum
Tuple2D 42.47 M 23.55 ns ±788.60% 21 ns 20 ns 7 ns 33885 ns
Tuple1D 40.98 M 24.40 ns ±725.07% 22 ns 21 ns 10 ns 34998 ns
Array1D 29.67 M 33.70 ns ±161.51% 33 ns 32 ns 21 ns 18301 ns
MapTuple 28.54 M 35.03 ns ±986.95% 32 ns 32 ns 20 ns 230336 ns
ProcessDictionary 19.71 M 50.73 ns ±1279.45% 47 ns 47 ns 34 ns 377279 ns
Array2D 17.88 M 55.92 ns ±85.10% 52 ns 51 ns 35 ns 13720 ns
Map2D 13.28 M 75.31 ns ±32.34% 73 ns 65 ns 56 ns 2259 ns
MapTupleHalfFull 12.12 M 82.53 ns ±31.49% 80 ns 80 ns 60 ns 1959 ns
ETSSet 9.90 M 101.05 ns ±16.04% 99 ns 95 ns 78 ns 701 ns
MapTupleFull 9.85 M 101.53 ns ±19.29% 99 ns 90 ns 70 ns 487 ns
ETSOrderedSet 5.59 M 178.80 ns ±41.70% 169 ns 170 ns 135 ns 4970 ns
MapTupleQuarterFull 4.09 M 244.65 ns ±16.85% 242 ns 240 ns 226 ns 9192 ns
List2D 3.76 M 265.82 ns ±35.71% 251 ns 250 ns 231 ns 9085 ns
List1D 1.38 M 724.35 ns ±10.88% 715 ns 710 ns 699 ns 9676 ns

The top 3 remain relatively unchanged. What is very illustrative to look at, though, are List1D and List2D. For get(0, 0), List1D vastly outperforms its 2D sibling, even coming closest to the top group. That is easy to explain: it basically translates to looking at the first element of the list, which is very fast for a linked list. However, looking at the last element is very slow, and that is what get(8, 8) translates to: all elements have to be traversed until the end is reached. As a result, the whole thing is roughly 13 times slower for List1D. List2D is still very slow, but thanks to its 2-dimensional structure it only needs to look at 18 elements instead of 81.

MapTuple vs. MapTupleQuarterFull vs. MapTupleHalfFull vs. MapTupleFull

In most scenarios, including the biggest scenario, MapTupleQuarterFull performs worse than MapTuple (expected), MapTupleHalfFull (unexpected) and MapTupleFull (unexpected). I had expected its performance to be worse than MapTuple but better than MapTupleFull and MapTupleHalfFull. Why is that?

I had no idea, but Johanna had one: it might have to do with the “magic” limit at which a map “really” becomes a map and not just a list that is linearly searched. That limit is defined as 32 entries in the Erlang source code (link also provided by Johanna). Our quarter-full implementation is below that limit (27 entries) and hence often shows performance characteristics more akin to List1D (see the good get(0, 0) performance but bad get(8, 8) performance) than those of its “real” map cousins.

To the best of my understanding, this “switch the implementation at size 32” is a performance optimization. With such a small data set a linear search often performs better than the overhead introduced by hashing, lookups etc. You can also see that the trade-off pays off: in the big benchmark, where the whole board is filled incrementally, MapTuple (which starts empty and grows) still provides top performance.

What I still don’t fully understand is that sometimes MapTupleFull seems to outperform MapTupleHalfFull, if only by a very negligible margin (most notably in the “big” getting and setting full board benchmark). The difference is so small, however, that I don’t believe it warrants further investigation (unless you have an idea, of course).

Performance difference of Array vs. Tuple

In the introduction I said arrays are backed by tuples – so how come their performance is so much worse? Well, let’s have a look at what an array actually looks like:

iex(3)> mine = :array.new(81, default: nil)
{:array, 81, 0, nil, 100}
iex(4)> :array.set(13, :boom, mine)
{:array, 81, 0, nil,
{10, {nil, nil, nil, :boom, nil, nil, nil, nil, nil, nil}, 10, 10, 10, 10, 10,
10, 10, 10, 10}}

It cleverly doesn’t even initialize all the fields, but uses a kind of run-length encoding saying “the value is the default value nil for the next 100 fields”, while also saving its set size limit of 81 (fun fact: these arrays can be configured to grow dynamically!).
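That dynamic growth is easy to see for yourself; a quick sketch using only documented functions of the :array module:

```elixir
# Without a fixed size the array is extensible: setting an index
# beyond the current size grows the array automatically.
a = :array.new(default: nil)
a = :array.set(99, :far_out, a)

size = :array.size(a)     # => 100, grown to cover index 99
value = :array.get(99, a) # => :far_out
```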

Once we set a value (at index 13) the representation changes: there is still some length encoding (“there is nothing here for the first 10 entries”), but the indexes 10..19 are now expanded into a whole tuple holding our value. So, to the best of my understanding, arrays work by adding “stretches” of tuples of size 10 as needed.

In general this is a performance optimization, especially making writes/updates faster compared to huge tuples, as mainly the 10-tuple holding the concrete value needs to be updated instead of the whole thing.

However, our custom tuple implementations are perfectly sized to begin with and not too huge. Moreover, their size being fixed up front probably enables some optimizations (or so I believe). Hence the tuple implementations outperform arrays, while arrays still don’t do too shabbily (especially for read access) compared to the other implementations.

Conclusion

Tuples can be very good for the use case of collections whose size is known up front and that need fast access, and a simple flat map performs amazingly well. At least that holds for the relatively small board size (9×9 = 81 fields) benchmarked here. There is a big caveat for the map though: it is only this fast if we can start with an empty map and grow it as new pieces are set. The completely initialized map (MapTupleFull) performs way worse; tuples are the clear winners then.

Missing a data structure? Please do a PR! There’s a behaviour to implement and then just two lists to add your module name to – more details.

Update 1 (2019-06-17): Fixed MapTupleHalfFull. Before the update it was actually just quarter full 😅, which has wildly different performance characteristics for reasons now described along with the MapTupleQuarterFull implementation. Thanks go to Johanna for pointing that out. The process dictionary has also been added as another possible implementation on a suggestion from Michał 😉. Also added a run time box plot to show outliers more clearly and visually.

Update 2 (2019-06-18): Added and investigated Arrays thanks to /u/Hauleth over on reddit. Also added a remark about records thanks to /u/friendlysock over on lobste.rs.

Choosing Elixir for the Code, not the Performance

People like to argue about programming languages: “This one is better!” “No, this one!”. In these discussions, the performance card is often pulled: this language is that much faster in these benchmarks, or this company now only needs that many servers. Performance shouldn’t matter that much in my opinion, and Nate Berkopec makes a good point about that in his blog post “Is Ruby too slow for web scale?” (TL;DR: we can add more servers, and developer time often costs more than servers):

The better conversation, the more meaningful and impactful one, is which framework helps me write software faster, with more quality, and with more happiness.

I agree with lots of the points Nate makes and I like him and the post, but it still rubbed me the wrong way a bit. While it also states the above, it makes it seem like people switch languages just for the performance gains. And that brought up a topic that has been bugging me for a while: if you’re switching your main language purely for performance, there’s a high chance you’re doing it wrong. Honestly, if all we cared about was performance we’d all still be writing Assembly, or at least C/C++.

It’s also true that performance is often hyped a lot around new languages, and Elixir specifically can be guilty of that too. And sure, performance is great, and we read some amazing stories about it. The two most prominent adoption stories I can recall are usually cited and referred to for their great performance numbers: there is Pinterest (“our API responses are in microseconds now”) and there is Bleacher Report (“we went from 150 servers to 5”). If you re-read the articles though, other benefits of Elixir are mentioned as well, or are even discussed more than performance.

The Pinterest article focuses first on Elixir as a good language; performance is arguably secondary in the post. Before we ever talk about microseconds there is lots of talk like this:

The language makes heavy use of pattern matching, a technique that prevents *value* errors which are much more common than *type* errors. It also has an innovative pipelining operator, which allows data to flow from one function to the next in a clear and easy to read fashion.

Then the famous microseconds line drops in one paragraph, and then the article immediately turns around and talks about code clarity again:

We’ve also seen an improvement in code clarity. We’re converting our notifications system from Java to Elixir. The Java version used an Actor system and weighed in at around 10,000 lines of code. The new Elixir system has shrunk this to around 1000 lines.

The Bleacher Report article has performance in its headline and at its heart, but it also mentions different benefits of Elixir:

The new language has led to cleaner code base and much less technical debt, according to Marx. It has also increased the speed of development(…)

So why do people rave about performance so much? Performance numbers are “objective”, they are “rational”, they are impressive, and people like that. It’s an easy argument to make that Bleacher Report uses just 5 servers instead of the 150 in the old Ruby stack. That’s a fact. It’s easy to remember and easy to put into a headline. Discussing the advantages of immutable data structures, pattern matching and the “let it crash” philosophy is much more subjective, personal and nuanced.

Before we jump in: this blog post is general, but some specific points might resonate best with a Ruby crowd, as Ruby is my main programming language and where I’m coming from. Readers coming from other languages may react to some of my points with “meh, I already have this”, while I might miss obvious cool things both Ruby and Elixir have.

Hence, after this lengthy introduction, let’s focus on something different: what makes Elixir a language worth learning, and how can it make day-to-day coding more productive, performance aside?

Let’s get some performance stuff out of the way first…

(The irony of starting the first section of a blog post decisively not about performance by discussing performance is not lost on me)

First, I wanna touch on the topic of performance again really quickly: does it really not matter? Can we seamlessly scale horizontally? Does performance not impact productivity?

Well it certainly does, as remarked by Devon:

In a more general sense, if my runtime is already fast enough I don’t need to bother with more complex algorithms and extra concepts. I can just leave it as is. No extra engineering spent on “making it faster” – just on to the next. That’s a good thing. Especially caching can be just wonderful to debug.

What about the performance of big tasks? Like data processing, or, in the case of the company I’m working for, solving a vehicle routing problem? You can’t just scale those up by throwing servers at them. You might be able to parallelize them, but that’s not too easy in Ruby and in general is often a bigger engineering effort. Some languages make that easier, too, like Elixir’s Flow.

Horizontal scaling has its limits there. It works fine for serving more web requests or working on more background jobs, but it gets more complicated when you have a big problem to solve that isn’t easily parallelizable, especially if it needs to be done within a given time frame.

Also, I might not be one of the cool Docker + Kubernetes kids, but if you tell me there’s no overhead to managing 125 servers versus 5 servers, I tend not to believe it, if only because the chance of any one of your servers failing at any time is much bigger simply because you have more of them.

Well then, that’s finally enough performance chatter for a post not about performance. Let’s look at the code and how it can make your life easier! I swear I’ll try to keep these sections short and sweet, although admittedly that’s not exactly my strength (who would have guessed by now? 😉)

Pattern Matching


defmodule Patterns do
  def greet(%{name: name, age: age}) do
    IO.puts "Hi there #{name}, what's up at #{age}?"
  end

  def greet(%{name: "José Valim"}) do
    IO.puts "Hi José, thanks for elixir! <3"
  end

  def greet(%{name: name}) do
    IO.puts "Hi there #{name}"
  end

  def greet(_) do
    IO.puts "Hi"
  end
end

Patterns.greet %{name: "Tobi", age: 27} # Hi there Tobi, what's up at 27?
Patterns.greet %{name: "José Valim"} # Hi José, thanks for elixir! ❤
Patterns.greet %{name: "dear Reader"} # Hi there dear Reader
Patterns.greet ["Mop"] # Hi


Pattern Matching is my single favorite feature. If I could pick a single feature to be adopted in other programming languages it would be pattern matching. I find myself writing pattern matching code in Ruby, then sighing… “Ugh right I can’t do this”. It changed the way I think.

Enough “this is soo great”. With pattern matching you basically make assertions on the structure of the data and can pull values out of a deeply nested map directly into variables. It runs deeper than that though: you also get multiple function clauses, and Elixir will try to match them from top to bottom, which means you can have different function definitions based on the structure of your input data.

It’s not just maps, though. You can use it on lists as well, so you can have a separate function clause for an empty or one-element list, which is really great for recursion and catching edge cases.
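For instance, such list clauses can look like this (a toy example of mine, not from any particular library):

```elixir
defmodule ListPatterns do
  # one clause per list shape, matched from top to bottom
  def describe([]), do: "empty"
  def describe([_only_element]), do: "exactly one element"
  def describe([head | _tail]), do: "starts with #{inspect(head)}"
end

ListPatterns.describe([])        # "empty"
ListPatterns.describe([:a])      # "exactly one element"
ListPatterns.describe([1, 2, 3]) # "starts with 1"
```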

One of the most fascinating uses I’ve seen was for parsing files: you can also use pattern matching on binaries, and so can separate the data and the different headers of MP3 files in just a couple of lines of Elixir:


mp3_byte_size = byte_size(binary) - 128
<< _ :: binary-size(mp3_byte_size), id3_tag :: binary >> = binary

<< "TAG",
   title :: binary-size(30),
   artist :: binary-size(30),
   album :: binary-size(30),
   year :: binary-size(4),
   comment :: binary-size(30),
   _rest :: binary >> = id3_tag


Immutable Data Structures and Pure Functions

If you’re unfamiliar with immutable data structures you might wonder how the hell one ever gets anything done. Well, you have to rebind variables to the return values of functions if you want any sort of change. You get pure functions, which means no side effects: the only thing that “happens” is the return value of the function. But how does that help anyone?

Well, it means you have all your dependencies and their effects right there; there is no hidden state on which execution could depend. Everything the function depends on is a parameter. That makes for superior understandability, debugging experience and testing.
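A tiny illustration of that: any “update” to a data structure only exists in the return value, the original stays as it was.

```elixir
scores = %{anna: 1}
updated = Map.put(scores, :bert, 2)

# the original is untouched; the "change" only exists in the return value
scores  # => %{anna: 1}
updated # => %{anna: 1, bert: 2}
```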

A few months into my Elixir journey I noticed that I was seemingly much better at debugging library code than I was in Ruby. The reason, I believe, is the above. When I debug something in Ruby, what a method does often depends on one or more instance variables. So, if you wanna understand why that method “misbehaves”, you gotta figure out which code sets those instance variables, which might depend on other instance variables being set, and so on… Similarly, a method might have the side effect of changing some instance variable. What is the effect in the end? You might never know.

With pure functions I can see all the dependencies of a function at a glance, I can see what they return and how that new return value is used in further function calls. It reads more like a straight up book and less like an interconnected net where I might not know where to start or stop looking.

The Pipeline Operator


config
|> Benchee.init
|> Benchee.system
|> Benchee.benchmark("job", fn -> magic end)
|> Benchee.measure
|> Benchee.statistics
|> Benchee.Formatters.Console.output
|> Benchee.Formatters.HTML.output


How does a simple operator make it into this list?

Well, it’s about the code that it leads you to. The pipeline operator passes the value of the previous expression into the next function as the first argument. That gives you a couple of guidelines. First, when determining the order of arguments, think about which one is the main data structure and put that one first. Secondly, it leads you to a design with one main data structure per chain of functions, which can turn out really nice.

The above is an actual interface to my benchmarking library benchee. One of the design goals was to be “pipeable” in Elixir. This led me to a design with a main Suite data structure in which all the important information is stored. As a result, implementing formatters is super easy: they are just functions that take the suite and pick out the information they care about. Moreover, each and every one of those steps is interchangeable and well suited for plugins. As long as you provide the data needed by later processing steps, there is nothing stopping you from replacing a function in that pipe with your own.

Lastly, the pipeline operator represents very well how I once learned to think about functional programming: as a transformation of inputs. We start with some data structure and, through a series of transformations, we get some other data structure. We start with a configuration and end up with a complete benchmarking suite. We start with a URL and some parameters and transform them into some HTML to send to the user.
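As a side note, the operator itself is just sugar for nested calls; a small toy example of mine to illustrate:

```elixir
# nested calls read inside-out...
words = String.split(String.downcase("Board Games Are Great"))

# ...while the pipe reads top to bottom, feeding the previous
# value in as the first argument of the next call
words_piped =
  "Board Games Are Great"
  |> String.downcase()
  |> String.split()

words == words_piped # => true, both are ["board", "games", "are", "great"]
```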

Railway Oriented Programming


with {:ok, record} <- validate_data(params),
     {:ok, record} <- validate_in_other_system(record),
     {:ok, record} <- Repo.insert(record) do
  {:ok, record}
else
  {:error, changeset} -> {:error, changeset}
end


I’d love to ramble on about Railway Oriented Programming, but there are already good blog posts about it out there. Basically, instead of always checking whether an error has already occurred earlier, we can branch out to the error track at any point.

It doesn’t seem all that magical until you use it for the first time. I remember suggesting it to a colleague on a pull request (without ever having used it before myself) and my colleague came back like “It’s amazing!”

It’s been a pattern in the application ever since. It goes a bit like this:

  1. Check the basic validity of data that we have (all fields present/sensible data)
  2. Validate that data with another system (business logic rules in some external service)
  3. Insert record into database

Any one of those steps could fail, and if one fails, executing the remaining steps makes no sense. So, as soon as a function doesn’t return {:ok, something} we branch out to the error track; otherwise we stay on the happy track.
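Here is a self-contained sketch of that short-circuiting. The validation functions are stand-ins made up for illustration (not the app’s real ones), and the database insert is left out:

```elixir
defmodule Signup do
  # Stand-in validations: each returns {:ok, value} or {:error, reason}.
  def validate_data(%{name: name} = params) when is_binary(name), do: {:ok, params}
  def validate_data(_params), do: {:error, :invalid_data}

  def validate_in_other_system(%{name: "blocked"}), do: {:error, :blocked}
  def validate_in_other_system(record), do: {:ok, record}

  def create(params) do
    # As soon as a step returns something other than {:ok, _},
    # `with` stops and returns that value unchanged.
    with {:ok, record} <- validate_data(params),
         {:ok, record} <- validate_in_other_system(record) do
      {:ok, record}
    end
  end
end

Signup.create(%{name: "Homer"}) # => {:ok, %{name: "Homer"}}
Signup.create(%{name: 42})      # => {:error, :invalid_data}
```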

Explicit Code

The Python folks were right all along.

Implicit code feels like magic. It just works without writing any code. My controller instance variables are just present in the view? The name of my view is automatically inferred, so I don’t have to write anything? Such magic, many wow.

Phoenix, the most popular elixir web framework, takes another approach. You have to specify a template (which is like a Rails view) by name and explicitly pass parameters to it:


def new(conn, _params) do
  changeset = User.new_changeset(%User{})
  render conn, "new.html", changeset: changeset
end

No magic. What happens is right there, you can see it. No accidentally making something accessible to views (or partials) anymore! You know what else is there? The connection and the parameters, so we can make use of them and pattern match on them.

Another place where the elixir ecosystem is more explicit is when loading relations of a record:


iex(13)> user = Repo.get_by(User, name: "Homer")
iex(14)> user.videos
#Ecto.Association.NotLoaded<association :videos is not loaded>
iex(17)> user = Repo.preload(user, :videos)
iex(18)> user.videos
# all the videos

This is ecto, the “database access layer” in the elixir world. As you can see, we have to explicitly preload associations we want to use. Seems awfully bothersome, doesn’t it?

I love it!

No more N+1 queries from Rails magically loading things for me. Also, I get more control: I know which queries my application fires against the database. Just the other day I fixed a severe performance degradation where our app was loading countless records from the database and instantiating them for what should have been a simple count query. Long story short, it was a presenter object, so .association loaded all the objects, put them in presenters and then let my .size be executed on that. I would never have explicitly preloaded that data, and hence would have found out much earlier that something was wrong there.

Speaking of explicitness and ecto…

Ecto Changesets


def new_changeset(model, params \\ %{}) do
  model
  |> cast(params, ~w(name username))
  |> validate_required(~w(name username))
  |> unique_constraint(:username)
  |> validate_length(:username, min: 1, max: 20)
end

def registration_changeset(model, params) do
  model
  |> new_changeset(params)
  |> cast(params, ~w(password))
  |> validate_required(~w(password))
  |> validate_length(:password, min: 6, max: 100)
  |> put_pass_hash()
end


Callbacks and validations are my nemesis.

The problem with them is a topic for another post entirely, but in short: validations and callbacks are executed all the time (before save, validation, create, whatever), yet lots of them are only added for one feature that is maybe used in 2 places. Think about the user password. Validating it and hashing/salting it is only ever relevant when a user registers or changes the password. But that code sits there and is executed in an order that is not exactly trivial to determine. Sometimes that gets in the way of new features or tests, so you start to throw a bunch of ifs at it.

Ecto Changesets were one of the strangest things for me to get used to coming to Elixir. Basically they just encapsulate a change operation, saying which parameters can take part, what should be validated and what other code to execute. You can also easily combine changesets: in the code above, registration_changeset builds on new_changeset and just adds password functionality on top.

I only deal with password stuff when I explicitly want to. I know which parameters were allowed to go in/change so I just need to validate those. I know exactly when what step happens so it’s easy to debug and understand.

Beautiful.

Optional Type-Checking


@type job_name :: String.t | atom

@spec benchmark(Suite.t, job_name, fun, module) :: Suite.t
def benchmark(suite = %Suite{scenarios: scenarios}, job_name, function, printer \\ Printer) do
  # ...
end


Want to try typing, but not all the time? Elixir has got something for you! There’s not enough space here to explain it fully, but in short: dialyxir makes the dialyzer tool quite usable, and dialyzer goes beyond “just” type checking to include other static analysis features as well. And in case you don’t like types, it’s all optional.
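To sketch what that looks like in practice (a toy module, not from benchee): the spec documents the contract, and dialyzer can flag calls that can never satisfy it without ever running the code:

```elixir
defmodule Calc do
  # The spec says: takes an integer, returns an integer.
  @spec double(integer) :: integer
  def double(n), do: n * 2
end

Calc.double(21)
# => 42

# A call like Calc.double("twenty-one") would still compile, but
# running dialyzer (e.g. via dialyxir's mix task) would flag it,
# since a binary can never satisfy the integer spec.
```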

Parallelism

“Wait, you swore this wasn’t about performance! This is outrageous!”

Relax. While parallelism can be used for better performance, that’s not its only benefit. What’s great about parallelism in elixir/the Erlang VM in general is how low-cost and seamless it is. Spawning a new process (not an Operating System process; they’re more like actors) is super easy and has a very low overhead, unlike starting a new thread. You can have millions of them on one machine, no problem.

Moreover, thanks to immutability guarantees and every process being isolated, you don’t have to worry about processes messing with each other. So if I want to geocode the pick-up and drop-off addresses in parallel, I can just do that with a bit of Task.async and Task.await. I’d never just trust whatever ruby gems I use for geocoding to be thread-safe, due to global state et al.
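A sketch of that, with a dummy geocode function standing in for the real external call:

```elixir
# Dummy geocoder standing in for a slow external service call.
geocode = fn address ->
  Process.sleep(100)        # simulate network latency
  {address, {52.52, 13.40}} # fake coordinates
end

# Each lookup runs in its own lightweight process...
pickup_task  = Task.async(fn -> geocode.("pick up address") end)
dropoff_task = Task.async(fn -> geocode.("drop off address") end)

# ...and we wait for both; total time is roughly one lookup, not two.
{_, pickup_coords}  = Task.await(pickup_task)
{_, dropoff_coords} = Task.await(dropoff_task)
```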

How does this help? Well, I have something that is easily parallelizable and I can just do that. Benchee generates statistics for different scenarios in parallel just because I can easily do so. That’s nice, because for lots of samples it might actually take a couple of seconds per scenario.

Another point is that there’s less need for background workers. Let’s take web sockets as an example. To the best of my knowledge, in an evented architecture it is recommended to offload all bigger tasks in the communication to background workers, as they’d block our thread/operating system process from handling other events. In Phoenix every connection already runs in its own elixir process, which means they are already executed in parallel and doing some more work in one won’t block the others.

This ultimately makes applications simpler, as you don’t have to deal with background workers, offloading work etc.

OTP

Defining what OTP really is, is hard. It’s something of a toolset for concurrent programming, and it includes everything from an in-memory database to the Dialyzer tool mentioned earlier. It is probably most notorious for its “behaviours”, like supervisors, that help you build concurrent and distributed systems in the famous “Let it crash! (and restart it in a known good state maybe)” philosophy.
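To give just a taste of what a behaviour looks like, here is a toy (nowhere near production grade) GenServer holding a counter, placed under a supervisor that would restart it in its known good initial state if it ever crashed:

```elixir
defmodule Counter do
  use GenServer

  # Client API
  def start_link(initial), do: GenServer.start_link(__MODULE__, initial, name: __MODULE__)
  def increment, do: GenServer.call(__MODULE__, :increment)

  # Server callbacks
  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call(:increment, _from, count), do: {:reply, count + 1, count + 1}
end

# The supervisor starts Counter.start_link(0) and restarts it on crashes.
children = [{Counter, 0}]
Supervisor.start_link(children, strategy: :one_for_one)

Counter.increment() # => 1
```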

There are big books written about this, so I’m not gonna try to explain it here. Just this much: years of experience with highly available systems went into it. There is a lot to learn here that can change how you see programming. It might be daunting, but don’t worry: people have built some nice abstractions on top, and often it’s the job of a library or framework to set these up. Phoenix and ecto do this for you (for web requests and database connections respectively). I’ll out myself right now: I’ve never written a Supervisor or a GenServer for production purposes. I used abstractions or relied on what my framework brought with it.

If this has gotten you interested I warmly recommend “The Little Elixir & OTP Guidebook”. It walks you through building a complete worker pool application from simple to a more complex fully featured version.

Doctests


@doc """
## Examples
iex> {:ok, pid} = Agent.start_link(fn -> 42 end)
iex> Agent.get_and_update(pid, fn(state) -> {state, state + 1} end)
42
iex> Agent.get(pid, fn(state) -> state end)
43
"""


Imo the most underrated feature of elixir. Doctests allow you to write iex example sessions in the documentation of a function. These are executed during test runs and checked to make sure they still return the same values/still pass. They are also part of the awesome generated documentation. No more out-of-date/slightly wrong code samples in the documentation!
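Wiring them up takes a single doctest call in an ExUnit case. A minimal self-contained sketch (module names invented for illustration):

```elixir
ExUnit.start()

defmodule MyModule do
  @doc """
  Adds one to a number.

      iex> MyModule.add_one(41)
      42
  """
  def add_one(n), do: n + 1
end

defmodule MyModuleTest do
  use ExUnit.Case, async: true

  # Picks up every iex> example from MyModule's docs and runs it as a test.
  doctest MyModule
end
```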

I have entire modules that are tested solely through their doctests, which is pretty awesome if you ask me. Also, contributing doctests to libraries is a great way to provide both documentation and tests. E.g. once upon a time I wanted to learn about the Agent module, but it didn’t click right away, so I made a PR to elixir with some nice doctests to help future generations.

A good language to learn

In the end, elixir is a good language to learn. It contains many great concepts that can make your coding lives easier and more enjoyable and all of that lives in a nice and accessible syntax. Even if you can’t use it at work straight away, learning these concepts will influence and improve your code. I experienced the same when I learned and read a couple of books about Clojure, I never wrote it professionally but it improved my Ruby code.

You might ask “Should we all go and write Elixir now?”. No. There are many great languages and there are no silver bullets. The ecosystem is still growing, for instance. I’m also not a fan of rewriting all applications. Start small. See if you like it and if it works for you.

Lastly, if this has piqued your interest I have a whole talk up that focuses on the great explicit features of elixir and explains them in more detail: “Elixir & Phoenix – fast, concurrent and explicit”

edit1: Clarified that the bleacher report blog post is mostly about performance with little else

edit2: Fixed that you gotta specify the template by name, not the view

[1] It’s sort of like Traveling Salesman, put together with Knapsack and then you also need to decide what goes where. In short: a very hard problem to solve.

Benchmarking a Go AI in Ruby: CRuby vs. Rubinius vs. JRuby vs. Truffle – a year later

A little more than a year ago I published a blog post benchmarking different ruby implementations against a bot that plays Go which I wrote. Now, a little more than a year later (~13.5 months), let’s see how the different contestants have improved in the time passed.

This question becomes increasingly interesting as Ruby 3.0 aims to be 3 times as fast as Ruby 2.0.

As last time, the benchmarks will be run on my Go bot rubykon, which has barely changed since then. The important question for Monte Carlo Tree Search (MCTS) bots is how many simulations they can run, as more simulations improve the quality of play. You can check out the old blog post for more rationale on this.

Setup

The benchmarks were run on the 16th of January 2017 with the following concrete Ruby versions (versions slightly abbreviated in the rest of the post):

  • CRuby 2.0.0p648
  • CRuby 2.2.3p173
  • Rubinius 2.5.8
  • JRuby 9.0.3.0
  • JRuby 9.0.3.0 in server mode and with invoke dynamic enabled (denoted as + id)
  • Truffleruby with master from 2015-11-08 and commit hash fd2c179, running on graalvm-jdk1.8.0
  • CRuby 2.4.0p0
  • Rubinius 3.69
  • JRuby 9.1.7.0
  • JRuby 9.1.7.0 in server mode and with invoke dynamic enabled (denoted as + id)
  • Truffleruby on truffle-head from 2017-01-16 with commit hash 4ad402a54cf, running on graal-core master from 2017-01-16 with commit hash 8f1ad406d78f2f, built with a JVMCI-enabled jdk8 (check out the install script)

As you might notice I prefer to say CRuby over MRI and very old versions are gone – e.g. I dropped benchmarking CRuby 1.9.x and JRuby 1.7.x. I also added CRuby 2.0 – as it is the comparison standard for Ruby 3.0. The next 5 versions are the remaining rubies from the original benchmark, the other five are their most up to date versions.

All of this was run on my desktop PC running Linux Mint 18 (based on Ubuntu 16.04 LTS) with 16 GB of memory and an i7-4790 (3.6 GHz, 4 GHz boost), with the JVM-based rubies running on openjdk 8.


tobi@speedy ~ $ uname -a
Linux speedy 4.4.0-59-generic #80-Ubuntu SMP Fri Jan 6 17:47:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
tobi@speedy ~ $ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)

Full Monte Carlo Tree Search with 1000 playouts

I cut out the first benchmark from last year’s edition due to some trouble getting benchmark-ips running – so we’ll stick with the more macro benchmark that performs a full Monte Carlo Tree Search using UCT on a 19×19 board doing 1000 playouts, and see how fast we can get here. This is really the whole package of what we need to make fast for the Go bot to be fast! The benchmark uses benchmark-avg, which I wrote to support macro benchmarks better than benchmark-ips does.

The benchmarking code is quite simple:

Benchmark.avg do |benchmark|
  game_state_19 = Rubykon::GameState.new Rubykon::Game.new(19)
  mcts = MCTS::MCTS.new

  benchmark.config warmup: 180, time: 180

  benchmark.report "19x19 1_000 iterations" do
    mcts.start game_state_19, 1_000
  end
end

As you can see we run plenty of warmup – 3 minutes of it – and then 3 minutes of benchmarking time. So let’s see how many iterations per minute our contestants manage here:

Iterations per minute – higher is better

As one can see, truffleruby leads the pack by quite a margin, followed by JRuby (with truffleruby still over 2 times faster than JRuby). Truffleruby is also an impressive 7 times faster than CRuby 2.4.0.

Of course, as the new benchmark was inspired by Ruby 3.0 aiming to be 3 times as fast as Ruby 2.0 – how are we doing? Do we maybe already have a 3 times faster Ruby? Well, there is a graph for that!

Speedup as compared to CRuby 2.0 – higher is better

As we can see, JRuby 9.1.7.0 run in server mode and with invoke dynamic enabled is the first one to be 3 times faster than CRuby 2.0. Also, both the old and the newest version of truffleruby are more than 3 times faster than our baseline – the new one even more than 8 times faster! CRuby 2.4, on the other hand, is at about a 14% improvement compared to 2.0.

Another metric that intrigues me is how did the implementation improve in the time in between benchmarks, to gauge where the journey is going. Therefore, the next chart compares the newest version of a Ruby implementation benchmarked here against their older sibling from last time (Ruby 2.4.0 against 2.2.3, JRuby 9.1.7.0 vs. 9.0.3.0 etc.):

Speedup against older version (higher is better)

CRuby improved by about 11%, JRuby with invokedynamic about 18% while truffleruby, already leading the pack last time, managed another 2x performance improvement!

The odd one out clearly is Rubinius, which only manages about 20% of the performance of its former version (or a 5x decrease, if you will). This seemed like a setup error on my part at first, but it is not: Rubinius removed their JIT. As this benchmark is a prime example of a pretty hot loop running, the hit of removing the JIT naturally is pretty hard.

The small decrease in JRuby performance without invokedynamic is a bit odd, but it’s so small that it might as well be measurement inaccuracy.

Of course, for the data fans here is the raw table:

Ruby                    ipm     average time (s)   standard deviation   Speedup to 2.0
2.0.0p648               4.54    13.22              0.44%                1
2.2.3p173               4.68    12.83              1.87%                1.0308370044
rbx-2.5.8               4.67    12.84              1.91%                1.0286343612
JRuby 9.0.3.0           7.75    7.74               0.47%                1.7070484581
JRuby 9.0.3.0 + id      12.81   4.68               0.80%                2.8215859031
truffleruby old         16.93   3.54               10.70%               3.7290748899
2.4.0p0                 5.2     11.53              2.18%                1.1453744493
rbx-3.69                1.01    59.4               0.30%                0.2224669604
JRuby 9.1.7.0           7.34    8.17               2.12%                1.6167400881
JRuby 9.1.7.0 + id      15.12   3.97               0.62%                3.3303964758
truffleruby             36.65   1.64               1.25%                8.0726872247

Thoughts on different Ruby implementations

Let’s wrap this up with a couple of thoughts on the different implementations:

TruffleRuby

TruffleRuby is making steady and great progress, which I’m thoroughly impressed with. To be honest, I was wondering whether its performance had increased since the last benchmark at all, as I was worried that implementing new Ruby features would lead to decreased performance. Seeing that it still managed a 2x performance improvement is mind-boggling.

Raw speed is one thing, but if you’re familiar with TruffleRuby, one of its more noticeable downsides is the big warmup time it needs to do all of its fancy optimizations – so the peak performance you see here is only reached after a period during which it is much slower. Still, I’m happy to say that warmup also improved since last time! Where the old truffleruby, in my benchmarks, took about 101 seconds or 13 iterations to reach peak performance (hence the very long warmup time, to make sure every implementation is warm), the new one took around 52 seconds or 7 iterations. Still – the first of those warmup iterations took 27 seconds, so if you can’t deal with some warmup time to start with, this might be a deal breaker.

A caveat on warmup: rubykon has no external dependencies, so there’s not much code that needs to be JITed, and TruffleRuby can probably apply its type optimizations to the specific methods rather efficiently. Bigger applications will likely take longer to warm up.

Of course, the team is working on that – there is a really noteworthy post about the state of TruffleRuby in early 2017, in which further plans are detailed, e.g. C-extension support, drastically improving startup time and running Rails.

It shall also be mentioned here that setting up TruffleRuby took by far the most time, and some bugs had crept in that needed fixing for Rubykon to run again. But after all, this is a pre-1.0 project, so such problems are to be expected. With that in mind I want to explicitly thank Chris Seaton and Benoit Daloze for helping me with my setup troubles, fixing bugs and being wonderfully nice and responsive in general. Benoit even wrote a script to install the current graal-core master to run TruffleRuby with, which I was struggling with and which is currently needed to run rubykon on TruffleRuby without optfails.

JRuby

JRuby is coming along nicely. It’s the only Ruby implementation that runs this benchmark at 3 times the speed of Ruby 2.0 while also being able to run whole Rails applications. It’s still my personal favorite that I’d love to see more adoption of in the general ruby scene. Any time I see a talk or blog post about “moving from ruby to the JVM for performance/Java interop” that doesn’t mention JRuby but goes straight to Java/Clojure/Scala & friends, it makes me sad (nothing against those technologies though, they’re great!).

JRuby at the moment also sits in the middle between CRuby and TruffleRuby in other respects – it takes more warmup time than CRuby but a lot less than TruffleRuby, while still reaching nice peak performance. The latest release also brought along some nice performance improvements, and we can only expect more of those in the future.

CRuby/MRI

CRuby is coming along nicely and steadily – we get nice improvements to the language, and a 14% performance improvement over 2.0 isn’t negligible either. It’s still a long way from the targeted 3x, though. To be fair, the team is still in the process of defining the benchmarks by which “Ruby 3×3” will be measured (the current plan, afaik, is to have 9 of those because 3×3 = 9). This is the groundwork for starting optimization work, and Ruby 3 is still far in the future, with an estimated release in 2020.

Rubinius

Sadly, this is my bummer of this benchmarking round. A 5x performance decrease compared to the previous version of this benchmark was quite surprising; as noted before, this is due to the removed JIT. Comment courtesy of Brian (maintainer of Rubinius) from the issue I opened:

@PragTob the just-in-time compiler (JIT) has been removed to make way for a new interpreter and JIT infrastructure. That is the reason you’re seeing the performance degradation (and illustrates how important JIT is to making Ruby fast). The JIT was removed because it had a number of bugs and was too complicated, resulting in almost no contributors making improvements.

If I do a next version of these benchmarks and Rubinius by then doesn’t have a JIT again or some other performance improvements, I’ll probably drop benchmarking it. It’s far behind the others as of now, and if Rubinius’s goal isn’t speed but developer experience or whatever, then there also isn’t much sense in benchmarking it 🙂

Final Thoughts

CRuby and JRuby did mostly what I expected them to – improve at a steady and good pace. TruffleRuby truly surprised me with 2x improvements in run time and warmup. I’m still a bit skeptical about warmup time when it’s running a full-fledged Rails application, but happy to try that out once they get there 🙂 It makes me wonder, though: if I ported Rubykon to Crystal, how would the performance compare to Truffle? Ah, time…

Almost forgot the usual disclaimer, so here it goes: Always run your own benchmarks! This is a very CPU-intensive AI problem typically solved by much more performant languages. I did it for fun and to see how far I could get. Also, this benchmark most certainly isn’t indicative of the performance of running a Rails application – the parts heavily used by Rails are most likely way different from what this does. E.g. we have no I/O here and little to no String operations, which play a bigger role in Rails. It might point in the right direction and speed improvements might be the same, but they don’t have to be.

Finally, this took me WAY longer than I expected. I started over a month ago while I still had time off; compilation/running problems with both old and very shiny new rubies are mostly to blame. So I’m not sure if I’ll do this again in a year’s time – if you’d like to see it and like this sort of thing, please let me know 🙂

 

Tail Call Optimization in Elixir & Erlang – not as efficient and important as you probably think

(Automatic) Tail Call Optimization (TCO) is that great feature of Elixir and Erlang that everyone tells you about. It’s super fast, super cool and you should definitely always aim to make every recursive function tail-recursive. What if I told you that body-recursive functions can be faster and more memory efficient than their especially optimized tail-recursive counterparts?

Seems unlikely, doesn’t it? After all, every beginner’s book will mention TCO, tell you how efficient it is and that you should definitely use it. Plus, maybe you’ve tried body-recursion before in language X and your call stack blew up or it was horrendously slow. I did, and I thought tail-recursive functions were always better than body-recursive ones. Until one day, by accident, I wrote a non-tail-recursive function (so TCO didn’t apply to it). Someone told me, and eagerly I replaced it with its tail-recursive counterpart. Then I stopped for a second and benchmarked it – the results were surprising, to say the least.

Before we do a deep dive into the topic, let’s take a refresher on tail call optimization from Dave Thomas in Programming Elixir 1.2 (great book!):

(…) it ends up calling itself. In many languages, that adds a new frame to the stack. After a large number of messages, you might run out of memory.

This doesn’t happen in Elixir, as it implements tail-call optimization. If the last thing a function does is call itself, there’s no need to make the call. Instead, the runtime simply jumps back to the start of the function. If the recursive call has arguments, then these replace the original parameters.

Well, let’s get into this 🙂

Writing a map implementation

So let’s write an implementation of the map function. One will be body-recursive, one will be tail-recursive. I’ll add another tail-recursive implementation that uses ++ instead of reversing, and one that simply does not reverse the list at the end. The one that doesn’t reverse the list isn’t functionally equivalent to the others, of course, as the elements are not in order, but if you wrote your own function and don’t care about ordering, this might be for you. In an update I also added a version where the argument order is different; for more on this see the results and edit6.


defmodule MyMap do
  def map_tco(list, function) do
    Enum.reverse(_map_tco([], list, function))
  end

  defp _map_tco(acc, [head | tail], function) do
    _map_tco([function.(head) | acc], tail, function)
  end

  defp _map_tco(acc, [], _function) do
    acc
  end

  def map_tco_arg_order(list, function) do
    Enum.reverse(do_map_tco_arg_order(list, function, []))
  end

  defp do_map_tco_arg_order([], _function, acc) do
    acc
  end

  defp do_map_tco_arg_order([head | tail], function, acc) do
    do_map_tco_arg_order(tail, function, [function.(head) | acc])
  end

  def map_tco_concat(acc \\ [], list, function)

  def map_tco_concat(acc, [head | tail], function) do
    map_tco_concat(acc ++ [function.(head)], tail, function)
  end

  def map_tco_concat(acc, [], _function) do
    acc
  end

  def map_body([], _func), do: []

  def map_body([head | tail], func) do
    [func.(head) | map_body(tail, func)]
  end

  def map_tco_no_reverse(list, function) do
    _map_tco([], list, function)
  end
end


map_body here is the function I originally wrote. It is not tail-recursive, as the last operation in this function is prepending the mapped head to the recursive result, not the call to map_body. Comparing it to all the other implementations, I’d also argue it’s the easiest and most readable, as we don’t have to care about accumulators or reversing the list.

Now that we have the code, let us benchmark the functions with benchee! Benchmark run on Elixir 1.3 with Erlang 19 on an i7-4790 on Linux Mint 17.3. Let’s just map over a large list and add one to each element of the list. We’ll also throw in the standard library implementation of map as a comparison baseline:


list = Enum.to_list(1..10_000)
map_fun = fn(i) -> i + 1 end

Benchee.run %{
  "map tail-recursive with ++" =>
    fn -> MyMap.map_tco_concat(list, map_fun) end,
  "map with TCO reverse" =>
    fn -> MyMap.map_tco(list, map_fun) end,
  "stdlib map" =>
    fn -> Enum.map(list, map_fun) end,
  "map simple without TCO" =>
    fn -> MyMap.map_body(list, map_fun) end,
  "map with TCO new arg order" =>
    fn -> MyMap.map_tco_arg_order(list, map_fun) end,
  "map TCO no reverse" =>
    fn -> MyMap.map_tco_no_reverse(list, map_fun) end
}, time: 10, warmup: 10


tobi@happy ~/github/elixir_playground $ mix run bench/tco_blog_post_detailed.exs
Benchmarking map tail-recursive with ++...
Benchmarking map TCO no reverse...
Benchmarking stdlib map...
Benchmarking map with TCO new arg order...
Benchmarking map simple without TCO...
Benchmarking map with TCO reverse...

Name                             ips        average      deviation    median
map simple without TCO       6015.31     166.24μs      (±6.88%)    163.00μs
stdlib map                   5815.97     171.94μs     (±11.29%)    163.00μs
map with TCO new arg order   5761.46     173.57μs     (±10.24%)    167.00μs
map TCO no reverse           5566.08     179.66μs      (±6.39%)    177.00μs
map with TCO reverse         5262.89     190.01μs      (±9.98%)    182.00μs
map tail-recursive with ++      8.66  115494.33μs      (±2.86%) 113537.00μs

Comparison:
map simple without TCO       6015.31
stdlib map                   5815.97 – 1.03x slower
map with TCO new arg order   5761.46 – 1.04x slower
map TCO no reverse           5566.08 – 1.08x slower
map with TCO reverse         5262.89 – 1.14x slower
map tail-recursive with ++      8.66 – 694.73x slower


(The benchmarking was actually done by this slightly more verbose but equivalent script, so that it generates the CSV output and console output on the same run using benchee’s slightly more verbose interface. A feature to make that possible with the nicer interface is planned 😉 )

For the more visual here is also a graph showcasing the results (visualized using benchee_csv):

Graphing iterations per second (higher is better)

So what do we see? The body-recursive function seems to be as fast as the version from the standard library. Its reported values are faster, but well within the margin of error. Plus, the median of the two is the same, while the standard deviation is higher for the standard library version. This hints at the possibility that the worse average may be due to some outliers (resulting e.g. from Garbage Collection). The tail-recursive version with ++ is VERY SLOW, but that’s because appending with ++ so frequently is a bad idea, as it needs to traverse to the end of the linked list every time (O(n)). But that’s not the main point.
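The cost difference between prepending (as the accumulator versions do) and appending with ++ comes down to how linked lists work:

```elixir
list = [2, 3, 4]

[1 | list]  # O(1): one new cons cell pointing at the existing list
# => [1, 2, 3, 4]

list ++ [5] # O(n): has to copy 2, 3 and 4 to attach the new tail
# => [2, 3, 4, 5]
```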

The main point is that the tail-recursive version is about 14% slower! Even the tail-recursive version that doesn’t reverse the list is slower than the body-recursive implementation!

Here seems like a good point to interject and mention that it was brought up in the comments (see edit11) that for significantly larger lists tail-recursive implementations get faster again. You can check out the results from a 10 Million item list.

What is highly irritating and surprising to me is that the tail-recursive function with a slightly different argument order is significantly faster than my original implementation, almost 10%. And this is not a one off – it is consistently faster across a number of runs. You can see more about that implementation in edit6 below. Thankfully José Valim chimed in about the argument order adding the following:

The order of arguments will likely matter when we generate the branching code. The order of arguments will specially matter if performing binary matching. The order of function clauses matter too although I am not sure if it is measurable (if the empty clause comes first or last).

Now, maybe there is a better tail-recursive version (please tell me!) but this result is rather staggering but repeatable and consistent. So, what happened here?

An apparently common misconception

That tail-recursive functions are always faster seems to be a common misconception – common enough that it made the list of Erlang Performance Myths as “Myth: Tail-Recursive Functions are Much Faster Than Recursive Functions”! (Note: this section is currently being reworked so the name might change/link might not lead to it directly any more in the near-ish future)

To quote that:

A body-recursive list function and a tail-recursive function that calls lists:reverse/1 at the end will use the same amount of memory. lists:map/2, lists:filter/2, list comprehensions, and many other recursive functions now use the same amount of space as their tail-recursive equivalents.

So, which is faster? It depends. On Solaris/Sparc, the body-recursive function seems to be slightly faster, even for lists with a lot of elements. On the x86 architecture, tail-recursion was up to about 30% faster

The topic also recently came up on the erlang-questions mailing list again while talking about the rework of the aforementioned Erlang performance myths site (which is really worth the read!). In it Fred Hebert remarks (emphasis added by me):

In cases where all your function does is build a new list (or any other accumulator whose size is equivalent to the number of iterations and hence the stack) such as map/2 over nearly any data structure or say zip/2 over lists, body recursion may not only be simpler, but also faster and save memory over time.

He also has his own blog post on the topic.

But won’t it explode?

I had the same question. From my experience with the clojure koans I expected the body-recursive function to blow up the call stack given a large enough input. But I didn’t manage to, no matter what I tried.

It seems this is impossible, as the BEAM VM that Erlang and Elixir run on differs in its implementation from other VMs; body recursion is limited only by available RAM:

Erlang has no recursion limit. It is tail call optimised. If the recursive call is not a tail call it is limited by available RAM

Memory consumption

So what about memory consumption? Let’s create a list with one hundred million elements (100_000_000) and map over it measuring the memory consumption. When this is done the tail-recursive version takes almost 13 Gigabytes of memory while the body-recursive version takes a bit more than 11.5 Gigabytes. Details can be found in this gist.

Why is that? With a list this large, it is most likely because the tail-recursive version needs to build a new, reversed copy of the accumulator to return the results in the correct order.
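To make that concrete, here is a minimal sketch of the two map implementations being compared (module and function names are my own, not from the benchmark repository). The body-recursive version builds the result front to back as the recursion unwinds; the tail-recursive version prepends to an accumulator and must reverse it at the end:

```elixir
defmodule MyMap do
  # Body-recursive: the result list is constructed as the recursion
  # unwinds, so the elements are already in the right order.
  def map_body([], _function), do: []
  def map_body([head | tail], function) do
    [function.(head) | map_body(tail, function)]
  end

  # Tail-recursive: each element is prepended to the accumulator,
  # so the accumulator ends up reversed and needs one final reversal.
  def map_tco(list, function), do: map_tco(list, function, [])
  defp map_tco([], _function, acc), do: Enum.reverse(acc)
  defp map_tco([head | tail], function, acc) do
    map_tco(tail, function, [function.(head) | acc])
  end
end
```

That final `Enum.reverse/1` is the extra copy of the whole result list that shows up in the memory measurements above.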

Body-recursive functions all the time now?

So let’s recap, the body-recursive version of map is:

  • faster
  • consumes less memory
  • easier to read and maintain

So why shouldn’t we do this every time? Well, there are other examples of course. Let’s take a look at a very dumb function deciding whether a number is even (implemented as a homage to the Clojure koans exercise that showed how the call stack blows up in Clojure without recur):


defmodule Number do
  def is_even?(0), do: true
  def is_even?(n) do
    !is_even?(n - 1)
  end

  def is_even_tco?(n, acc \\ true)
  def is_even_tco?(0, acc), do: acc
  def is_even_tco?(n, acc) do
    is_even_tco?(n - 1, !acc)
  end
end



number = 10_000_000

Benchee.run %{
  "is_even?"     => fn -> Number.is_even?(number) end,
  "is_even_tco?" => fn -> Number.is_even_tco?(number) end
}


tobi@happy ~/github/elixir_playground $ mix run bench/is_even.exs
Benchmarking is_even?...
Benchmarking is_even_tco?...

Name                  ips        average    deviation         median
is_even?            10.26     97449.21μs     (±0.50%)     97263.00μs
is_even_tco?         9.39    106484.48μs     (±0.09%)    106459.50μs

Comparison:
is_even?            10.26
is_even_tco?         9.39 - 1.09x slower


The tail-recursive version here is still about 10% slower. But what about memory? Running the function with one hundred million as input takes 41 Megabytes for the tail-recursive version (mind you, this is the whole Elixir process) but almost 6.7 Gigabytes for the body-recursive version. Also, for that huge input the tail-recursive version took 1.3 seconds, while the body-recursive function took 3.86 seconds. So for larger inputs, the tail-recursive version is both faster and far more memory-efficient.

Stark contrast, isn’t it? That’s most likely because this time around there is no huge list to be carried around or accumulated – just a boolean and a number. Here the effect that the body-recursive function needs to save its call stack in the RAM has a much more damaging effect, as it needs to call itself one hundred million times.

Rerunning the map implementation with a significantly larger list, the tail-recursive versions also become faster than the standard library version and the body-recursive function. See edit11.

So, what now?

Tail-recursive functions should still be faster and more efficient for many or most use cases – or at least that’s what I believed after years of being taught that tail call optimization leads to the fastest recursive functions ;). This post isn’t saying that TCO is bad or slow. It is here to highlight that there are cases where body-recursive functions are faster and more efficient than tail-recursive functions. I’m also still unsure why the tail-recursive function that does not reverse the list is still slower than the body-recursive version – it might be because it has to carry the accumulator around.

Maybe we should also take a step back in education and teaching and be more careful not to overemphasize tail call optimization and with it tail-recursive functions. Body-recursive functions can be a viable, or even superior, alternative and they should be presented as such.

There are, of course, cases where writing tail-recursive functions is absolutely vital, as Robert Virding, creator of Erlang, rightfully highlights:

No, the main case where is TCO is critical is in process top-loops. These functions never return (unless the process dies) so they will build up stack never to release it. Here you have to get it right. There are no alternatives. The same applies if you top-loop is actually composed of a set of mutually calling functions. There there are no alternatives. Sorry for pushing this again, and again, but it is critical. :slight_smile:

But what does this teach us in the end? Don’t take your assumptions stemming from other programming environments for granted. Also, don’t assume – always prove. So let’s finish with the closing words of the Erlang performance myths section on this:

So, the choice is now mostly a matter of taste. If you really do need the utmost speed, you must measure. You can no longer be sure that the tail-recursive list function always is the fastest.

edit1: there previously was a bug in is_even_tco? with a missing not, not caught by my tests as they were calling the wrong function 😦 Thanks to Puella for notifying me.

edit2/addendum: (see next edit/update) It was pointed out at lobste.rs that running it in an erlang session the body-recursive function was significantly slower than the tco version. Running what I believe to be equivalent code in Elixir and Erlang it seems that the Erlang map_body version is significantly slower than in Elixir (2 to 3 times by the looks of it). I’d need to run it with an Erlang benchmarking tool to confirm this, though.

edit3/addendum2: The small tries mentioned in the first addendum were run in the shell, which is not a great idea. Using my little Erlang knowledge I made something that compiled that “benchmark”, and map_body is as fast or faster again (see the thread). Benchmarking can be fickle and wrong if not done right, so I would still look forward to running this in a proper Erlang benchmarking tool or to using Benchee from Erlang. But no time right now 😦

edit4: Added comment from Robert Virding regarding process top loops and how critical TCO is there. Thanks for reading, I’m honoured and surprised that one of the creators of Erlang read this 🙂 His full post is of course worth a read.

edit5: Following the rightful nitpick I don’t write “tail call optimized” functions any more but rather “tail-recursive” as tail call optimization is more of a feature of the compiler and not directly an attribute of the function

edit6: Included another version in the benchmark that swaps the argument order so that the list stays the first argument and the accumulator is the last argument. Surprisingly (yet again) this version is constantly faster than the other tail-recursive implementation but still slower than body recursive. I want to thank Paweł for pointing his version out in the comments. The reversed argument order was the only distinguishing factor I could make out in his version, not the assignment of the new accumulator. I benchmarked all the different variants multiple times. It is consistently faster, although I could never reproduce it being the fastest. For the memory consumption example it seemed to consume about 300MB less than the original tail-recursive function and was a bit faster. Also since I reran it either way I ran it with the freshly released Elixir 1.3 and Erlang 19. I also increased the runtime of the benchmark as well as the warmup (to 10 and 10) to get more consistent results overall. And I wrote a new benchmarking script so that the results shown in the graph are from the same as the console output.

edit7: Added a little TCO intro as it might be helpful for some 🙂

edit8: In case you are interested in what the bytecode looks like, I haven’t done it myself (not my area of expertise – yet) there is a gist from Sasa Juric showing how you can get the bytecode from an elixir function

edit9: Added José’s comment about the argument order, thanks for reading and commenting! 🙂

edit10: Added CPU and OS information 🙂

edit11: It was pointed out in the comments that the performance characteristics might differ for larger lists. And it does, taking 1000 times as many elements (10_000_000) all tail-recursive versions are significantly faster than the body-recursive version (and even the stdlib version).

edit12: Updated to reflect newer Benchee API as the old API with lists of tuples doesn’t work anymore 🙂

 

Don’t you Struct.new(…).new(…)

As I just happened upon it again, I gotta take a moment to talk about one of my most despised Ruby code patterns, ever since I happened upon it for the first time: Struct.new(...).new.

Struct.new?

Struct.new is a convenient way to create a class with accessors:

2.2.2 :001 > Extend = Struct.new(:start, :length)
 => Extend 
2.2.2 :002 > instance = Extend.new(10, 20)
 => #<struct Extend start=10, length=20> 
2.2.2 :003 > instance.start
 => 10 
2.2.2 :004 > instance.length
 => 20

That’s neat, isn’t it? It is, but like much of Ruby’s power it needs to be wielded with caution.

Where Struct.new goes wrong – the second new

When you do Struct.new you create an anonymous class; another new on it creates an instance of that class. Therefore, Struct.new(...).new(...) creates an anonymous class and an instance of it at the same time. This is bad, as we create a whole class only to create a single instance of it! That is a capital waste.
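A small illustration of my own (not from the original post) makes this visible: two calls to Struct.new with identical members still yield two distinct anonymous classes, so nothing is ever reused between calls.

```ruby
# Each call to Struct.new returns a brand new anonymous class,
# even when the member lists are identical.
a = Struct.new(:start, :length)
b = Struct.new(:start, :length)

puts a.equal?(b)                  # false – two different class objects
puts a.new(1, 2).instance_of?(b)  # false – instances don't mix either
```

Every time the “new-new” pattern runs in a hot loop, a fresh class like `a` and `b` above is allocated.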

As a one-off use it might be okay, for instance when you put the result in a constant. The sad part is that this is not the only way I’ve seen it used. Just as in the first case where I encountered it, I see it used inside code that is invoked frequently – some sort of hot loop with calculations where more than one value is needed to represent the result of the calculation. Here programmers sometimes seem to reach for Struct.new(...).new(...).

Why is that bad?

Well, it incurs unnecessary overhead – creating the class every single time is needless work. Not only that: as far as I understand, it also creates new independent entries in the method cache. For JITed implementations (like JRuby) the methods would also be JITed independently. And it gets in the way of profiling, as you see lots of anonymous classes with only 1 to 10 method calls each.

But how bad is that performance hit? I wrote a little benchmark where an instance is created with 2 values and then those 2 values are read one time each. Once with Struct.new(...).new(...), once where the Struct.new(...) is saved in an intermediary constant. For fun and learning I threw in a similar usage with Array and Hash.

require 'benchmark/ips'

Benchmark.ips do |bm|
  bm.report "Struct.new(...).new" do
    value = Struct.new(:start, :end).new(10, 20)
    value.start
    value.end
  end

  SavedStruct = Struct.new(:start, :end)
  bm.report "SavedStruct.new" do
    value = SavedStruct.new(10, 20)
    value.start
    value.end
  end

  bm.report "2 element array" do
    value = [10, 20]
    value.first
    value.last
  end

  bm.report "Hash with 2 keys" do
    value = {start: 10, end: 20}
    value[:start]
    value[:end]
  end

  bm.compare!
end

I ran those benchmarks with CRuby 2.3. And the results – well, I was surprised how huge the impact really is. The “new-new” implementation is over 33 times slower than the SavedStruct equivalent, and about 54 times slower than the fastest solution (Array), although that’s also not my preferred solution.

Struct.new(...).new    137.801k (± 3.0%) i/s -    694.375k
SavedStruct.new          4.592M (± 1.7%) i/s -     22.968M
2 element array          7.465M (± 1.4%) i/s -     37.463M
Hash with 2 keys         2.666M (± 1.6%) i/s -     13.418M

Comparison:
     2 element array:  7464662.6 i/s
     SavedStruct.new:  4592490.5 i/s - 1.63x slower
    Hash with 2 keys:  2665601.5 i/s - 2.80x slower
 Struct.new(...).new:   137801.1 i/s - 54.17x slower

Benchmark in iterations per second (higher is better)

But that’s not all…

This is not just about performance, though. When people take this “shortcut” they also circumvent one of the hardest problems in programming – naming. What is that Struct with those values? Do the values have any connection at all, or were they just smashed together because it seemed convenient at the time? What’s missing is the identification of the core concept that these values represent – anything that says this is more than a clump of data holding two values. And on top of that, it performs very poorly.

So, please avoid using Struct.new(...).new(...) – use a better alternative. Don’t recreate a class over and over – give it a name, put it into a constant and enjoy a better understanding of the concepts and increased performance.

Benchmarking a Go AI in Ruby: CRuby vs. Rubinius vs. JRuby vs. Truffle/Graal

The world of Artificial Intelligences is often full of performance questions. How fast can I compute a value? How far can I look ahead in a tree? How many nodes can I traverse?

In Monte Carlo Tree Search one of the most defining questions is “How many simulations can I run per second?”. If you want to learn more about Monte Carlo Tree Search and its application to the board game Go, I recommend the video and slides of my talk on that topic from RubyConf 2015.

Implementing my own AI – rubykon – in Ruby is of course not going to get me the fastest implementation ever. It forces you to really do less and therefore make nice performance optimizations, though. But this isn’t about that either. Here I want to take a look at another question: “How fast can Ruby go?” Ruby is a language with surprisingly many well-maintained implementations – most prominently CRuby, Rubinius, JRuby and the newcomer JRuby + Truffle. How do they perform in this task?

The project

Rubykon is a relatively small project – right now the lib directory has less than 1200 lines of code (which includes a small benchmarking library… more on that later). It has no external runtime dependencies – not even the standard library. So it is very minimalistic and also tuned for performance.

Setup

The benchmarks were run on the pre-0.3.0 rubykon version on the 8th of November (sorry, writeups always take longer than you think!) with the following concrete Ruby versions (versions slightly abbreviated in the rest of the post):

  • CRuby 1.9.3p551
  • CRuby 2.2.3p173
  • Rubinius 2.5.8
  • JRuby 1.7.22
  • JRuby 9.0.3.0
  • JRuby 9.0.3.0 run in server mode and with invoke dynamic enabled (denoted as + id)
  • JRuby + Truffle Graal with master from 2015-11-08 and commit hash fd2c179, running on graalvm-jdk1.8.0

You can find the raw data (performance numbers, concrete version outputs, benchmark results for different board sizes and historic benchmark results) in this file.

This was run on my pretty dated desktop PC (i7 870):


tobi@tobi-desktop ~ $ uname -a
Linux tobi-desktop 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
tobi@tobi-desktop ~ $ java -version
openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
tobi@tobi-desktop ~ $ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 30
Stepping:              5
CPU MHz:               1200.000
BogoMIPS:              5887.87
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

First benchmark: Simulation + Scoring on 19×19

This benchmark uses benchmark-ips to see how many playouts (simulation + scoring) can be done per second. This is basically the “evaluation function” of the Monte Carlo Method. Here we start with an empty board and then play random valid moves until there are no valid moves anymore and then we score the game. The performance of a MCTS AI is hugely dependent on how fast that can happen.

Benchmarks were run with a warmup time of 60 seconds and a run time of 30 seconds. The small black bars in the graph denote standard deviation. Results:

19_19_ips.png
Full 19×19 playout, iterations per second (higher is better)
Ruby Version                      iterations per second   standard deviation
CRuby 1.9.3p551                                  44.952                8.90%
CRuby 2.2.3p173                                  55.403                7.20%
Rubinius 2.5.8                                   40.911                4.90%
JRuby 1.7.22                                     63.456               15.80%
JRuby 9.0.3.0                                    73.479                6.80%
JRuby 9.0.3.0 + invoke dynamic                  121.265               14.00%
JRuby + Truffle                                 192.420               14.00%

JRuby + Truffle runs on a slightly modified version of benchmark-ips. This is done because it is a highly optimizing and speculative runtime that leads to bad results after warmup. This is explained here.

Second benchmark: Full UCT Monte Carlo Tree Search with 1000 playouts

This benchmark does a full Monte Carlo Tree Search, meaning choosing a node to investigate, doing a full simulation and scoring there and then propagating the results back in the tree before starting over again. As the performance is mostly dependent on the playouts the graph looks a lot like the one above.

This uses benchmark-avg, which I wrote myself and (for now) still lives in the rubykon repository. Why a new benchmarking library? In short: I needed something for more “macro” benchmarks that gives nice output like benchmark-ips. Also, I wanted a benchmarking tool that plays nice with Truffle – which means doing warmup and run of a benchmark directly after one another, as detailed in this issue.

This uses a warmup time of 3 minutes and a run time of 2 minutes. Along with the iterations per minute, we have another graph depicting average run time.

19_19_ipm.png
MCTS on 19×19 with 1000 playouts, iterations per minute (higher is better)
19_19_avg.png
MCTS on 19×19 with 1000 playouts, average run time (lower is better)
Ruby Version                      iterations per minute   average time (s)   standard deviation
CRuby 1.9.3p551                                    1.61              37.26                2.23%
CRuby 2.2.3p173                                    2.72              22.09                1.05%
Rubinius 2.5.8                                     2.10              28.52                2.59%
JRuby 1.7.22                                       3.94              15.23                1.61%
JRuby 9.0.3.0                                      3.70              16.23                2.48%
JRuby 9.0.3.0 + invoke dynamic                     7.02               8.55                1.92%
JRuby + Truffle                                    9.49               6.32                8.33%

Results here pretty much mirror the previous benchmark, although the standard deviation is smaller throughout, which might be because more non-random code execution is involved.

Otherwise the relative performance of the different implementations is more or less the same, with the notable exception of JRuby 1.7 performing better than 9.0 (without invoke dynamic). That could be an oddity, but it is also well within the margin of error for the first benchmark.

For the discussion below I’ll refer to this benchmark, as it ran on the same code for all implementations and has a lower standard deviation overall.

Observations

The most striking observation certainly is that JRuby + Truffle/Graal sits atop the benchmarks by a good margin. It’s not that surprising when you look at previous work done here suggesting speedups of 9x to 45x compared to CRuby. Here the speedup relative to CRuby is “just” 3.5x, which teaches us to always run our own benchmarks.

It is also worth noting that Truffle was at first unexpectedly very slow (10 times slower than 1.9), so I opened an issue and reported that somewhat surprising lack of performance. Chris Seaton was quick to fix it, and along the way he kept an amazing log of the things he did to diagnose the problem and make it faster. If you ever wanted to take a peek into the mind of a Ruby implementer – go ahead and read it!

At the same time I gotta say that the warmup time it takes has got me worried a bit. This is a very small application with one very hot loop (generating the valid moves). It doesn’t even use the standard library. The warmup times are rather huge exactly for Truffle and I made sure to call no other code in benchmark/avg as this might deoptimize everything again. However, it is still in an early stage and I know they are working on it 🙂

Second, “normal” JRuby is faster than CRuby, which is not much of a surprise to me – in most benchmarks I do, JRuby comes up ~twice as fast as CRuby. So when it was only ~30% faster I was actually a bit disappointed, but then I remembered the --server -Xcompile.invokedynamic=true switches and enabled them. BOOM! Almost 2.6 times faster than CRuby! Almost 90% faster than JRuby without those switches.

Now you might ask: “Why isn’t this the default?” Well, it was the default. Optimizing takes time, and that slows down startup significantly – for, say, Rails – which is why it was deactivated by default.

If I’m missing any of these magic switches for any of the other implementations please let me know and I’ll add them.

I’m also a bit sad to see Rubinius land somewhere between 1.9 and 2.2 performance-wise; I had higher hopes for its performance given some appropriate warmup time.

Also, Opal is notably missing; I couldn’t get it to run but will try again with a future version to see what V8 can optimize here.

An important word of warning to conclude the high level look at the different implementations: These benchmarks are most likely not true for your application! Especially not for rails! Benchmark yourself 🙂

Now for another question that you probably have on your mind: “How fast is this compared to other languages/implementations?” See, that’s hard to answer. No serious Go engine does pure random playouts, they all use some heuristics slowing them down significantly. But, they are still faster. Here’s some data from this computer go thread, they all refer to the 19×19 board size:

  • it is suggested that one should be able to do at least 100 000 playouts per second without heuristics
  • With light playouts Aya did 25 000 playouts in 2008
  • the well-known C engine Pachi does 2000 heavy playouts per thread per second

Which leads us to the question…

Is this the end of the line for Ruby?

No, there are still a couple of improvements that I have in mind that can make it much faster. How much faster? I don’t know. I have this goal of 1000 playouts on 19×19 per second per thread in mind. It’s still way behind other languages, but hey we’re talking about Ruby here 😉

Some possible improvements:

  • Move generation can still be improved a lot: instead of always searching for a new valid random move, a list of valid moves could be kept around, but it’s tricky
  • Scoring can also be done faster by leveraging neighbouring cells, but it’s not the bottleneck (yet)
  • a very clever but less accurate data structure can be used for liberty counting
  • also, of course, actually parallelize it and run on multiple threads
  • I could also use an up to date CPU for a change 😉

Other than that, I’m also looking over to the ruby implementations to get better, optimize more and make it even faster. I have especially high hopes for JRuby and JRuby + Truffle here.

So in the future I’ll try to find out how fast this can actually get, which is a fun ride and has taught me a lot so far already! You should try playing the benchmark game for yourselves 🙂

The not so low cost of calling dynamically defined methods

There are a couple of mantras that exist across programming communities, and one of them is to avoid duplication, or to keep it DRY. Programming languages equip us with different tools to avoid duplication. In Ruby a popular way to achieve this is metaprogramming: methods are dynamically defined to get rid of all duplication and we rejoice – yay! There might be other problems with metaprogrammed solutions, but at least we can be sure that the performance is the same as if we had written that duplicated code. Or can we?

As the title suggests, this post examines the performance of these meta programmed methods. If you are looking for method definition performance or pure call overhead, you’ll find that information in this post by Aaron Patterson.

Before we get into the details I want to quickly highlight that this is not some theoretical micro benchmark I pulled out of thin air. These examples are derived from performance improvements on an actual project. That work was done by my friend Jason R. Clark on this pull request over at Shoes 4. As he doesn’t have time to write it up – I get to, so let’s get to it!

Let’s look at some methods!

(Please be aware that this example is simplified – of course the real code has varying parts, most importantly the name of the instance variable, which is the reason why the code was meta programmed in the first place.)

class FakeDimension
  def initialize(margin_start)
    @margin_start = margin_start
    @margin_start_relative = relative? @margin_start
  end

  def relative?(result)
    result.is_a?(Float) && result <= 1
  end

  def calculate_relative(result)
    (result * 100).to_i
  end

  define_method :full_meta do
    instance_variable_name = '@' + :margin_start.to_s
    value = instance_variable_get(instance_variable_name)
    value = calculate_relative value if relative? value
    value
  end

  IVAR_NAME = "@margin_start"
  define_method :hoist_ivar_name do
    value = instance_variable_get(IVAR_NAME)
    value = calculate_relative value if relative? value
    value
  end

  define_method :direct_ivar do
    value = @margin_start
    value = calculate_relative value if relative? value
    value
  end

  eval <<-CODE
  def full_string
    value = @margin_start
    value = calculate_relative value if relative? value
    value
  end
  CODE

  def full_direct
    value = @margin_start
    value = calculate_relative value if relative? value
    value
  end
end

Starting at the first define_method these are all more or less the same method. We start at a fully meta programmed version, that even converts a symbol to an instance variable name, and end with the directly defined method without any meta programming. Now with all these methods being so similar you’d expect them all to have roughly the same performance characteristics, right? Well, such is not the case as demonstrated by the following benchmark. I benchmarked these methods both for the case where the value is relative and for when it is not. The results are not too different – see the gist for details. Running the non relative version on CRuby 2.2.2 with benchmark-ips I get the following results (higher is better):

           full_meta      1.840M (± 3.0%) i/s -      9.243M
     hoist_ivar_name      3.147M (± 3.3%) i/s -     15.813M
         direct_ivar      5.288M (± 3.1%) i/s -     26.553M
         full_string      6.034M (± 3.2%) i/s -     30.179M
         full_direct      5.955M (± 3.2%) i/s -     29.807M

Comparison:
         full_string:  6033829.1 i/s
         full_direct:  5954626.6 i/s - 1.01x slower
         direct_ivar:  5288105.5 i/s - 1.14x slower
     hoist_ivar_name:  3146595.7 i/s - 1.92x slower
           full_meta:  1840087.6 i/s - 3.28x slower

And look at that, the full_meta version is over 3 times slower than the directly defined method! Of course direct_ivar is also pretty close, but it’s an unrealistic scenario as the instance variable name is what is really changing. You can interpolate the string of the method definition in the full_string version, though. This achieves results as if the method had been directly defined. But what’s happening here?

It seems that there is a higher than expected cost associated with calling instance_variable_get, creating the necessary string and calling methods defined by define_method overall. If you want to keep the full performance but still alter the code you have to resort to the evil eval and stitch your code together in string interpolation. Yay.
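To make the eval route concrete, here is a minimal sketch of the technique (the helper and class names are mine, not from Shoes): the varying instance variable name is interpolated into a source string, so the generated method reads the ivar directly instead of going through instance_variable_get at call time.

```ruby
# Hypothetical helper: defines a plain reader for the given ivar name by
# interpolating it into a source string and class_eval-ing that string.
# The string interpolation happens once, at definition time – the
# resulting method is indistinguishable from a hand-written one.
module DirectReader
  def self.define_for(klass, name)
    klass.class_eval <<-CODE
      def #{name}
        @#{name}
      end
    CODE
  end
end

class FakePoint
  def initialize(x)
    @x = x
  end
end

DirectReader.define_for(FakePoint, :x)
puts FakePoint.new(42).x  # prints 42
```

The trade-off is readability and safety (string eval is easy to get wrong), which is why it should be reserved for genuinely hot paths like the one described above.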

So what, do we all have to eval method definitions for metaprogramming now?

Thankfully no. The performance overhead is constant – if your method does more expensive calculations the overhead diminishes. This is the somewhat rare case of a method that doesn’t do much (even the slowest version can be executed almost 2 Million times per second) but is called a lot. It is one of the core methods when positioning UI objects in Shoes. Obviously we should also do the hard work and try not to call that method that often, we’re working on that and already made some nice progress. But, to quote Jason, “regardless what we do I think that Dimension is bound to always be in our hot path.”.

What about methods that do more though? Let’s take a look at an example where we have an object that has an array set as an instance variable and has a method that concatenates another array and sorts the result (full gist):

class Try

  def initialize(array)
    @array = array
  end

  define_method :meta_concat_sort do |array|
    value = instance_variable_get '@' + :array.to_s
    new_array = value + array
    new_array.sort
  end

  def concat_sort(array)
    new_array = @array + array
    new_array.sort
  end
end

We then benchmark those two methods with the same base array but two differently sized input arrays:

BASE_ARRAY        = [8, 2, 400, -4, 77]
SMALL_INPUT_ARRAY = [1, 88, -7, 2, 133]
BIG_INPUT_ARRAY   = (1..100).to_a.shuffle

What’s the result?

Small input array
Calculating -------------------------------------
    meta_concat_sort    62.808k i/100ms
         concat_sort    86.143k i/100ms
-------------------------------------------------
    meta_concat_sort    869.940k (± 1.4%) i/s -      4.397M
         concat_sort      1.349M (± 2.6%) i/s -      6.805M

Comparison:
         concat_sort:  1348894.9 i/s
    meta_concat_sort:   869940.1 i/s - 1.55x slower

Big input array
Calculating -------------------------------------
    meta_concat_sort    18.872k i/100ms
         concat_sort    20.745k i/100ms
-------------------------------------------------
    meta_concat_sort    205.402k (± 2.7%) i/s -      1.038M
         concat_sort    231.637k (± 2.5%) i/s -      1.162M

Comparison:
         concat_sort:   231636.7 i/s
    meta_concat_sort:   205402.2 i/s - 1.13x slower

With the small input array the dynamically defined method is still over 50% slower than the non meta programmed method! When we have the big input array (100 elements) the meta programmed method is still 13% slower, which I still consider very significant.

I ran these with CRuby 2.2.2, in case you are wondering whether this is implementation-specific. I ran the same benchmark with JRuby and got comparable results – apart from the fact that JRuby is usually 1.2 to 2 times faster than CRuby, the slowdowns were about the same.

So in the end, what does it mean? Always benchmark. Don’t blindly optimize calls like these as in the grand scheme of things they might not make a difference. This will only be really important for you if a method gets called a lot. If it is in your library/application, then replacing the meta programmed method definitions might yield surprising performance improvements.

UPDATE 1: Shortly after this post was published, coincidentally, JRuby 9.0.3.0 was released with improvements to the call speed of methods defined by define_method. I added the benchmarks to the comments of the gist. It is 7-15% faster for full_meta and hoist_ivar_name, and now direct_ivar is about as fast as its full_direct and full_string counterparts thanks to the optimizations!

UPDATE 2: I wrote a small benchmark about what I think is the bottle neck here – instance_variable_get. It is missing the slowest case but is still up to 3 times slower than the direct access.