Tail Call Optimization in Elixir & Erlang – not as efficient and important as you probably think

(Automatic) Tail Call Optimization (TCO) is that great feature of Elixir and Erlang that everyone tells you about. It’s super fast, super cool and you should definitely always aim to make every recursive function tail-recursive. What if I told you that body-recursive functions can be faster and more memory efficient than their specially optimized tail-recursive counterparts?

Seems unlikely, doesn’t it? After all, every beginner’s book will mention TCO, tell you how efficient it is and that you should definitely use it. Plus, maybe you’ve tried body-recursion before in language X and your call stack blew up or it was horrendously slow. I did, and I thought tail-recursive functions were always better than body-recursive ones. Until one day, by accident, I wrote a non-tail-recursive function (so TCO didn’t apply to it). Someone pointed it out and I eagerly replaced it with its tail-recursive counterpart. Then I stopped for a second and benchmarked it: the results were surprising, to say the least.

Before we do a deep dive into the topic, let’s take a refresher on tail call optimization from Dave Thomas in Programming Elixir 1.2 (great book!):

(…) it ends up calling itself. In many languages, that adds a new frame to the stack. After a large number of messages, you might run out of memory.

This doesn’t happen in Elixir, as it implements tail-call optimization. If the last thing a function does is call itself, there’s no need to make the call. Instead, the runtime simply jumps back to the start of the function. If the recursive call has arguments, then these replace the original parameters.
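To make this more concrete, here is a minimal sketch of my own (module and function names are mine, not from the book): a body-recursive and a tail-recursive sum over a list. Only the second one can benefit from TCO, as its recursive call is the very last thing it does.

defmodule SumExample do
  # Body-recursive: the addition happens *after* the recursive call returns,
  # so every element needs its own stack frame until the list is exhausted.
  def sum_body([]), do: 0
  def sum_body([head | tail]), do: head + sum_body(tail)

  # Tail-recursive: the recursive call is the last thing the function does,
  # so the runtime can reuse the current frame and carry the result in acc.
  def sum_tail(list, acc \\ 0)
  def sum_tail([], acc), do: acc
  def sum_tail([head | tail], acc), do: sum_tail(tail, acc + head)
end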

Well, let’s get into this 🙂

Writing a map implementation

So let’s write some implementations of the map function. One will be body-recursive and one will be tail-recursive. I’ll add another tail-recursive implementation that uses ++ without a reverse, and one that simply does not reverse the list at the end. The one that doesn’t reverse the list of course isn’t functionally equivalent to the others, as the elements are not in order, but if you wrote your own function and don’t care about ordering this might be for you. In an update I also added a version where the argument order is different; for more on this see the results and edit6.


defmodule MyMap do
  # Tail-recursive: accumulates the mapped elements and reverses them at the end.
  def map_tco(list, function) do
    Enum.reverse(_map_tco([], list, function))
  end

  defp _map_tco(acc, [head | tail], function) do
    _map_tco([function.(head) | acc], tail, function)
  end

  defp _map_tco(acc, [], _function) do
    acc
  end

  # Same as map_tco, but the accumulator is the last argument instead of the first.
  def map_tco_arg_order(list, function) do
    Enum.reverse(do_map_tco_arg_order(list, function, []))
  end

  defp do_map_tco_arg_order([], _function, acc) do
    acc
  end

  defp do_map_tco_arg_order([head | tail], function, acc) do
    do_map_tco_arg_order(tail, function, [function.(head) | acc])
  end

  # Tail-recursive, but appends with ++ instead of prepending and reversing.
  def map_tco_concat(acc \\ [], list, function)

  def map_tco_concat(acc, [head | tail], function) do
    map_tco_concat(acc ++ [function.(head)], tail, function)
  end

  def map_tco_concat(acc, [], _function) do
    acc
  end

  # Body-recursive: the cons happens after the recursive call returns.
  def map_body([], _func), do: []

  def map_body([head | tail], func) do
    [func.(head) | map_body(tail, func)]
  end

  # Like map_tco, but returns the accumulator without reversing it.
  def map_tco_no_reverse(list, function) do
    _map_tco([], list, function)
  end
end

map_body here is the function I originally wrote. It is not tail-recursive, as the last operation in this function is the list construction ([func.(head) | …]), not the call to map_body. Comparing it to all the other implementations, I’d also argue it’s the easiest and most readable, as we don’t have to care about accumulators or reversing the list.

Now that we have the code, let’s benchmark the functions with benchee! The benchmark was run on Elixir 1.3 with Erlang 19 on an i7-4790 under Linux Mint 17.3. We’ll just map over a large list and add one to each element. We’ll also throw in the standard library implementation of map as a comparison baseline:


list = Enum.to_list(1..10_000)
map_fun = fn(i) -> i + 1 end

Benchee.run %{
  "map tail-recursive with ++" =>
    fn -> MyMap.map_tco_concat(list, map_fun) end,
  "map with TCO reverse" =>
    fn -> MyMap.map_tco(list, map_fun) end,
  "stdlib map" =>
    fn -> Enum.map(list, map_fun) end,
  "map simple without TCO" =>
    fn -> MyMap.map_body(list, map_fun) end,
  "map with TCO new arg order" =>
    fn -> MyMap.map_tco_arg_order(list, map_fun) end,
  "map TCO no reverse" =>
    fn -> MyMap.map_tco_no_reverse(list, map_fun) end
}, time: 10, warmup: 10


tobi@happy ~/github/elixir_playground $ mix run bench/tco_blog_post_detailed.exs
Benchmarking map tail-recursive with ++…
Benchmarking map TCO no reverse…
Benchmarking stdlib map…
Benchmarking map with TCO new arg order…
Benchmarking map simple without TCO…
Benchmarking map with TCO reverse…

Name                             ips        average    deviation       median
map simple without TCO       6015.31       166.24μs      (±6.88%)     163.00μs
stdlib map                   5815.97       171.94μs     (±11.29%)     163.00μs
map with TCO new arg order   5761.46       173.57μs     (±10.24%)     167.00μs
map TCO no reverse           5566.08       179.66μs      (±6.39%)     177.00μs
map with TCO reverse         5262.89       190.01μs      (±9.98%)     182.00μs
map tail-recursive with ++      8.66    115494.33μs      (±2.86%)  113537.00μs

Comparison:
map simple without TCO       6015.31
stdlib map                   5815.97 - 1.03x slower
map with TCO new arg order   5761.46 - 1.04x slower
map TCO no reverse           5566.08 - 1.08x slower
map with TCO reverse         5262.89 - 1.14x slower
map tail-recursive with ++      8.66 - 694.73x slower

(The benchmarking was actually done with this slightly more verbose but equivalent script, so that it generates the CSV output and the console output in the same run using benchee’s more verbose interface. A feature to make that possible with the nicer interface is planned 😉 )

For the more visually inclined, here is also a graph showcasing the results (visualized using benchee_csv):

[Graph: iterations per second per implementation (higher is better)]

So what do we see? The body-recursive function seems to be as fast as the standard library version. The reported values are faster, but well within the margin of error. Plus, the median of the two is the same while the standard deviation is higher for the standard library version. This hints at the possibility that the worse average comes from a few outliers (resulting, for example, from garbage collection). The tail-recursive version with ++ is VERY SLOW, but that’s because appending with ++ so frequently is a bad idea: it has to traverse (and copy) the whole left-hand list every time around (O(n) per append). But that’s not the main point.
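As a quick aside before we get to that main point, here is a tiny sketch of my own illustrating why repeated ++ hurts so much: prepending with the cons operator reuses the existing list as the tail, while ++ has to copy its whole left-hand side on every call.

acc = [1, 2, 3]

# O(1): builds one new cons cell and reuses acc as the tail.
[0 | acc]

# O(length(acc)): copies all of acc to attach the element at the end.
# Doing this once per element, as map_tco_concat does, is roughly O(n²) overall.
acc ++ [4]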

The main point is that the tail-recursive version is about 14% slower! Even the tail-recursive version that doesn’t reverse the list is slower than the body-recursive implementation!

This seems like a good point to interject and mention that, as brought up in the comments (see edit11), for significantly larger lists the tail-recursive implementations get faster again. You can check out the results from a 10 million item list.

What is highly irritating and surprising to me is that the tail-recursive function with a slightly different argument order is significantly faster than my original implementation, by almost 10%. And this is not a one-off; it is consistently faster across a number of runs. You can see more about that implementation in edit6 below. Thankfully, José Valim chimed in about the argument order, adding the following:

The order of arguments will likely matter when we generate the branching code. The order of arguments will specially matter if performing binary matching. The order of function clauses matter too although I am not sure if it is measurable (if the empty clause comes first or last).

Now, maybe there is a better tail-recursive version (please tell me!), but this result, staggering as it is, is repeatable and consistent. So, what happened here?

An apparently common misconception

That tail-recursive functions are always faster seems to be a common misconception – common enough that it made the list of Erlang Performance Myths as “Myth: Tail-Recursive Functions are Much Faster Than Recursive Functions”! (Note: this section is currently being reworked so the name might change/link might not lead to it directly any more in the near-ish future)

To quote that:

A body-recursive list function and a tail-recursive function that calls lists:reverse/1 at the end will use the same amount of memory. lists:map/2, lists:filter/2, list comprehensions, and many other recursive functions now use the same amount of space as their tail-recursive equivalents.

So, which is faster? It depends. On Solaris/Sparc, the body-recursive function seems to be slightly faster, even for lists with a lot of elements. On the x86 architecture, tail-recursion was up to about 30% faster

The topic also came up again recently on the erlang-questions mailing list, in a discussion about the rework of the aforementioned Erlang performance myths site (which is really worth the read!). In it, Fred Hebert remarks (emphasis added by me):

In cases where all your function does is build a new list (or any other accumulator whose size is equivalent to the number of iterations and hence the stack) such as map/2 over nearly any data structure or say zip/2 over lists, body recursion may not only be simpler, but also faster and save memory over time.

He also has his own blog post on the topic.

But won’t it explode?

I had the same question. From my experience with the Clojure koans I expected the body-recursive function to blow up the call stack given a large enough input. But I didn’t manage to, no matter what I tried.

It seems that is practically impossible, as the BEAM VM that Erlang and Elixir run on differs in its implementation from other VMs; body recursion is limited only by available RAM:

Erlang has no recursion limit. It is tail call optimised. If the recursive call is not a tail call it is limited by available RAM
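A quick sanity check of my own (not from the original experiments): a deeply body-recursive function that would overflow the call stack in many other runtimes simply completes on the BEAM, as long as there is enough memory.

defmodule Deep do
  # Not a tail call: the addition happens after the recursive call returns.
  def depth(0), do: 0
  def depth(n), do: 1 + depth(n - 1)
end

# Ten million nested calls: no stack overflow, just (temporary) memory usage.
Deep.depth(10_000_000)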

Memory consumption

So what about memory consumption? Let’s create a list with one hundred million elements (100_000_000) and map over it, measuring the memory consumption. When we do this, the tail-recursive version takes almost 13 Gigabytes of memory while the body-recursive version takes a bit more than 11.5 Gigabytes. Details can be found in this gist.

Why is that? Most likely because, with such a large list, the tail-recursive version needs to build a new, reversed copy of the accumulator to return a correctly ordered result.
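If you want to poke at this yourself without the full script from the gist, here is a rough, hand-rolled sketch (my own; the module name and polling interval are arbitrary, and the numbers will fluctuate with garbage collection timing) that runs a function in a fresh process and reports the largest heap size it observed:

defmodule MemPeek do
  # Runs fun in a separate process and polls that process's memory usage
  # until the work is done, returning the peak number of bytes observed.
  def peak(fun) do
    parent = self()

    pid =
      spawn(fn ->
        fun.()
        send(parent, :done)
      end)

    poll(pid, 0)
  end

  defp poll(pid, peak) do
    receive do
      :done -> peak
    after
      10 ->
        case :erlang.process_info(pid, :memory) do
          {:memory, bytes} -> poll(pid, max(peak, bytes))
          nil -> peak
        end
    end
  end
end

list = Enum.to_list(1..1_000_000)
fun = fn i -> i + 1 end

IO.inspect(MemPeek.peak(fn -> MyMap.map_body(list, fun) end), label: "body-recursive peak bytes")
IO.inspect(MemPeek.peak(fn -> MyMap.map_tco(list, fun) end), label: "tail-recursive peak bytes")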

Body-recursive functions all the time now?

So, to recap, the body-recursive version of map is:

  • faster
  • consumes less memory
  • easier to read and maintain

So why shouldn’t we do this every time? Well, there are other examples, of course. Let’s take a look at a very dumb function deciding whether a number is even (implemented as a homage to this Clojure koans exercise that showed how the call stack blows up in Clojure without recur):


defmodule Number do
  # Body-recursive: the negation happens after the recursive call returns.
  def is_even?(0), do: true
  def is_even?(n) do
    !is_even?(n - 1)
  end

  # Tail-recursive: the result is flipped in the accumulator on the way down.
  def is_even_tco?(n, acc \\ true)
  def is_even_tco?(0, acc), do: acc
  def is_even_tco?(n, acc) do
    is_even_tco?(n - 1, !acc)
  end
end


number = 10_000_000

Benchee.run %{
  "is_even?"     => fn -> Number.is_even?(number) end,
  "is_even_tco?" => fn -> Number.is_even_tco?(number) end
}


tobi@happy ~/github/elixir_playground $ mix run bench/is_even.exs
Benchmarking is_even?…
Benchmarking is_even_tco?…

Name                 ips        average    deviation       median
is_even?           10.26     97449.21μs      (±0.50%)   97263.00μs
is_even_tco?        9.39    106484.48μs      (±0.09%)  106459.50μs

Comparison:
is_even?           10.26
is_even_tco?        9.39 - 1.09x slower

The tail-recursive version here is still about 10% slower. But what about memory? Running the function with one hundred million as input takes 41 Megabytes for the tail-recursive version (mind you, this is the whole Elixir process) but almost 6.7 Gigabytes for the body-recursive version. Also, for that huge input the tail-recursive version took 1.3 seconds, while the body-recursive function took 3.86 seconds. So for inputs this large, the tail-recursive version is faster.

A stark contrast, isn’t it? That’s most likely because this time around there is no huge list to be carried around or accumulated, just a boolean and a number. Here the fact that the body-recursive function needs to keep its growing call stack in RAM is much more damaging, as it needs to call itself one hundred million times.

Rerunning the map implementation with a significantly larger list, the tail-recursive versions also become faster than the standard library version and the body-recursive function. See edit11.

So, what now?

Tail-recursive functions should still be faster and more efficient for many, or even most, use cases. Or at least that’s what I believe after years of being taught that tail call optimization leads to the fastest recursive functions ;). This post isn’t here to say that TCO is bad or slow. It is here to highlight that there are cases where body-recursive functions are faster and more efficient than tail-recursive functions. I’m also still unsure why the tail-recursive function that does not reverse the list is slower than the body-recursive version; it might be because it has to carry the accumulator around.

Maybe we should also take a step back in education and teaching and be more careful not to overemphasize tail call optimization and with it tail-recursive functions. Body-recursive functions can be a viable, or even superior, alternative and they should be presented as such.

There are, of course, cases where writing tail-recursive functions is absolutely vital, as Robert Virding, creator of Erlang, rightfully highlights:

No, the main case where TCO is critical is in process top-loops. These functions never return (unless the process dies) so they will build up stack never to release it. Here you have to get it right. There are no alternatives. The same applies if your top-loop is actually composed of a set of mutually calling functions. There, there are no alternatives. Sorry for pushing this again, and again, but it is critical. 🙂

But what does this teach us in the end? Don’t take assumptions stemming from other programming environments for granted. Also, don’t assume; measure and prove it. So let’s finish with the closing words of the Erlang performance myths section on this:

So, the choice is now mostly a matter of taste. If you really do need the utmost speed, you must measure. You can no longer be sure that the tail-recursive list function always is the fastest.

edit1: there previously was a bug in is_even_tco? with a missing not, not caught by my tests as they were calling the wrong function 😦 Thanks to Puella for notifying me.

edit2/addendum: (see next edit/update) It was pointed out at lobste.rs that, running it in an Erlang session, the body-recursive function was significantly slower than the TCO version. Running what I believe to be equivalent code in Elixir and Erlang, it seems that the Erlang map_body version is significantly slower than in Elixir (2 to 3 times, by the looks of it). I’d need to run it with an Erlang benchmarking tool to confirm this, though.

edit3/addendum2: The small tries mentioned in the first addendum were run in the shell, which is not a great idea. Using my limited Erlang knowledge I made something that compiled that “benchmark”, and map_body is as fast or faster again (see the thread). Benchmarking can be fickle and wrong if not done right, so I’d still look forward to running this in a proper Erlang benchmarking tool or using Benchee from Erlang. But no time right now 😦

edit4: Added comment from Robert Virding regarding process top loops and how critical TCO is there. Thanks for reading, I’m honoured and surprised that one of the creators of Erlang read this 🙂 His full post is of course worth a read.

edit5: Following a rightful nitpick, I don’t write “tail call optimized” functions any more but rather “tail-recursive”, as tail call optimization is more of a feature of the compiler and not directly an attribute of the function.

edit6: Included another version in the benchmark that swaps the argument order so that the list stays the first argument and the accumulator is the last argument. Surprisingly (yet again) this version is consistently faster than the other tail-recursive implementation, but still slower than the body-recursive one. I want to thank Paweł for pointing his version out in the comments. The reversed argument order was the only distinguishing factor I could make out in his version, not the assignment of the new accumulator. I benchmarked all the different variants multiple times; it is consistently faster, although I could never reproduce it being the fastest. For the memory consumption example it seemed to consume about 300MB less than the original tail-recursive function and was a bit faster. Also, since I was rerunning everything anyway, I ran it with the freshly released Elixir 1.3 and Erlang 19. I also increased the runtime of the benchmark as well as the warmup (to 10 seconds each) to get more consistent results overall. And I wrote a new benchmarking script so that the results shown in the graph are from the same run as the console output.

edit7: Added a little TCO intro as it might be helpful for some 🙂

edit8: In case you are interested in what the bytecode looks like (I haven’t dug into it myself, not my area of expertise yet), there is a gist from Sasa Juric showing how you can get the bytecode from an Elixir function.
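For the impatient, here is a minimal sketch of one way to do this (my own approximation, not Sasa’s gist; it assumes MyMap has been compiled to a .beam file on the code path, e.g. via mix compile):

# Disassemble the compiled MyMap module and print the instructions per function.
beam_path = :code.which(MyMap)

{:beam_file, _module, _exports, _attributes, _compile_info, functions} =
  :beam_disasm.file(beam_path)

Enum.each(functions, fn {:function, name, arity, _entry_label, instructions} ->
  IO.puts("\n#{name}/#{arity}")
  Enum.each(instructions, &IO.inspect/1)
end)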

edit9: Added José’s comment about the argument order, thanks for reading and commenting! 🙂

edit10: Added CPU and OS information 🙂

edit11: It was pointed out in the comments that the performance characteristics might differ for larger lists. And they do: with 1000 times as many elements (10_000_000), all tail-recursive versions are significantly faster than the body-recursive version (and even the stdlib version).

edit12: Updated to reflect newer Benchee API as the old API with lists of tuples doesn’t work anymore 🙂

 

20 thoughts on “Tail Call Optimization in Elixir & Erlang – not as efficient and important as you probably think”

  1. Hey!

    Thanks for your interesting post! It was really nice to read! These are interesting findings; however, I’m afraid that what in your solution was described as a Tail Recursion Optimised solution is in fact not really one:

      defp _map_tco(acc, [head | tail], function) do
        _map_tco([function.(head) | acc], tail, function)
      end
    

    In here, the last executed term is in fact a call to the function, but unfortunately, within this call you’re performing the calculation (`function.(head)`) as well. The function will be TCO’d if the function call is the last one executed, and the only one executed.

    Please, take a look at this solution:

      def my_map(list, fun) do
        do_my_map(list, fun, [])
      end
      def do_my_map([], _fun, acc) do
        Enum.reverse acc
      end
      def do_my_map([head | tail], fun, acc) do
        new_acc = [fun.(head) | acc]
        do_my_map(tail, fun, new_acc)
      end
    

    If you take a look at the last clause, it performs the computation (`new_acc = [fun.(head) | acc]`) and, with the new value, performs the call to the function (`do_my_map(tail, fun, new_acc)`). The results on my machine are as follows:

    ➜  tco_bench mix run lib/check.exs
    Compiled lib/tco_bench.ex
    Benchmarking map with TCO reverse...
    Benchmarking map with TCO and ++...
    Benchmarking map simple without TCO...
    Benchmarking map TCO no reverse...
    Benchmarking my_map...
    Benchmarking stdlib map...
    
    Name                          ips            average        deviation      median
    map simple without TCO        3779.45        264.59μs       (±18.86%)      247.00μs
    my_map                        3770.22        265.24μs       (±18.61%)      247.00μs
    stdlib map                    3730.08        268.09μs       (±19.84%)      252.00μs
    map TCO no reverse            3484.50        286.99μs       (±18.92%)      266.00μs
    map with TCO reverse          3006.32        332.63μs       (±19.99%)      322.00μs
    map with TCO and ++           3.99           250499.84μs    (±5.98%)       251028.00μs
    
    Comparison:
    map simple without TCO        3779.45
    my_map                        3770.22         - 1.00x slower
    stdlib map                    3730.08         - 1.01x slower
    map TCO no reverse            3484.50         - 1.08x slower
    map with TCO reverse          3006.32         - 1.26x slower
    map with TCO and ++           3.99            - 946.75x slower
    
    ➜  tco_bench mix run lib/check.exs
    Benchmarking map with TCO reverse...
    Benchmarking map with TCO and ++...
    Benchmarking map simple without TCO...
    Benchmarking map TCO no reverse...
    Benchmarking my_map...
    Benchmarking stdlib map...
    
    Name                          ips            average        deviation      median
    map simple without TCO        3833.15        260.88μs       (±18.53%)      245.00μs
    stdlib map                    3805.99        262.74μs       (±18.94%)      250.00μs
    my_map                        3771.31        265.16μs       (±19.16%)      247.00μs
    map TCO no reverse            3461.84        288.86μs       (±20.13%)      267.00μs
    map with TCO reverse          3086.25        324.02μs       (±17.14%)      321.00μs
    map with TCO and ++           3.95           253440.21μs    (±6.24%)       250962.00μs
    
    Comparison:
    map simple without TCO        3833.15
    stdlib map                    3805.99         - 1.01x slower
    my_map                        3771.31         - 1.02x slower
    map TCO no reverse            3461.84         - 1.11x slower
    map with TCO reverse          3086.25         - 1.24x slower
    map with TCO and ++           3.95            - 971.47x slower
    
    ➜  tco_bench mix run lib/check.exs
    Benchmarking map with TCO reverse...
    Benchmarking map with TCO and ++...
    Benchmarking map simple without TCO...
    Benchmarking map TCO no reverse...
    Benchmarking my_map...
    Benchmarking stdlib map...
    
    Name                          ips            average        deviation      median
    map simple without TCO        3834.41        260.80μs       (±18.37%)      245.00μs
    my_map                        3812.49        262.30μs       (±17.96%)      247.00μs
    stdlib map                    3799.19        263.21μs       (±18.56%)      250.00μs
    map TCO no reverse            3485.17        286.93μs       (±19.02%)      267.00μs
    map with TCO reverse          3041.72        328.76μs       (±18.70%)      322.00μs
    map with TCO and ++           4.09           244627.10μs    (±5.48%)       242761.00μs
    
    Comparison:
    map simple without TCO        3834.41
    my_map                        3812.49         - 1.01x slower
    stdlib map                    3799.19         - 1.01x slower
    map TCO no reverse            3485.17         - 1.10x slower
    map with TCO reverse          3041.72         - 1.26x slower
    map with TCO and ++           4.09            - 938.00x slower
    

    In fact, the solution without `TCO` appears at the top of the result list, but in two of three tries, `my_map` is almost as performant.

    Similarly, `my_is_even?`:

      def my_is_even?(n) do
        my_is_even?(n, true)
      end
      def my_is_even?(0, acc) do
        acc
      end
      def my_is_even?(n, acc) do
        new_n = n - 1
        new_acc = !acc
        my_is_even?(new_n, new_acc)
      end
    

    With results:

    ➜  tco_bench mix run lib/check_number.exs
    Compiled lib/tco_bench.ex
    Benchmarking is_even?...
    Benchmarking is_even_tco?...
    Benchmarking my_is_even?...
    
    Name                          ips            average        deviation      median
    is_even?                      7.04           141987.93μs    (±3.05%)       142986.00μs
    my_is_even?                   6.81           146892.47μs    (±3.60%)       145488.50μs
    is_even_tco?                  6.78           147587.36μs    (±3.16%)       146817.00μs
    
    Comparison:
    is_even?                      7.04
    my_is_even?                   6.81            - 1.03x slower
    is_even_tco?                  6.78            - 1.04x slower
    
    ➜  tco_bench mix run lib/check_number.exs
    Benchmarking is_even?...
    Benchmarking is_even_tco?...
    Benchmarking my_is_even?...
    
    Name                          ips            average        deviation      median
    is_even?                      7.04           142105.27μs    (±2.63%)       141226.50μs
    my_is_even?                   6.90           144946.26μs    (±3.76%)       144587.50μs
    is_even_tco?                  6.85           145950.56μs    (±4.11%)       144922.50μs
    
    Comparison:
    is_even?                      7.04
    my_is_even?                   6.90            - 1.02x slower
    is_even_tco?                  6.85            - 1.03x slower
    
    ➜  tco_bench mix run lib/check_number.exs
    Benchmarking my_is_even?...
    Benchmarking is_even?...
    Benchmarking is_even_tco?...
    
    Name                          ips            average        deviation      median
    my_is_even?                   7.00           142768.66μs    (±3.36%)       143294.00μs
    is_even?                      6.97           143423.80μs    (±3.78%)       142783.50μs
    is_even_tco?                  6.88           145270.09μs    (±3.68%)       144500.50μs
    
    Comparison:
    my_is_even?                   7.00
    is_even?                      6.97            - 1.00x slower
    is_even_tco?                  6.88            - 1.02x slower
    

    In fact, I haven’t noticed any significant difference with (actually, the results vary a bit depending on how many times the script is run):

      def my_is_even?(n, acc) do
        new_n = n - 1
        new_acc = !acc
        my_is_even?(new_n, new_acc)
      end
    

    or just _flipping_ the `acc` result in the function call:

      def my_is_even?(n, acc) do
        new_n = n - 1
        my_is_even?(new_n, !acc)
      end
    

    Again, thanks for your article! It’s a nice read, and you made me stop, think and, in fact, measure on my own. This is an invaluable lesson!

    Best regards!

    edit1 by PragTob: replaced github markdown style code blocks with code blocks from wordpress for better readability

    1. I attempted to recreate your results on an Intel i7-5390K@3.50GHz Windows 10 (1607) platform with Elixir 1.3.4 Erlang/OTP 19 (It would be useful if you would state your benchmark platform in your article).

      Your code, as written here, gives me similar results, but they are not very stable. A little research into Benchee finds that it is statistically unstable when the time measurements are in microseconds, since the smallest measured time unit is the microsecond. It is recommended that tests be run that take at least 100s of milliseconds. When I increase the array size to 100_000_000, so that the runs take seconds to complete, I then get very stable results that mostly conform to what is expected from the different types of recursion loops.

      The TCOs are faster and, surprisingly, the stdlib function is significantly slower. This might be due to the fact that the current Elixir Enum.map source, when is_list(enumerable), uses a list comprehension to perform the map. While there has obviously been some work done on Elixir since you posted this, it seems that you do need to vastly increase the array size for these tests to allow Benchee to statistically stabilize.

      Erlang/OTP 19 [erts-8.0] [64-bit] [smp:12:12] [async-threads:10]
      Elixir 1.3.4
      Benchmark suite executing with the following configuration:
      warmup: 10.0s
      time: 10.0s
      parallel: 1
      Estimated total run time: 120.0s
      
      Benchmarking map TCO no reverse...
      Benchmarking map simple without TCO...
      Benchmarking map tail-recursive with ++...
      Benchmarking map with TCO new arg order...
      Benchmarking map with TCO reverse...
      Benchmarking stdlib map...
      
      Name                                 ips        average  deviation         median
      map TCO no reverse                  0.33         3.06 s    ±23.57%         3.11 s
      map with TCO reverse                0.26         3.84 s    ±28.88%         3.84 s
      map with TCO new arg order          0.26         3.91 s    ±18.79%         3.91 s
      map tail-recursive with ++        0.0918        10.90 s    ±12.83%        10.90 s
      stdlib map                        0.0910        10.99 s    ±11.87%        10.99 s
      map simple without TCO            0.0899        11.13 s    ±13.20%        11.13 s
      
      Comparison:
      map TCO no reverse                  0.33
      map with TCO reverse                0.26 - 1.26x slower
      map with TCO new arg order          0.26 - 1.28x slower
      map tail-recursive with ++        0.0918 - 3.56x slower
      stdlib map                        0.0910 - 3.59x slower
      map simple without TCO            0.0899 - 3.63x slower
      

      The main problem I’ve always had with body vs. tail recursions is that, within the context of the language semantics, there should be no performance difference at all. If you think about the map function from the viewpoint of the functional language, both versions do exactly the same thing.
      1. Enumerate the list from start to end, storing each mapped value in a new list in reverse order.
      Body recursion stores them in an internal stack, and a List is indeed a FILO stack.
      Tail recursion stores them in a List given in the function arguments.
      2. Enumerate the reverse mapped list from start to end, storing each value in a new List in the original order.
      Body recursion does this when the recursion stack unwinds and each value of this simulated internal List is processed.
      Tail recursion uses a call to a reverse(list) function that does a tail recursion copy that only has to enumerate the List once.

      So, in language semantics, body and tail recursion do that exact same thing in the same manner, create a reversed mapped List and create a reverse of that List. Any performance difference is caused by implementation factors outside the semantics of the language and an implementation should do whatever it can to compile the same code and performance for both types of recursion source. The failure here is on the part of the VM and not the language or the coder. Source code that functionally does the same thing should generate the same performance if not the same code. The whole core of the functional language is to do the described function and not harass the coder with implementation artifacts.

      However, this point always seems to get lost in the desire to force the VM to perform to a desire rather than to fix a fundamental VM implementation flaw. At least, that’s been my thinking on this for some time now.

      1. Sorry. I have no idea how my previous post got attached to a specific message rather than being posted at the top level. I just used the “Leave a Reply” text box at the bottom and not the “Reply” button for that message. There seems to be no way to edit or fix it from my end.

        1. Hm it’s weird imo you gotta scroll all the way to the bottom… it gets a bit confusing with these long comments here 🙂 Also no idea how to reorder them from my side.

      2. Hi there, thanks for checking the post out. 🙂

        The article mentioned me using Elixir 1.3 and Erlang 19 – I added information about the CPU (i7-4790) and OS (Linux Mint 17.3).

        Of course, different sized inputs can cause different performance characteristics. It is well possible that for large inputs (or small inputs) the performance characteristics reverse and all of a sudden the other one is faster or slower. This is part of the reason why I want to introduce multiple inputs to Benchee to see the differing results. That’s why I always want people to run their own benchmarks 🙂 So, thanks for doing that. And there seems to be a difference depending on input size; it has nothing to do with statistical stability or the run time but just the input size imo. A couple of things first:

        Where did you see that information about Benchee with 100s of milliseconds? I’m asking because I’m the author of Benchee and the only thing I recommend is that measures take at least 10 microseconds to execute and will rerun them if they don’t as otherwise it is too unreliable. So I’m wondering if I made a typo somewhere in the docs – which is well possible 😀 The benchmarks took well over 100 microseconds so they should be more than fine imo.

        The results in the blog post are more stable from what I can see, as their standard deviation (measure of how widespread the measurements are) is lower than the ones reported in your benchmarks. That makes sense to a point, as more data also means more garbage which means more garbage collection and benchee does not yet have an option to try and avoid that.

        My system only has 16GB of RAM and with a list that big it starts swapping, which definitely impacts the benchmarking performance negatively. The other thing with your benchmark is that you only gave it a run time of 10 seconds while an individual run already takes almost 10 seconds or even more. This means benchee could only measure one run, so there are not enough samples to draw statistical conclusions. For good conclusions we need 100s or 1000s of runs. I haven’t implemented confidence intervals yet, but it’d be very low. It could also be that the non-TCO version was just unlucky and garbage collection hit while it was running, or that swapping started while it was running.
        The thing I’m wondering about, though, is why it still reports a deviation… that shouldn’t happen with just one run. I will have to look into that.

        All that said, I ran the benchmark with 10_000_000 elements, which is the highest I could do without swapping. And I could somewhat replicate your results: the TCO implementations become twice as fast as the one without TCO. So, thanks for pointing that out; I knew it was a possibility but didn’t thoroughly follow it in the original blog post 🙂 Here are my results: https://gist.github.com/PragTob/d07b88d9c5949d73e2ccc34657955699

        I do agree with your sentiment that something that is semantically equivalent should be about as fast. However, I’m not sure: TCO is an optimization, and imo the non-TCO version has to store more call stack data while the TCO version needs an additional accumulator, so they are somewhat different.

        1. Thanks for the prompt reply :). I saw the recommendation that Benchee tests should run in 100s of milliseconds in a users forum somewhere (which I’ve, of course, now lost; I should’ve kept a link). Experience has taught me that many factors affect benchmark and statistical results, so I did some searching for user comments on Benchee first to help me get some usage context.

          By “stability”, I did not mean the reported standard deviation but rather the repeatability of reported stats over multiple test runs. They greatly stabilized when I vastly increased the input size suggesting that there are internal mechanisms interfering with the reported results for the smaller run times.

          My platform is x64 with 32GB ( I really should’ve included that too). Your observations on the different input sizes and platform conditions really support the assertions that TR vs. BR speeds can vary quite a bit because of the different internal mechanisms that can be triggered by different platform and runtime conditions.

          My point on language semantics is that the internal method used by the BR version to create what is functionally a List on a call stack is outside the semantics of the language. The VM could use any other method to do the same thing: for example, allocating a heap List for the stacked recursion frame data and only using the function stack frame for call/return data. The internal method is immaterial to what the language itself is stating for the map function: map values into an inferred list in reverse order and then enumerate that list to reverse it. The BR inferred list created on the first enumeration pass is semantically equivalent to the one in the TR version that is passed as an argument. A good VM implementation should compile semantically equivalent source code in the same manner so that both versions perform the same.

          However, I don’t mean to say that a TR and BR version is always semantically the same. This is only true if the enumeration pass count is the same. For a function like length(list), this is not true as the TR version only requires one enumeration pass. The BR version always does two. It enumerates the given list to create a semantically inferred list of nils with the same length, and then enumerates over that semantically inferred list to accumulate the length. A true TR equivalent would have to create that list of nils with the same length and then enumerate it to return the length. So there is no true BR equivalent version to the preferred TR single-pass version.

          I have found that these semantic points are never properly covered in the guides to functional languages and, as a result, coders get polarized and frothy about the merits between tangled tail recursion or clearer body recursion source code. A coder should be able to use either and, if semantically the same, there should be no performance difference. It’s just two different ways to specify the same function.

  2. Hey there, thanks for the comment and the praise 🙂

    I can’t reproduce your results over here, maybe I’m doing something wrong. You can check out my implementation here: https://github.com/PragTob/elixir_playground/blob/master/lib/my_map.ex#L18-L32
    PR with a new implementation/fix for my implementation highly welcome!

    This gives me these results which are about the same:

    Name                                    ips        average    deviation         median
    stdlib map                          6335.80       157.83μs     (±4.06%)       156.00μs
    map simple without TCO              6291.33       158.95μs     (±9.99%)       156.00μs
    map TCO no reverse                  5780.59       172.99μs     (±5.29%)       170.00μs
    map with TCO reverse new acc        4936.33       202.58μs     (±7.65%)       211.00μs
    map with TCO reverse                4892.61       204.39μs    (±10.34%)       198.00μs
    map with TCO and ++                    5.76    173683.95μs     (±0.34%)    173404.00μs
    
    Comparison: 
    stdlib map                          6335.80
    map simple without TCO              6291.33 - 1.01x slower
    map TCO no reverse                  5780.59 - 1.10x slower
    map with TCO reverse new acc        4936.33 - 1.28x slower
    map with TCO reverse                4892.61 - 1.29x slower
    map with TCO and ++                    5.76 - 1100.43x slower
    

    Personally, I’d also be worried a bit if it made a difference as I’d expect such things to be compile time optimizations (e.g. the pulling out of a variable and handing it in).

    edit: forgot to mention, also increased the benchmarking and warmup time to make randomness less of a factor

    1. Hey! Thanks for your answer (and help formatting my comment!).

      I’ve added “my” implementation and opened a PR, so the changes, with example results and a couple of comments, can easily be checked here: https://github.com/PragTob/elixir_playground/pull/1

      The results are very interesting: even though our approaches are very similar (I’m reversing in my terminating clause, and I pass the arguments in a slightly different order), the performance characteristics are quite different. Are your findings similar to what I’ve experienced?

      Thanks for this very exciting thread!
      Best regards!

  3. The order of arguments will likely matter when we generate the branching code. The order of arguments will specially matter if performing binary matching. The order of function clauses matter too although I am not sure if it is measurable (if the empty clause comes first or last). However, there should be no difference in the version using a variable or not. Those should be equivalent.

  4. Wow I simply mention your name and not even an hour later you magically appear. Thanks Jose for explaining ❤ I'll incorporate that into the post in the lunch break or something 🙂 Great community Elixir/Erlang has with you all so friendly and approachable! 🙂

  5. Thanks for the post! Just a suggestion for readability: you could probably make those code snippets not display a horizontal line in between each line. I find it makes the code much harder to read personally, it’s probably a config on the github hosted service

    1. Hey there! Thanks for the comment. I just had a look at where the horizontal lines come from… and the thing is, the embedding seems to be implemented as a table, and this layout (here on the WordPress side) seems to have a style where tables get lines. As a non-paying user I can’t change the CSS (for all that I know), so the only option would be to switch the whole theme, which I might still consider but it’s a bigger step :/
