The not so low cost of calling dynamically defined methods

There are a couple of mantras that exist across programming communities; one of them is to avoid duplication, or to keep it DRY. Programming languages equip us with different tools to avoid duplication. In Ruby a popular way to achieve this is metaprogramming: methods are dynamically defined to get rid of all duplication, and we rejoice – yay! There might be other problems with metaprogrammed solutions, but at least we can be sure that the performance is the same as if we'd written that duplicated code by hand. Or can we?

As the title suggests, this post examines the performance of these metaprogrammed methods. If you are looking for method definition performance or pure call overhead, you'll find that information in this post by Aaron Patterson.

Before we get into the details I want to quickly highlight that this is not some theoretical micro benchmark I pulled out of thin air. These examples are derived from performance improvements on an actual project. That work was done by my friend Jason R. Clark on this pull request over at Shoes 4. As he doesn’t have time to write it up – I get to, so let’s get to it!

Let’s look at some methods!

(Please be aware that this example is simplified. Of course the real code has varying parts, most importantly the name of the instance variable, which is the reason the code was metaprogrammed in the first place.)

class FakeDimension
  def initialize(margin_start)
    @margin_start = margin_start
    @margin_start_relative = relative? @margin_start
  end

  def relative?(result)
    result.is_a?(Float) && result <= 1
  end

  def calculate_relative(result)
    (result * 100).to_i
  end

  define_method :full_meta do
    instance_variable_name = '@' + :margin_start.to_s
    value = instance_variable_get(instance_variable_name)
    value = calculate_relative value if relative? value
    value
  end

  IVAR_NAME = "@margin_start"
  define_method :hoist_ivar_name do
    value = instance_variable_get(IVAR_NAME)
    value = calculate_relative value if relative? value
    value
  end

  define_method :direct_ivar do
    value = @margin_start
    value = calculate_relative value if relative? value
    value
  end

  eval <<-CODE
  def full_string
    value = @margin_start
    value = calculate_relative value if relative? value
    value
  end
  CODE

  def full_direct
    value = @margin_start
    value = calculate_relative value if relative? value
    value
  end
end
Starting at the first define_method, these are all more or less the same method. We start with a fully metaprogrammed version, which even converts a symbol to an instance variable name, and end with the directly defined method without any metaprogramming. With all these methods being so similar, you'd expect them all to have roughly the same performance characteristics, right? Well, such is not the case, as the following benchmark demonstrates. I benchmarked these methods both for the case where the value is relative and for when it is not. The results are not too different – see the gist for details. Running the non-relative version on CRuby 2.2.2 with benchmark-ips I get the following results (higher is better):

           full_meta      1.840M (± 3.0%) i/s -      9.243M
     hoist_ivar_name      3.147M (± 3.3%) i/s -     15.813M
         direct_ivar      5.288M (± 3.1%) i/s -     26.553M
         full_string      6.034M (± 3.2%) i/s -     30.179M
         full_direct      5.955M (± 3.2%) i/s -     29.807M

         full_string:  6033829.1 i/s
         full_direct:  5954626.6 i/s - 1.01x slower
         direct_ivar:  5288105.5 i/s - 1.14x slower
     hoist_ivar_name:  3146595.7 i/s - 1.92x slower
           full_meta:  1840087.6 i/s - 3.28x slower

And look at that – the full_meta version is over 3 times slower than the directly defined method! Of course direct_ivar is also pretty close, but it's an unrealistic scenario, as the instance variable name is what actually varies. You can, however, interpolate the instance variable name into the string of the method definition, as in the full_string version; this achieves results as if the method had been directly defined. But what's happening here?

It seems there is a higher-than-expected cost associated with calling instance_variable_get, creating the necessary string, and calling methods defined by define_method in general. If you want to keep the full performance but still vary the code, you have to resort to the evil eval and stitch your code together with string interpolation. Yay.
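To make the eval-with-interpolation pattern concrete, here is a sketch (not the Shoes code – the class and attribute names are made up for illustration) of how the instance variable name can be interpolated into a string method definition, giving directly-defined performance for any number of attributes:

```ruby
class Dimension
  # the varying part that motivated metaprogramming in the first place
  ATTRIBUTES = [:margin_start, :margin_end]

  def relative?(value)
    value.is_a?(Float) && value <= 1
  end

  def calculate_relative(value)
    (value * 100).to_i
  end

  # interpolate the attribute name into the method definition string,
  # so the generated method uses a direct @ivar access
  ATTRIBUTES.each do |name|
    class_eval <<-CODE
      def #{name}_value
        value = @#{name}
        value = calculate_relative value if relative? value
        value
      end
    CODE
  end

  def initialize
    @margin_start = 0.5
    @margin_end   = 10
  end
end

dimension = Dimension.new
dimension.margin_start_value # => 50
dimension.margin_end_value   # => 10
```

The generated methods contain a literal `@margin_start` / `@margin_end` access, so they behave like hand-written methods at call time.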

So what, do we all have to eval method definitions for metaprogramming now?

Thankfully, no. The performance overhead is constant – if your method does more expensive calculations, the overhead diminishes. This is the somewhat rare case of a method that doesn't do much (even the slowest version can be executed almost 2 million times per second) but is called a lot. It is one of the core methods for positioning UI objects in Shoes. Obviously we should also do the hard work and try not to call that method so often; we're working on that and have already made some nice progress. But, to quote Jason, "regardless what we do I think that Dimension is bound to always be in our hot path.".

What about methods that do more though? Let’s take a look at an example where we have an object that has an array set as an instance variable and has a method that concatenates another array and sorts the result (full gist):

class Try

  def initialize(array)
    @array = array
  end

  define_method :meta_concat_sort do |array|
    value = instance_variable_get '@' + :array.to_s
    new_array = value + array
    new_array.sort
  end

  def concat_sort(array)
    new_array = @array + array
    new_array.sort
  end
end
We then benchmark those two methods with the same base array but two differently sized input arrays:

BASE_ARRAY        = [8, 2, 400, -4, 77]
SMALL_INPUT_ARRAY = [1, 88, -7, 2, 133]
BIG_INPUT_ARRAY   = (1..100).to_a.shuffle
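The comparison can be run with a harness along these lines – a sketch using the stdlib Benchmark module so it needs no extra gems (the numbers below come from benchmark-ips; see the gist for the exact setup):

```ruby
require 'benchmark'

# The Try class from above, repeated here so the snippet runs standalone
class Try
  def initialize(array)
    @array = array
  end

  define_method :meta_concat_sort do |array|
    value = instance_variable_get '@' + :array.to_s
    new_array = value + array
    new_array.sort
  end

  def concat_sort(array)
    new_array = @array + array
    new_array.sort
  end
end

BASE_ARRAY        = [8, 2, 400, -4, 77]
SMALL_INPUT_ARRAY = [1, 88, -7, 2, 133]

try = Try.new(BASE_ARRAY)
n = 200_000
Benchmark.bm(18) do |bm|
  bm.report('meta_concat_sort') { n.times { try.meta_concat_sort(SMALL_INPUT_ARRAY) } }
  bm.report('concat_sort')      { n.times { try.concat_sort(SMALL_INPUT_ARRAY) } }
end
```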

What’s the result?

Small input array
Calculating -------------------------------------
    meta_concat_sort    62.808k i/100ms
         concat_sort    86.143k i/100ms
    meta_concat_sort    869.940k (± 1.4%) i/s -      4.397M
         concat_sort      1.349M (± 2.6%) i/s -      6.805M

         concat_sort:  1348894.9 i/s
    meta_concat_sort:   869940.1 i/s - 1.55x slower

Big input array
Calculating -------------------------------------
    meta_concat_sort    18.872k i/100ms
         concat_sort    20.745k i/100ms
    meta_concat_sort    205.402k (± 2.7%) i/s -      1.038M
         concat_sort    231.637k (± 2.5%) i/s -      1.162M

         concat_sort:   231636.7 i/s
    meta_concat_sort:   205402.2 i/s - 1.13x slower

With the small input array the dynamically defined method is still over 50% slower than the non-metaprogrammed method! With the big input array (100 elements) the metaprogrammed method is still 13% slower, which I consider very significant.

In case you are wondering whether this is implementation specific: I ran these with CRuby 2.2.2, and I ran the same benchmark with JRuby with comparable results. JRuby is usually 1.2 to 2 times faster than CRuby overall, but the slowdowns were about the same.

So in the end, what does it mean? Always benchmark. Don't blindly optimize calls like these, as in the grand scheme of things they might not make a difference. This will only really matter if a method gets called a lot. If such a method exists in your library/application, then replacing the metaprogrammed method definition might yield surprising performance improvements.

UPDATE 1: Shortly after this post was published, JRuby coincidentally shipped a release with improvements to the call speed of methods defined by define_method. I added the benchmarks to the comments of the gist. full_meta and hoist_ivar_name are 7-15% faster, and thanks to the optimizations direct_ivar is now about as fast as its full_direct and full_string counterparts!

UPDATE 2: I wrote a small benchmark about what I think is the bottleneck here – instance_variable_get. It is missing the slowest case but is still up to 3 times slower than direct access.
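To reproduce that comparison yourself, a stdlib-only sketch (class and method names made up for illustration) could look like this:

```ruby
require 'benchmark'

class Probe
  def initialize
    @value = 42
  end

  # direct instance variable access
  def direct
    @value
  end

  # instance_variable_get with a precomputed name
  def via_get
    instance_variable_get(:@value)
  end

  # instance_variable_get with the name built on every call,
  # as in the full_meta version above
  def via_computed_get
    instance_variable_get('@' + :value.to_s)
  end
end

probe = Probe.new
n = 1_000_000
Benchmark.bm(18) do |bm|
  bm.report('direct')           { n.times { probe.direct } }
  bm.report('via_get')          { n.times { probe.via_get } }
  bm.report('via_computed_get') { n.times { probe.via_computed_get } }
end
```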

18 thoughts on "The not so low cost of calling dynamically defined methods"

  1. Great benchmark. I'd love to really understand why the difference is so great, and whether there are limitations that prevent the same optimizations or whether they are just hard / haven't been implemented yet.

    1. To my mind the main culprit is instance_variable_get – I’d also love to know if it could be optimized further. But that’s more of a question for Charlie/Tom/Koichi/Matz 🙂

  2. Not that surprising: instance_variable_get is slightly more expensive than using the variable directly, due to an additional method call. And define_method invokes a closure that might capture local variables. The classic string eval does exactly what the interpreter would do when loading the source code.
    However, JRuby shows that all this stuff can be optimized away by a smart JIT compiler =)

    btw. in Ruby there is also a symbol literal that supports interpolation: :"@#{var}".
    This version is slightly faster than ("@" + var).to_sym in CRuby. In JRuby, however, the literal is slower, which is quite strange.

    1. For me it wasn't surprising that it was slower, but the extent to which it was slower. I (not knowing more about VMs etc. than the occasional talk from Charlie/Tom/Zach/Matz) expected it to be in the 5 to 20% range, not over 200%.

      Ah, I didn't know that :"string" is a proper literal – I always thought it generated a string and turned that into a symbol through magic syntax. Thanks! 🙂

  3. Nice benchmark. Now I finally get why Rails (source) tends to use eval so much instead of define_method.

    Kai: Sounds like a bug in JRuby, should be reported. Are you sure instance_variable_get is more expensive? As far as I remember, the interpreter actually generates very similar code when you do @var, although I don't remember if it's an actual method call. It almost sounds like the instance_variable_get call could be optimized by the interpreter into the same thing as @var?

  4. I think the main difference between instance_variable_get and @var is that the first one goes through the method dispatcher. In theory, the user could overwrite the instance_variable_get method; therefore, a method call is required. However, such stuff could be optimized away by a JIT compiler in JRuby (together with similar things like send or const_get).

    For define_method the biggest problem is the closure. A closure has to capture all local variables in the scope, because you can do the following:

    x = 5
    define_method :bla do
      print x
    end
    x = 6
    bla() #=> 6

    Therefore, the variable x has to be put into some data structure on the heap (it survives the current stack frame).

    The nasty thing: It is possible to do this even if the closure does not contain any read or write to a variable:

    def foo
      x = 5
      some_closure = proc {}
      some_closure
    end

    closure = foo()
    closure.binding.eval "print x"

    So the method foo cannot know whether the closure it gives away needs the local variable x or not. It has to keep x around as soon as a closure occurs. Every creation of a closure results in a copy of the local variable table onto the heap. I think this is what makes the performance so bad, and it is not that easy to optimize away.

    1. The other reason instance_variable_get goes through method dispatch is that BasicObject does not have this method.

    2. When you look at the benchmark, the version that does direct instance variable access but is defined via define_method, i.e. with the block capture (which doesn't need to capture anything in this case), is only 1.13 times slower than the "normal" method. So I'm pretty sure it's the lesser of the two evils 🙂

      Plus, that is jittable/optimizable, as our JRuby friends optimized exactly that in this week's release:

      With that improvement, you almost don't feel the overhead of the block capture anymore:

      direct_ivar 6.953M (± 8.7%) i/s - 34.268M
      full_string 7.145M (± 4.4%) i/s - 35.668M
      full_direct 7.165M (± 4.1%) i/s - 35.813M

      The full_meta version is still 2.6 times slower. See the gist linked in the update for details 🙂

  5. I’m not too surprised about instance_variable_get being that much slower.

    When you have a direct variable access, we can essentially cache the location of the variable across accesses. If each time it’s still the same class of object, we just dive right in and grab the variable without doing expensive hash lookups based on the name.

    With instance_variable_get, we only optimize the call site for the actual call to instance_variable_get, since the parameter is usually variable (a dynamic variable name, as in your case). So every call ends up doing a full lookup of the variable location, which in JRuby's case involves grabbing the object's class, grabbing a hash-based table mapping names to variable offsets, and then finally going after the variable.

    I wrote a blog post about the hidden costs of hashtables here:

    Most Ruby implementations optimize Ruby in two ways:

    * Eliminate as many hash lookups as possible.
    * Try to optimize code across hash lookup boundaries.

    If you can’t do the first, you can’t do the second. In JRuby the optimized variable access actually inlines completely into the assembly code the JVM emits. The unoptimized access jumps through several CALL instructions and many, many memory dereferences that bog down the CPU.

    On top of all that you’ve got the cost of looking up a bazillion symbols based on a string that changes every time.

    I’ll ponder some ways to improve this.
