Slides: Where do Rubyists go?

I gave my first ever keynote yesterday at Ruby on Ice, which was a lot of fun. A lot of the talk is based on my “Where do Rubyists go?” survey, but also on researching and looking into the languages themselves. The talk looks at what programming languages Ruby developers learn for work or in their free time, what the major features of those languages are and how that compares to Ruby. What does that tell us about Ruby and our community?

Slides can be viewed here or on speakerdeck, slideshare or PDF

Abstract

Many Rubyists branch out and take a look at other languages. What are the similarities between those languages and Ruby? What are the differences? How does Ruby influence these languages?

Surprises with Nested Transactions, Rollbacks and ActiveRecord

Lately I acquired a new hobby. I went around and asked experienced Rails developers, whom I respect and value a lot, how many users the following script would create:
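
A minimal sketch of the script in question, assuming a plain User model with a username column (it mirrors the example from the Rails transaction documentation referenced further down):

User.transaction do
  User.create(username: 'Kotori')

  User.transaction do
    User.create(username: 'Nemu')
    raise ActiveRecord::Rollback
  end
end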

The result should be the same on pretty much any database and any Rails version. For the sake of argument you can assume Rails 5.1 and Postgres 9.6 (what I tested it with).

So, how many users does it create? Not one of the more than a handful of people I asked got the answer right (including myself).

The answer is 2.

Wait, WHAT?

Yup, you read that right. It creates 2 users; the rollback is effectively useless here. Ideally this should create one user (Kotori), but as some people know, nested transactions aren’t really a thing that databases support (save for MS-SQL, apparently). The people I asked who knew this then guessed 0, because if you can’t roll back part of it, better safe than sorry and roll all of it back, right?

Well, sadly the inner transaction rescues the rollback and then the outer transaction happily commits all of it. 😦

Before you get all worried – if an exception is raised and not caught the outer transaction can’t commit and hence 0 users are created as expected.
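
For illustration, a variant of the same sketch where a regular exception escapes both blocks (hypothetical, not part of the original script):

User.transaction do
  User.create(username: 'Kotori')

  User.transaction do
    User.create(username: 'Nemu')
    raise "boom" # a regular exception, not ActiveRecord::Rollback
  end
end
# => RuntimeError bubbles up, the outer transaction rolls back, 0 users persisted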

A fix

So, what can we do? When opening a transaction, we can pass requires_new: true to the transaction which will emulate a “real” nested transaction using savepoints:
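
Sketched on the same example (again assuming the User model from above):

User.transaction do
  User.create(username: 'Kotori')

  User.transaction(requires_new: true) do
    User.create(username: 'Nemu')
    raise ActiveRecord::Rollback # only rolls back to the savepoint now
  end
end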

As you’d expect this creates just one user.

Nah, doesn’t concern me I’d never write code like this!

Sure, you probably straight up won’t write code like this in a single file. However, split across multiple files – I think you might. You have one unit of business logic that you want to run in a transaction, and then you start reusing it in another method that’s also wrapped in a transaction. That happens more often than you think.

Plus it can happen even more often than that as every save operation is wrapped in its own transaction (for good reasons). That means, as soon as you save anything inside a transaction or you save/update records as part of a callback you might run into this problem.

Here’s a small example highlighting the problem:

As you probably expect by now this creates 2 users. And yes, I checked – if you run create with rollback: true outside of the transaction no user is created. Of course, you shouldn’t raise rollbacks in callbacks but I’m sure that someone somewhere does it.

In case you want to play with this, all of these examples (+ more) are up at my rails playground.

The saddest part of this surprise…

Unless you stumbled across this before, chances are this is at least somewhat surprising to you. If you knew this before, kudos to you. The saddest part is that this shouldn’t be a surprise to anyone though. A lot of what is written here is part of the official documentation, including the exact example I used. It introduces the example with the following wonderful sentence:

For example, the following behavior may be surprising:

As far as I can tell this documentation with the example has been there for more than 9 years, and fxn added the above sentence about 7 years ago.

Why do I even blog about this when it’s been in the official documentation all along? I think this deserves more attention and more people should know about it to avoid truly bad surprises. The fact that nobody I asked knew the answer encouraged me to write this. We should all take more care to read the documentation of the software we use – we might find something interesting, you know.

What do we learn from this?

READ THE DOCUMENTATION!!!!

Careful what you measure: 2.1 times slower to 4.2 times faster – MJIT versus TruffleRuby

Have you seen the MJIT benchmark results? Amazing, aren’t they? MJIT basically blows the other implementations out of the water! What were they doing all these years? That’s it, we’re done here right?

Well, not so fast, as you can infer from the title. But before we can get to what I take issue with in these particular benchmarks (you can of course jump ahead to the nice diagrams), we gotta get some introductions and some important benchmarking basics out of the way.

MJIT? Truffle Ruby? WTF is this?

MJIT is currently a branch of ruby on GitHub by Vladimir Makarov, a GCC developer, that implements a JIT (Just-In-Time compilation) for the most commonly used Ruby interpreter, CRuby. It’s by no means final; in fact it’s at a very early stage. Very promising benchmarking results were published on the 15th of June 2017, which are in major parts the subject of this blog post.

TruffleRuby is an implementation of Ruby on the GraalVM by Oracle Labs. It boasts impressive performance numbers as you can see in my latest great “Ruby plays Go Rumble”. It also implements a JIT, is known to take a bit of warmup, but comes out being ~8 times faster than Ruby 2.0 in the previously mentioned benchmark.

Before we go further…

I have enormous respect for Vladimir and think that MJIT is an incredibly valuable project. Realistically it might be one of our few shots to get a JIT into mainstream ruby. JRuby had a JIT and great performance for years, but never got picked up by the masses (topic for another day).

I’m gonna critique the way the benchmarks were done, but there might be reasons for that which I’m missing (I’ll point out the ones I know). After all, Vladimir has been programming for way longer than I’ve been alive and obviously knows more about language implementations than I do.

Plus, to repeat, this is not about the person or the project, just the way we do benchmarks. Vladimir, in case you are reading this 💚💚💚💚💚💚

What are we measuring?

When you see a benchmark in the wild, first you gotta ask “What was measured?” – and the what here comes in two flavors: code and time.

What code are we benchmarking?

It is important to know what code is actually being benchmarked, to see if that code is actually relevant to us or a good representation of a real life Ruby program. This is especially important if we want to use benchmarks as an indication of the performance of a particular ruby implementation.

When you look at the list of benchmarks provided in the README (and scroll up to the list of what they mean, or look at them) you can see that basically the top half are extremely micro benchmarks:

(Screenshot: the list of micro benchmarks from the README)

What’s benchmarked here are writes to instance variables, reading constants, empty method calls, while loops and the like. This is extremely micro – maybe interesting from a language implementor’s point of view, but not very indicative of real world ruby performance. The day looking up a constant becomes the performance bottleneck in Ruby will be a happy day. Also, how much of your code uses while loops?

A lot of the code there (omitting the super micro ones) isn’t exactly what I’d call typical ruby code. A lot of it is more a mixture of a script and C code. Lots of the benchmarks don’t define classes, use a lot of while and for loops instead of the more typical Enumerable methods, and sometimes there are even bitmasks.

Some of those constructs might have originated in optimizations, as they are apparently used in the general language benchmarks. That’s dangerous as well, though: mostly they are optimized for one specific platform, in this case CRuby. The fastest Ruby code on one platform can be way slower on other platforms, as it relies on implementation details (for instance TruffleRuby uses a different String implementation). This puts the other implementations at an inherent disadvantage.

The problem here goes a bit deeper: whatever is in a popular benchmark will inevitably be what implementations optimize for, and that should be as close to reality as possible. Hence, I’m excited to see what benchmarks the Ruby 3×3 project comes up with so that we have some new, more relevant benchmarks.

What time are we measuring?

This is truly my favorite part of this blog post and arguably the most important. As far as I know, the time measurements in the original benchmarks were done like this: /usr/bin/time -v ruby $script – which is one of my favorite benchmarking mistakes for programming languages commonly used for web applications. You can watch me go on about it for a bit here.

What’s the problem? Well, let’s analyze the times that make up the total time you measure when you just time the execution of a script: Startup, Warmup and Runtime.

(Diagram: total execution time split into startup, warmup and runtime)

  • Startup – the time until we get to do anything “useful” aka the Ruby Interpreter has started up and has parsed all the code. For reference, executing an empty ruby file with standard ruby takes 0.02 seconds for me, MJIT 0.17 seconds and for TruffleRuby it takes 2.5 seconds (there are plans to significantly reduce it though with the help of Substrate VM). This time is inherently present in every measured benchmark if you just time script execution.
  • Warmup – the time it takes until the program can operate at full speed. This is especially important for implementations with a JIT. On a high level what happens is they see which code gets called a lot and they try to optimize this code further. This process takes a lot of time and only after it is completed can we truly speak of “peak performance”. Warmup can be significantly slower than runtime. We’ll analyze the warmup times more further down.
  • Runtime – what I’d call “peak performance” – run times have stabilized. Most/all code has already been optimized by the runtime. This is the performance level that the code will run at for now and the future. Ideally, we want to measure this as 99.99%+ of the time our code will run in a warmed up already started state.

Interestingly, the startup/warmup times are acknowledged in the original benchmark but the way that they are dealt with simply lessens their effect but is far from getting rid of them: “MJIT has a very fast startup which is not true for JRuby and Graal Ruby. To give a better chance to JRuby and Graal Ruby the benchmarks were modified in a way that Ruby MRI v2.0 runs about 20s-70s on each benchmark”.

I argue that in the greater scheme of things, startup and warmup don’t really matter for benchmarks whose purpose is to see how implementations perform in a long lived process.

Why is that, though? Web applications for instance are usually long lived: we start our web server once and then it runs for hours, days, weeks. We only pay the cost of startup and warmup once in the beginning, but run for a much longer time until we shut the server down again. Normally servers should spend 99.99%+ of their time in the warmed up runtime “state”. This is a fact that our benchmarks should reflect, as we should look for what gives us the best performance for our hours/days/weeks of run time, not for the first seconds or minutes of starting up.

A little analogy here is a car. You wanna go 300 kilometers as fast as possible (straight line). Measuring as shown above is the equivalent of measuring maybe the first ~500 meters. Getting in the car, accelerating to top speed and maybe a bit of time on top speed. Is the car that’s fastest on the first 500 meters truly the best for going 300 kilometers at top speed? Probably not. (Note: I know little about cars)

What does this mean for our benchmark? Ideally we should eliminate startup and warmup time. We can do this by using a benchmarking library written in Ruby that also runs the benchmark a couple of times before actually taking measurements (warmup time). We’ll use my own little library, as it means no gem is required and it’s well equipped for the rather long run times.
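
The actual scripts live in the repo linked below; as a rough sketch of the idea (warm up first, only record timings afterwards – run_benchmark is just a stand-in for the code under test):

WARMUP_SECONDS  = 60
MEASURE_SECONDS = 60

# stand-in for the code under test – replace with the real benchmark body
def run_benchmark
  10_000.times.reduce(:+)
end

def now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

def run_for(seconds)
  times  = []
  finish = now + seconds
  while now < finish
    start = now
    run_benchmark
    times << (now - start)
  end
  times
end

run_for(WARMUP_SECONDS)             # discarded: lets the JIT do its work
measured = run_for(MEASURE_SECONDS) # kept: warm runtime performance
average  = measured.inject(:+) / measured.size
puts "#{measured.size} iterations, #{average.round(4)}s on average"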

But do startup and warmup truly never matter? They do matter. Most prominently they matter during development: starting the server, reloading code, running tests. For all of those you have to “pay” startup and warmup time. Also, if you develop a UI application or a CLI tool for end users, startup and warmup might be a bigger problem, as startup happens way more often – you can’t just warm it up before you take it into the load balancer. Tasks run periodically as a cronjob on your server also have to pay these costs.

So are there benefits to measuring with startup and warmup included? Yes. For one, it is important for the use cases mentioned above. Secondly, measuring with time -v gives you a lot more data:


tobi@speedy $ /usr/bin/time -v ~/dev/graalvm-0.25/bin/ruby pent.rb
Command being timed: "/home/tobi/dev/graalvm-0.25/bin/ruby pent.rb"
User time (seconds): 83.07
System time (seconds): 0.99
Percent of CPU this job got: 555%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.12
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1311768
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 57
Minor (reclaiming a frame) page faults: 72682
Voluntary context switches: 16718
Involuntary context switches: 13697
Swaps: 0
File system inputs: 25520
File system outputs: 312
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

You get lots of data, among which are memory usage, CPU usage, wall clock time and others that are also important for evaluating language implementations, which is why they are also included in the original benchmarks.

Setup

Before we (finally!) get to the benchmarks, the obligatory “This is the system I’m running this on”:

The ruby versions in use are MJIT as of this commit from the 25th of August, compiled with no special settings, graalvm 0.25 and 0.27 (more on that in a bit), as well as CRuby 2.0.0-p648 as a baseline.

All of this is run on my Desktop PC running Linux Mint 18.2 (based on Ubuntu 16.04 LTS) with 16 GB of memory and an i7-4790 (3.6 GHz, 4 GHz boost).


tobi@speedy ~ $ uname -a
Linux speedy 4.10.0-33-generic #37~16.04.1-Ubuntu SMP Fri Aug 11 14:07:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I feel it’s especially important to mention the setup here, as when I first did these benchmarks for Polyconf on my dual-core notebook, TruffleRuby had significantly worse results. I think graalvm benefits from the 2 extra cores for warmup etc., as the CPU usage across cores is also quite high.

You can check out the benchmarking script used etc. as part of this repo.

But… you promised benchmarks, where are they?

Sorry, I think the theory is more important than the benchmarks themselves, although they undoubtedly help illustrate the point. We’ll first get into why I chose the pent.rb benchmark as a subject and why I run it with a slightly older version of graalvm (no worries, the current version comes in later on). Then, finally, graphs and numbers.

Why this benchmark?

First of all, the original benchmarks were done with graalvm-0.22. Attempting to reproduce the results with the (at the time current) graalvm-0.25 proved difficult as a lot of them had already been optimized (and 0.22 contained some genuine performance bugs).

One that I could still reproduce the performance problems with was pent.rb and it also seemed like a great candidate to show that something is flawed. In the original benchmarks it is noted down as 0.33 times the performance of Ruby 2.0 (or well, 3 times slower). All my experience with TruffleRuby told me that this is most likely wrong. So, I didn’t choose it because it was the fastest one on TruffleRuby, but rather the opposite – it was the slowest one.

Moreover, while a lot of it isn’t exactly idiomatic ruby code to my mind (no classes, lots of global variables), it uses quite a lot of Enumerable methods such as each, collect, sort and uniq while refraining from bitmasks and the like. So I also felt that it’d make a comparatively good candidate.

The way the benchmark is run is basically the original benchmark put into a loop, so it is repeated a bunch of times and we can measure the times during warmup and later runtime to get an average of the runtime performance.

So, why am I running it on the old graalvm-0.25 version? Well, whatever is in a benchmark is gonna get optimized making the difference here less apparent.

We’ll run the new improved version later.

MJIT vs. graalvm-0.25

So on my machine the initial execution of the pent.rb benchmark (timing startup, warmup and runtime) took 15.05 seconds on TruffleRuby 0.25, while it took just 7.26 seconds with MJIT. That makes MJIT 2.1 times faster. Impressive!

What about when we account for startup and warmup, though? If we benchmark from within Ruby, startup time already goes away, as we can only start measuring inside ruby once the interpreter has started. Now for warmup: we run the code to benchmark in a loop for 60 seconds of warmup time and 60 seconds for measuring the actual runtime. I plotted the execution times of the first 15 iterations below (that’s about when TruffleRuby stabilizes):

Execution time of TruffleRuby and MJIT progressing over time – iteration by iteration.

As you can clearly see, TruffleRuby starts out a lot slower but picks up speed quickly, while MJIT stays more or less consistent. What’s interesting to see is that iterations 6 and 7 of TruffleRuby are slower again. Either it found a new optimization that took significant time to complete or a deoptimization had to happen as the constraints of a previous optimization were no longer valid. TruffleRuby stabilizes from there and reaches peak performance.

Running the benchmarks we get an average (warm) time for TruffleRuby of 1.75 seconds and for MJIT we get 7.33 seconds. Which means that with this way of measuring, TruffleRuby is suddenly 4.2 times faster than MJIT.

We went from 2.1 times slower to 4.2 times faster and we only changed the measuring method.

I like to present benchmarking numbers in iterations per second/minute (ips/ipm), as there “higher is better” and graphs are far more intuitive. Our execution times convert to 34.25 iterations per minute for TruffleRuby and 8.18 iterations per minute for MJIT (60 seconds divided by the average time per iteration). So now let’s have a look at our numbers converted to iterations per minute, comparing the initial measuring method and our new measuring method:

Results of timing the whole script execution (initial time) versus the average execution time warmed up.

You can see the stark contrast for TruffleRuby caused by the hefty warmup/long execution time during the first couple of iterations. MJIT on the other hand, is very stable. The difference is well within the margin of error.

Ruby 2.0 vs MJIT vs. graalvm-0.25 vs. graalvm-0.27

Well, I promised you more data and here is more data! This data set also includes CRuby 2.0 as the base line as well as the new graalvm.

                   initial time (s)   ipm of initial time   average (s)   ipm of average after warmup   standard deviation (of total)
CRuby 2.0                     12.30                  4.87         12.34                          4.86                          0.43%
TruffleRuby 0.25              15.05                  3.98          1.75                         34.25                          0.21%
TruffleRuby 0.27               8.81                  6.81          1.22                         49.36                          0.44%
MJIT                           7.26                  8.26          7.33                          8.18                          2.39%
Execution times by iteration in seconds. CRuby stops appearing because those were already all the iterations I had.

We can see that TruffleRuby 0.27 is already faster than MJIT in the first iteration, which is quite impressive. It’s also lacking the weird “getting slower” around the 6th iteration and as such reaches peak performance much faster than TruffleRuby 0.25. It also gets faster overall as we can see if we compare the “warm” performance of all 4 competitors:

Iterations per Minute after warmup as an average of our 4 competitors.

So not only did the warmup get much faster in TruffleRuby 0.27, the overall performance also increased quite a bit. It is now more than 6 times faster than MJIT. Of course, some of this is probably the TruffleRuby team tuning it to the existing benchmark, which reiterates my point that we do need better benchmarks.

As a last fancy graph for you I have the comparison of measuring the runtime through time versus giving it warmup time, then benchmarking multiple iterations:

Difference between measuring whole script execution versus letting implementations warmup.

CRuby 2 is quite consistent as expected, and TruffleRuby already manages respectable out-of-the-box performance but gets even faster. I hope this helps you see how the method of measuring can produce drastically different results.

Conclusion

So, what can we take away? Startup time and warmup are a thing, and you should think hard about whether those times are important for you and whether you want to measure them. For web applications, most of the time startup and warmup aren’t that important, as 99.99%+ of the time you’ll run with warm “runtime” performance.

Not only is what time we measure important, but also what code we measure. Benchmarks should be as realistic as possible so that they are as significant as possible. What a benchmark on the Internet checks most likely isn’t directly related to what your application does.

ALWAYS RUN YOUR OWN BENCHMARKS AND QUESTION WHAT CODE IS BENCHMARKED, HOW IT IS BENCHMARKED AND WHAT TIMES ARE TAKEN

(I had this in my initial draft, but I ended up quite liking it so I kept it around)

edit1: Added CLI tool specifically to where startup & warmup counts as well as a reference to Substrate VM for how TruffleRuby tries to combat it 🙂

edit2: Just scroll down a little to read an interesting comment by Vladimir

Video & Slides: How Fast is it Really? Benchmarking in Practice (Ruby)

My slides & video from visiting the excellent WRUG (Warsaw Ruby Users Group). The talk is a variation of the similarly named Elixir talk, but it is ever evolving and here more focused on Ruby. It mostly covers how to set up and run good benchmarks, traps you can fall into and tools you should use.

You can also have a look at the slides right here or at speakerdeck, slideshare or PDF.

Abstract

“What’s the fastest way of doing this?” – you might ask yourself during development. Sure, you can guess what’s fastest or how long something will take, but do you know? How long does it take to sort a list of 1 Million elements? Are tail-recursive functions always the fastest?

Benchmarking is here to answer these questions. However, there are many pitfalls around setting up a good benchmark and interpreting the results. This talk will guide you through, introduce best practices and show you some surprising benchmarking results along the way.

edit: If you’re interested there’s another iteration of this talk that I gave at the pivorakmeetup

Slides: Code, Comments, Concepts, Comprehension – Conclusion?

The following is the first part of my visit to Warsaw in April (sorry for the super late post!). As part of the visit, I also visited Visuality and spent an evening there giving a presentation and discussing the topics afterwards for a long time. We capped it off with some board games 😉 I had a great time and the discussions were super interesting.

The talk is a reworked old goldie (“Code is read many more times than written” / “Optimizing for Readability”) and is about writing readable code and keeping code readable. It’s evolved as I evolve – I learn new things, assign differing importance to different topics and discover entirely new important topics.

You can view the slides here or on speakerdeck, slideshare or PDF.

 

Slides: Ruby to Elixir – what’s great and what you might miss

This is a talk I gave at the Polyglot Tech Meetup in Berlin in April. It has parts of my previous Elixir talk, while adding a new perspective that more directly compares Ruby and Elixir. It’s also shorter, as it was used as an intro to a fun panel discussion.

Abstract

Elixir and Phoenix are known for their speed, but that’s far from their only benefit. Elixir isn’t just a fast Ruby and Phoenix isn’t just Rails for Elixir. Through pattern matching, immutable data structures and new idioms your programs can not only become faster but more understandable and maintainable. While we look at the upsides we’ll also have a look at what you might be missing and could be improved.

Slides

Don’t you Struct.new(…).new(…)

As I just happened upon it again, I gotta take a moment to talk about one of my most despised ruby code patterns: Struct.new(...).new – despised ever since I happened upon it for the first time.

Struct.new?

Struct.new is a convenient way to create a class with accessors:

2.2.2 :001 > Extend = Struct.new(:start, :length)
 => Extend 
2.2.2 :002 > instance = Extend.new(10, 20)
 => #<struct Extend start=10, length=20> 
2.2.2 :003 > instance.start
 => 10 
2.2.2 :004 > instance.length
 => 20

That’s neat, isn’t it? It is, but like much of Ruby’s power it needs to be wielded with caution.

Where Struct.new goes wrong – the second new

When you do Struct.new you create an anonymous class; another new on it creates an instance of that class. Therefore, Struct.new(...).new(...) creates an anonymous class and an instance of it at the same time. This is bad, as we create a whole class just to create a single instance of it! That is a capital waste.

As a one-off use it might be okay, for instance when you assign it to a constant. The sad part is that this is not the only way I’ve seen it used. As in the first case where I encountered it, I see it used inside code that is invoked frequently – some sort of hot loop with calculations where more than one value is needed to represent the result of the calculation. Here programmers sometimes seem to reach for Struct.new(...).new(...).

Why is that bad?

Well, it incurs unnecessary overhead – creating the class every single time isn’t needed. Not only that, as far as I understand it also creates new independent entries in the method cache. For JITed implementations (like JRuby) the methods would also be JITed independently. And it gets in the way of profiling, as you see lots of anonymous classes with only 1 to 10 method calls each.

But how bad is that performance hit? I wrote a little benchmark where an instance is created with 2 values and then those 2 values are read one time each. Once with Struct.new(...).new(...), once where the Struct.new(...) is saved in an intermediary constant. For fun and learning I threw in a similar usage with Array and Hash.

require 'benchmark/ips'

Benchmark.ips do |bm|
  bm.report "Struct.new(...).new" do
    value = Struct.new(:start, :end).new(10, 20)
    value.start
    value.end
  end

  SavedStruct = Struct.new(:start, :end)
  bm.report "SavedStruct.new" do
    value = SavedStruct.new(10, 20)
    value.start
    value.end
  end

  bm.report "2 element array" do
    value = [10, 20]
    value.first
    value.last
  end

  bm.report "Hash with 2 keys" do
    value = {start: 10, end: 20}
    value[:start]
    value[:end]
  end

  bm.compare!
end

I ran those benchmarks with CRuby 2.3. And the results – well, I was surprised how huge the impact really is. The “new-new” implementation is over 33 times slower than the SavedStruct equivalent, and about 54 times slower than the fastest solution (the 2 element array), although that’s also not my preferred solution.

Struct.new(...).new    137.801k (± 3.0%) i/s -    694.375k
SavedStruct.new      4.592M (± 1.7%) i/s -     22.968M
2 element array      7.465M (± 1.4%) i/s -     37.463M
Hash with 2 keys      2.666M (± 1.6%) i/s -     13.418M
Comparison:
2 element array:  7464662.6 i/s
SavedStruct.new:  4592490.5 i/s - 1.63x slower
Hash with 2 keys:  2665601.5 i/s - 2.80x slower
Struct.new(...).new:   137801.1 i/s - 54.17x slower
Benchmark in iterations per second (higher is better)

But that’s not all…

This is not just about performance, though. When people take this “shortcut” they also circumvent one of the hardest problems in programming: naming. What is that Struct with those values? Do the values have any connection at all, or were they just smashed together because it seemed convenient at the time? What’s missing is the identification of the core concept that these values represent – anything that says this is more than a clump of data with these two values. And on top of that, it performs very poorly.

So, please avoid using Struct.new(...).new(...) – use a better alternative. Don’t recreate a class over and over – give it a name, put it into a constant and enjoy a better understanding of the concepts and increased performance.

Benchmarking a Go AI in Ruby: CRuby vs. Rubinius vs. JRuby vs. Truffle/Graal

The world of Artificial Intelligences is often full of performance questions. How fast can I compute a value? How far can I look ahead in a tree? How many nodes can I traverse?

In Monte Carlo Tree Search one of the most defining questions is “How many simulations can I run per second?”. If you want to learn more about Monte Carlo Tree Search and its application to the board game Go I recommend you the video and slides of my talk about that topic from Rubyconf 2015.

Implementing my own AI – rubykon – in ruby of course isn’t going to get me the fastest implementation ever. It forces you to really do less and therefore make nice performance optimizations, though. That isn’t what this post is about either. Here I want to take a look at another question: “How fast can Ruby go?” Ruby is a language with surprisingly many well maintained implementations. Most prominently CRuby, Rubinius, JRuby and the newcomer JRuby + Truffle. How do they perform at this task?

The project

Rubykon is a relatively small project – right now the lib directory has less than 1200 lines of code (which includes a small benchmarking library… more on that later). It has no external runtime dependencies – not even the standard library. So it is very minimalistic and also tuned for performance.

Setup

The benchmarks were run with the pre-0.3.0 rubykon version on the 8th of November (sorry, writeups always take longer than you think!) with the following concrete ruby versions (versions slightly abbreviated in the rest of the post):

  • CRuby 1.9.3p551
  • CRuby 2.2.3p173
  • Rubinius 2.5.8
  • JRuby 1.7.22
  • JRuby 9.0.3.0
  • JRuby 9.0.3.0 run in server mode and with invoke dynamic enabled (denoted as + id)
  • JRuby + Truffle Graal with master from 2015-11-08 and commit hash fd2c179, running on graalvm-jdk1.8.0

You can find the raw data (performance numbers, concrete version outputs, benchmark results for different board sizes and historic benchmark results) in this file.

This was run on my pretty dated desktop PC (i7 870):


tobi@tobi-desktop ~ $ uname -a
Linux tobi-desktop 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
tobi@tobi-desktop ~ $ java -version
openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
tobi@tobi-desktop ~ $ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 30
Stepping:              5
CPU MHz:               1200.000
BogoMIPS:              5887.87
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

First benchmark: Simulation + Scoring on 19×19

This benchmark uses benchmark-ips to see how many playouts (simulation + scoring) can be done per second. This is basically the “evaluation function” of the Monte Carlo Method. Here we start with an empty board and then play random valid moves until there are no valid moves anymore and then we score the game. The performance of a MCTS AI is hugely dependent on how fast that can happen.
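
The setup looks roughly like this – a sketch only, with a stand-in for the actual rubykon playout call (the real benchmark script lives in the repo):

require 'benchmark/ips'

# stand-in – the real benchmark calls into rubykon to play out and score
# a full random 19x19 game
def playout(board_size)
  board_size * board_size
end

Benchmark.ips do |bm|
  bm.time   = 30 # seconds of measurement
  bm.warmup = 60 # seconds of warmup

  bm.report("19x19 playout (simulation + scoring)") { playout(19) }
end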

Benchmarks were run with a warmup time of 60 seconds and a run time of 30 seconds. The small black bars in the graph denote standard deviation. Results:

Full 19×19 playout, iterations per second (higher is better)
Ruby Version                      iterations per second   standard deviation
CRuby 1.9.3p551                                  44.952                8.90%
CRuby 2.2.3p173                                  55.403                7.20%
Rubinius 2.5.8                                   40.911                4.90%
JRuby 1.7.22                                     63.456               15.80%
JRuby 9.0.3.0                                    73.479                6.80%
JRuby 9.0.3.0 + invoke dynamic                  121.265               14.00%
JRuby + Truffle                                 192.420               14.00%

JRuby + Truffle runs on a slightly modified version of benchmark-ips. This is done because it is a highly optimizing and speculative runtime that leads to bad results after warmup. This is explained here.

Second benchmark: Full UCT Monte Carlo Tree Search with 1000 playouts

This benchmark does a full Monte Carlo Tree Search, meaning choosing a node to investigate, doing a full simulation and scoring there and then propagating the results back in the tree before starting over again. As the performance is mostly dependent on the playouts the graph looks a lot like the one above.

This uses benchmark-avg, which I wrote myself and (for now) still lives in the rubykon repository. Why a new benchmarking library? In short: I needed something for more “macro” benchmarks that gives nice output like benchmark-ips. Also, I wanted a benchmarking tool that plays nice with Truffle – which means doing warmup and run of a benchmark directly after one another, as detailed in this issue.

This uses a warmup time of 3 minutes and a run time of 2 minutes. Along with the iterations per minute, we have another graph depicting average run time.

MCTS on 19×19 with 1000 playouts, iterations per minute (higher is better)
MCTS on 19×19 with 1000 playouts, average run time (lower is better)
Ruby Version                      iterations per minute   average time (s)   standard deviation
CRuby 1.9.3p551                                    1.61              37.26                2.23%
CRuby 2.2.3p173                                    2.72              22.09                1.05%
Rubinius 2.5.8                                     2.10              28.52                2.59%
JRuby 1.7.22                                       3.94              15.23                1.61%
JRuby 9.0.3.0                                      3.70              16.23                2.48%
JRuby 9.0.3.0 + invoke dynamic                     7.02               8.55                1.92%
JRuby + Truffle                                    9.49               6.32                8.33%

Results here pretty much mirror the previous benchmark, although the standard deviation is smaller throughout, which might be because more non-random code execution is involved.

Otherwise the relative performance of the different implementations is more or less the same, with the notable exception of JRuby 1.7 performing better than 9.0 (without invoke dynamic). That could be an oddity, but it is also well within the margin of error for the first benchmark.

For the discussion below I’ll refer to this benchmark, as it ran on the same code for all implementations and has a lower standard deviation overall.

Observations

The most striking observation certainly is that JRuby + Truffle/Graal sits atop the benchmarks with a good margin. It’s not that surprising when you look at previous work done here suggesting speedups of 9x to 45x compared to CRuby. Here the speedup relative to CRuby is “just” 3.5, which teaches us to always run our own benchmarks.

It is also worth noting that Truffle at first was unexpectedly very slow (10 times slower than 1.9), so I opened an issue and reported that somewhat surprising lack of performance. Chris Seaton was quick to fix it and along the way he kept an amazing log of the things he did to diagnose it and make it faster. If you ever wanted to take a peek into the mind of a Ruby implementer – go ahead and read it!

At the same time I gotta say that the warmup time it takes has me a bit worried. This is a very small application with one very hot loop (generating the valid moves). It doesn’t even use the standard library. The warmup times are rather huge exactly for Truffle, and I made sure to call no other code in benchmark/avg, as this might deoptimize everything again. However, it is still at an early stage and I know they are working on it 🙂

Second, “normal” JRuby is faster than CRuby, which is not much of a surprise to me – in most benchmarks I do, JRuby comes up ~twice as fast as CRuby. So when it was only ~30% faster I was actually a bit disappointed, but then I remembered the --server -Xcompile.invokedynamic=true switches and enabled them. BOOM! Almost 2.6 times faster than CRuby! Almost 90% faster than JRuby without those switches.

Now you might ask: “Why isn’t this the default?” Well, it was the default. Optimizing takes time, and that slows down the startup time significantly – for, say, rails – which is why it was deactivated by default.

If I’m missing any of these magic switches for any of the other implementations please let me know and I’ll add them.

I’m also a bit sad to see Rubinius land somewhere between 1.9 and 2.2 performance-wise; I had higher hopes for its performance with some appropriate warmup time.

Opal is also notably missing; I couldn’t get it to run, but will try again in a next version to see what V8 can optimize here.

An important word of warning to conclude the high level look at the different implementations: These benchmarks are most likely not true for your application! Especially not for rails! Benchmark yourself 🙂

Now for another question that you probably have on your mind: “How fast is this compared to other languages/implementations?” See, that’s hard to answer. No serious Go engine does pure random playouts; they all use some heuristics that slow them down significantly. But they are still faster. Here’s some data from this computer go thread; all numbers refer to the 19×19 board size:

  • it is suggested that one should be able to do at least 100 000 playouts per second without heuristics
  • With light playouts Aya did 25 000 playouts in 2008
  • well known C engine pachi does 2000 heavy playouts per thread per second

Which leads us to the question…

Is this the end of the line for Ruby?

No, there are still a couple of improvements that I have in mind that can make it much faster. How much faster? I don’t know. I have this goal of 1000 playouts on 19×19 per second per thread in mind. It’s still way behind other languages, but hey we’re talking about Ruby here 😉

Some possible improvements:

  • Move generation can still be improved a lot: instead of always looking for new valid random moves, a list of valid moves could be kept around, but it’s tricky
  • Scoring can also be done faster by leveraging neighbouring cells, but it’s not the bottleneck (yet)
  • a very clever but less accurate data structure can be used for liberty counting
  • also, of course, actually parallelize it and run on multiple threads
  • I could also use an up to date CPU for a change 😉

Other than that, I’m also looking over to the ruby implementations to get better, optimize more and make it even faster. I have especially high hopes for JRuby and JRuby + Truffle here.

So in the future I’ll try to find out how fast this can actually get, which is a fun ride and has taught me a lot so far already! You should try playing the benchmark game for yourselves 🙂

The not so low cost of calling dynamically defined methods

There are a couple of mantras that exist across programming communities; one of them is to avoid duplication, or to keep it DRY. Programming languages equip us with different tools to avoid duplication. In Ruby a popular way to achieve this is metaprogramming: methods are dynamically defined to get rid of all duplication and we rejoice – yay! There might be other problems with metaprogrammed solutions, but at least we are sure that the performance is the same as if we’d written that duplicated code. Or are we?

As the title suggests, this post examines the performance of these metaprogrammed methods. If you are looking for method definition performance or pure call overhead, you’ll find that information in this post by Aaron Patterson.

Before we get into the details I want to quickly highlight that this is not some theoretical micro benchmark I pulled out of thin air. These examples are derived from performance improvements on an actual project. That work was done by my friend Jason R. Clark on this pull request over at Shoes 4. As he doesn’t have time to write it up – I get to, so let’s get to it!

Let’s look at some methods!

(Please be aware that this example is simplified – of course the real code has varying parts, most importantly the name of the instance variable, which is the reason why the code was metaprogrammed in the first place)

class FakeDimension
  def initialize(margin_start)
    @margin_start = margin_start
    @margin_start_relative = relative? @margin_start
  end

  def relative?(result)
    result.is_a?(Float) && result <= 1
  end

  def calculate_relative(result)
    (result * 100).to_i
  end

  define_method :full_meta do
    instance_variable_name = '@' + :margin_start.to_s
    value = instance_variable_get(instance_variable_name)
    value = calculate_relative value if relative? value
    value
  end

  IVAR_NAME = "@margin_start"
  define_method :hoist_ivar_name do
    value = instance_variable_get(IVAR_NAME)
    value = calculate_relative value if relative? value
    value
  end

  define_method :direct_ivar do
    value = @margin_start
    value = calculate_relative value if relative? value
    value
  end

  eval <<-CODE
  def full_string
    value = @margin_start
    value = calculate_relative value if relative? value
    value
  end
  CODE

  def full_direct
    value = @margin_start
    value = calculate_relative value if relative? value
    value
  end
end

Starting with the first define_method, these are all more or less the same method. We start with a fully metaprogrammed version, which even converts a symbol to an instance variable name, and end with the directly defined method without any metaprogramming. Now with all these methods being so similar you’d expect them all to have roughly the same performance characteristics, right? Well, such is not the case, as demonstrated by the following benchmark. I benchmarked these methods both for the case where the value is relative and for when it is not. The results are not too different – see the gist for details. Running the non-relative version on CRuby 2.2.2 with benchmark-ips I get the following results (higher is better):

           full_meta      1.840M (± 3.0%) i/s -      9.243M
     hoist_ivar_name      3.147M (± 3.3%) i/s -     15.813M
         direct_ivar      5.288M (± 3.1%) i/s -     26.553M
         full_string      6.034M (± 3.2%) i/s -     30.179M
         full_direct      5.955M (± 3.2%) i/s -     29.807M

Comparison:
         full_string:  6033829.1 i/s
         full_direct:  5954626.6 i/s - 1.01x slower
         direct_ivar:  5288105.5 i/s - 1.14x slower
     hoist_ivar_name:  3146595.7 i/s - 1.92x slower
           full_meta:  1840087.6 i/s - 3.28x slower

And look at that, the full_meta version is over 3 times slower than the directly defined method! Of course direct_ivar is also pretty close, but it’s an unrealistic scenario as the instance variable name is what is really changing. You can interpolate the string of the method definition in the full_string version, though. This achieves results as if the method had been directly defined. But what’s happening here?

It seems that there is a higher than expected cost associated with calling instance_variable_get, creating the necessary string and calling methods defined by define_method overall. If you want to keep the full performance but still alter the code you have to resort to the evil eval and stitch your code together in string interpolation. Yay.
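
For illustration, a hedged sketch of what that can look like once the instance variable name actually varies (not the actual Shoes 4 code – define_reader is made up):

class FakeDimension
  def self.define_reader(name)
    class_eval <<-CODE
      def #{name}
        value = @#{name}
        value = calculate_relative value if relative? value
        value
      end
    CODE
  end

  # defines a margin_start reader equivalent to full_string above
  define_reader :margin_start
end

FakeDimension.new(0.5).margin_start # => 50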

So what, do we all have to eval method definitions for metaprogramming now?

Thankfully no. The performance overhead is constant – if your method does more expensive calculations the overhead diminishes. This is the somewhat rare case of a method that doesn’t do much (even the slowest version can be executed almost 2 Million times per second) but is called a lot. It is one of the core methods when positioning UI objects in Shoes. Obviously we should also do the hard work and try not to call that method that often, we’re working on that and already made some nice progress. But, to quote Jason, “regardless what we do I think that Dimension is bound to always be in our hot path.”.

What about methods that do more though? Let’s take a look at an example where we have an object that has an array set as an instance variable and has a method that concatenates another array and sorts the result (full gist):

class Try

  def initialize(array)
    @array = array
  end

  define_method :meta_concat_sort do |array|
    value = instance_variable_get '@' + :array.to_s
    new_array = value + array
    new_array.sort
  end

  def concat_sort(array)
    new_array = @array + array
    new_array.sort
  end
end

We then benchmark those two methods with the same base array but two differently sized input arrays:

BASE_ARRAY        = [8, 2, 400, -4, 77]
SMALL_INPUT_ARRAY = [1, 88, -7, 2, 133]
BIG_INPUT_ARRAY   = (1..100).to_a.shuffle
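
The driver, sketched with benchmark-ips as in the earlier examples (the full gist may differ in details):

require 'benchmark/ips'

try = Try.new(BASE_ARRAY)

{ "Small input array" => SMALL_INPUT_ARRAY,
  "Big input array"   => BIG_INPUT_ARRAY }.each do |label, input|
  puts label
  Benchmark.ips do |bm|
    bm.report("meta_concat_sort") { try.meta_concat_sort(input) }
    bm.report("concat_sort")      { try.concat_sort(input) }
    bm.compare!
  end
end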

What’s the result?

Small input array
Calculating -------------------------------------
    meta_concat_sort    62.808k i/100ms
         concat_sort    86.143k i/100ms
-------------------------------------------------
    meta_concat_sort    869.940k (± 1.4%) i/s -      4.397M
         concat_sort      1.349M (± 2.6%) i/s -      6.805M

Comparison:
         concat_sort:  1348894.9 i/s
    meta_concat_sort:   869940.1 i/s - 1.55x slower

Big input array
Calculating -------------------------------------
    meta_concat_sort    18.872k i/100ms
         concat_sort    20.745k i/100ms
-------------------------------------------------
    meta_concat_sort    205.402k (± 2.7%) i/s -      1.038M
         concat_sort    231.637k (± 2.5%) i/s -      1.162M

Comparison:
         concat_sort:   231636.7 i/s
    meta_concat_sort:   205402.2 i/s - 1.13x slower

With the small input array the dynamically defined method is still over 50% slower than the non meta programmed method! When we have the big input array (100 elements) the meta programmed method is still 13% slower, which I still consider very significant.

I ran these with CRuby 2.2.2, in case you are wondering whether this is implementation specific. I ran the same benchmark with JRuby and got comparable results – JRuby is usually 1.2 to 2 times faster than CRuby overall, but the slowdowns were about the same.

So in the end, what does it mean? Always benchmark. Don’t blindly optimize calls like these as in the grand scheme of things they might not make a difference. This will only be really important for you if a method gets called a lot. If it is in your library/application, then replacing the meta programmed method definitions might yield surprising performance improvements.

UPDATE 1: Shortly after this post was published coincidentally JRuby 9.0.0.3.0 was released with improvements to the call speed of methods defined by define_method. I added the benchmarks to the comments of the gist. It is 7-15% faster for full_meta and hoist_ivar_name but now the direct_ivar is about as fast as its full_meta and full_string counterparts thanks to the optimizations!

UPDATE 2: I wrote a small benchmark about what I think is the bottleneck here – instance_variable_get. It is missing the slowest case but is still up to 3 times slower than direct access.