Slides: Elixir, Your Monolith and You (Elixir Berlin Version)

I was supposed to give this talk at ElixirConf.Eu, but sadly fell ill. These are the slides (still titled alpha-1) that I used to give it at Elixir Berlin, where it was met with a great reception. That is also why I was so looking forward to giving it again and having it recorded… Anyhow, if you saw the talk and want to go through the slides again, or if you were looking forward to the slides – here they are.

Slides can be viewed here or on speakerdeck, slideshare or PDF

Abstract

Elixir is great, so clearly we’ll all rewrite our applications in Elixir. Mostly, you can’t and shouldn’t do that. This presentation will show you another path. You’ll see how at Liefery, we started with small steps instead of rewriting everything. This allowed us to reap the benefits earlier and get comfortable before getting deeper into it. We’ll examine in detail the tactics we used to create two Elixir apps for new requirements, and how we integrated them with our existing Rails code base.

Join us on our tale of adopting Elixir and Phoenix and see what we learned, what we loved, and what bumps we hit along the road.

Advertisements

benchee is now called bunny!

edit: This was an April Fools’ joke. However, bunny will remain functional. It’s only implemented as a thin wrapper around benchee, so unless we completely break the API (which I don’t see coming) it’ll remain functional. Continue reading for cute bunny pictures.

It is time for benchee to take the next step in its evolution as one of the prime benchmarking libraries. Going forward benchee will be called bunny!

IMG_20171207_084810_Bokeh-ANIMATION.gif
Al likes the naming change!

We waited for this very special day to announce this very special naming change – what better day to announce something is being named bunny than Easter Sunday?

It is available on hex.pm now!

But why?

We think this is an abstraction that’s really going to offer us all the flexibility that we’re going to need for future development. As we approach 1.0, we wanted to get the API just right.

This is true courage.

We also haven’t been exactly subtle dropping hints that this naming change was coming. For one, I have described benchmarking as bunnies eating food on numerous occasions (each bunny is a function that tries to eat its input as fast as it can!). Other than that, the frequently occurring bunny pictures (or even gifs) in benchee pull requests could have been a hint.

Also, eating is what they do best:

IMG_20180120_094003_Bokeh-ANIMATION
Yum yum we like benchmarking

For now bunny still works a lot like benchee. However, it exposes a better and more expressive API for your pleasure. You know, bunny can not only run like the good old benchee. No! Bunny can also sleep, hop, eat and jump!

This all comes with your own personal bunny assistant that helps you benchmark:

After all this hard work, the bunny needs to sleep a bit though:

IMG_20180216_102445-ANIMATION.gif

This is clearly better than any other (benchmarking) library out there. What are you waiting for? Go and get bunny now. Also, I mean… just LOOK AT THEM!

IMG_20180120_103418.jpg

IMG_20171221_144500.jpg

IMG_20171221_144657_Bokeh(1).jpg

Video & Slides: Stop Guessing and Start Measuring – Benchmarking in Practice (Lambda days)

I managed to get into Lambda days this year and got a chance to present my benchmarking talk. You can watch the video here and check out the slides.

Sadly the bunny video isn’t working in the recording 😥

You can see the slides here or at speakerdeck, slideshare or PDF.

Abstract

“What’s the fastest way of doing this?” – you might ask yourself during development. Sure, you can guess – but how do you know? How long would that function take with a million elements? Is that tail-recursive function always faster?

Benchmarking is here to give you the answers, but there are many pitfalls in setting up a good benchmark and analyzing the results. This talk will guide you through, introduce best practices, and surprise you with some results along the way. You didn’t think that the order of arguments could influence its performance…or did you?

The curious case of the query that gets slower the fewer elements it affects

I wrote a nice blog post for the company I’m working at (Liefery) called “The curious case of the query that gets slower the fewer elements it affects“, which goes through a real-world benchmarking case with benchee. It involves a couple of things that can go wrong and shows how combined indexes and PostgreSQL’s EXPLAIN ANALYZE can help you overcome these problems. It’s honestly one of the best blog posts I think I’ve ever written, so head over and read it if that sounds interesting to you 🙂

Slides: Where do Rubyists go?

I gave my first ever keynote yesterday at Ruby on Ice, which was a lot of fun. A lot of the talk is based on my “Where do Rubyists go?” survey, but also on researching and looking into the languages themselves. The talk looks into what programming languages Ruby developers learn for work or in their free time, what the major features of those languages are and how that compares to Ruby. What does it tell us about Ruby and our community?

Slides can be viewed here or on speakerdeck, slideshare or PDF

Abstract

Many Rubyists branch out and take a look at other languages. What are the similarities between those languages and Ruby? What are the differences? How does Ruby influence these languages?

Surprises with Nested Transactions, Rollbacks and ActiveRecord

Lately I acquired a new hobby. I went around and asked experienced Rails developers, whom I respect and value a lot, how many users the following script would create:
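
The script is essentially the nested-transaction example from the ActiveRecord documentation – something along these lines (assuming a User model with a username column):

User.transaction do
  User.create(username: "Kotori")

  # the "nested" transaction
  User.transaction do
    User.create(username: "Nemu")
    raise ActiveRecord::Rollback
  end
end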

The result should be the same on pretty much any database and any Rails version. For the sake of argument you can assume Rails 5.1 and Postgres 9.6 (what I tested it with).

So, how many users does it create? No one of the more than a handful of people I asked got the answer right (including myself).

The answer is 2.

Wait, WHAT?

Yup, you read that right. It creates 2 users; the rollback is effectively useless here. Ideally this should create one user (Kotori), but as some people know, nested transactions aren’t really a thing that databases support (save for MS-SQL, apparently). People I asked who knew this then guessed 0, because, well, if you can’t roll back a part of it, better safe than sorry and roll all of it back, right?

Well, sadly the inner transaction rescues the rollback and then the outer transaction happily commits all of it. 😦

Before you get all worried – if an exception is raised and not caught the outer transaction can’t commit and hence 0 users are created as expected.
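
To illustrate, a small variation of the script above – raising (and rescuing outside) a regular exception instead of ActiveRecord::Rollback really does create 0 users:

begin
  User.transaction do
    User.create(username: "Kotori")

    User.transaction do
      User.create(username: "Nemu")
      raise "something went wrong" # a regular exception, not ActiveRecord::Rollback
    end
  end
rescue RuntimeError
  # the exception bubbled up through both transaction blocks,
  # so the outer transaction rolled back – 0 users created
end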

A fix

So, what can we do? When opening a transaction, we can pass requires_new: true to the transaction which will emulate a “real” nested transaction using savepoints:
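
Again mirroring the documentation example, the only change is the option passed to the inner transaction:

User.transaction do
  User.create(username: "Kotori")

  # requires_new: true emulates a nested transaction via a savepoint,
  # so this rollback only undoes the work inside this block
  User.transaction(requires_new: true) do
    User.create(username: "Nemu")
    raise ActiveRecord::Rollback
  end
end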

As you’d expect this creates just one user.

Nah, doesn’t concern me I’d never write code like this!

Sure, you probably won’t write code like this straight up in a single file. However, split across multiple methods and files – I think so. You have one unit of business logic that you want to run in a transaction, and then you start reusing it in another method that’s also wrapped in a transaction. That happens more often than you think.

Plus, it can happen even more often than that, as every save operation is wrapped in its own transaction (for good reasons). That means that as soon as you save anything inside a transaction, or save/update records as part of a callback, you might run into this problem.

Here’s a small example highlighting the problem:
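
Roughly like this – a model with a (contrived) rollback flag that a callback turns into a rollback, used inside an outer transaction:

class User < ApplicationRecord
  # contrived: a boolean rollback attribute that triggers a rollback on save
  after_save :maybe_rollback

  private

  def maybe_rollback
    raise ActiveRecord::Rollback if rollback
  end
end

User.transaction do
  User.create(username: "Kotori")
  User.create(username: "Nemu", rollback: true)
end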

As you probably expect by now, this creates 2 users. And yes, I checked – if you run the create with rollback: true outside of the transaction, no user is created. Of course, you shouldn’t raise rollbacks in callbacks, but I’m sure that someone somewhere does it.

In case you want to play with this, all of these examples (+ more) are up at my rails playground.

The saddest part of this surprise…

Unless you stumbled across this before, chances are this is at least somewhat surprising to you. If you knew this before, kudos to you. The saddest part, though, is that this shouldn’t be a surprise to anyone. A lot of what is written here is part of the official documentation, including the exact example I used. The documentation introduces the example with the following wonderful sentence:

For example, the following behavior may be surprising:

As far as I can tell this documentation with the example has been there for more than 9 years, and fxn added the above sentence about 7 years ago.

Why do I even blog about this when it’s been in the official documentation all along? I think this deserves more attention, and more people should know about it to avoid truly bad surprises. The fact that nobody I asked knew the answer encouraged me to write this. We should all take more care to read the documentation of the software we use – we might find something interesting, you know.

What do we learn from this?

READ THE DOCUMENTATION!!!!

Are comments a code smell? Yes! No? It Depends.

Most people are either firmly on the “Yes!” or the “No!” side when it comes to discussing comments and their status as a code smell. But, as with most questions worth asking, the correct answer is rather “It depends”.

I got to re-examine this topic lately triggered by a tweet and a discussion with Devon:

So, let’s start unwrapping these layers, shall we?

Important distinction: Comments vs. Documentation

One of the first points on the list is understanding what a comment is and what it is not. For me, documentation isn’t a comment; in most languages (unfortunately) documentation just happens to be represented as a comment. Thankfully some languages, such as Elixir, Clojure and Rust, have a separate construct for documentation to make this obvious and facilitate working with documentation.

I don’t think everything should be documented. However, libraries definitely need documentation (if you want people to use them, that is). I’ve also grown increasingly fond of documentation in application code, especially as projects grow. At Liefery, core modules have a top-level “module” comment describing the business context, language, important collaborators etc. It has proven invaluable. One of my favorites is the description of the shipment state machine, which briefly summarizes what each state means – keeping all of those in your head has proven quite difficult. Plus, it’s a gift for new developers getting into the code base.

Of course documentation still suffers from one of the major drawbacks of comments – it can become outdated. Much less so if the documentation provides context rather than describing in detail what happens.

So, documentation for me isn’t a comment. Next up – what’s this code smell thing?

What’s a Code Smell?

In short, a code smell is an indication that something could be wrong with this code. Or to let Kent Beck (whose idea the term was) and Martin Fowler tell it in Refactoring:

(…) describing the “when” of refactoring in terms of smells. (…) we have learned to look for certain structures in the code that suggest (sometimes they scream for) the possibility of refactoring.

Does this description fit comments? Well, comments made the “original” list of code smells, with the following reasoning:

(…) comments often are used as a deodorant. It’s surprising how often you look at thickly commented code and notice that the comments are there because the code is bad.

They go on to explain what should be done instead of comments:

When you feel the need to write a comment, first try to refactor the code so that any comment becomes superfluous.

That is exactly in line with my view of code comments. There is so much more that you can do to make your code more readable instead of resorting to a comment. Comments should be a last resort.

To further explore this, let’s take a look at one of my favorite distinctions when it comes to “good” comments versus “bad” comments.

WHAT versus WHY comments

I like to think of comments in 2 categories:

  • WHAT comments describe what the code does. These can be high level, but sometimes they also tell you every little thing the code does (“iterates over, then… uses result to”).
  • WHY comments clarify why some code is the way it is, giving you a peek into the past and the reasons a decision was made.

Let’s start with the WHAT – what comments can almost always be replaced by more expressive code. Most of this has to do with proper naming and concepts, which is why it isn’t uncommon for me to spend an extended period of time on these. Hell, (coincidentally) Devon and I even spent hours on defining “Scenarios” in benchee.

Variables, methods, classes, modules… all of these communicate through their names. So spending a good amount of time naming them helps a lot. Often it is also the right call to extract one of these to keep the line count small and manageable, and naming the concept you just extracted helps the understanding of the overall code.

Let’s take a look at one of my favorite examples:
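
Picture something like this (a made-up stand-in, but the shape is what matters):

def report(users)
  # filter out the inactive users
  active = users.select { |user| user[:active] }

  # sort them by sign up date
  sorted = active.sort_by { |user| user[:signed_up_at] }

  # print a line for each user
  sorted.each do |user|
    puts "#{user[:name]} – signed up on #{user[:signed_up_at]}"
  end
end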

Let this stand in for every long method you ever came across where the method body was broken into sections by comments. Extract 3 methods, name them somewhat like the comments. Enjoy shorter methods, meaningful names, concepts and reusability.

I’ve even seen people advocating for this style of long methods with comments. Suffice it to say, I’m not a fan. The article says “The more complex the code, the more comments it should have.” and my colleague Tiago probably responded best to that:

You should make the code less complex not add more comments.

Another example I wish I made up, but it’s real (I only ported it from JavaScript to Ruby):
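
It looked roughly like this (a made-up reconstruction of its shape – cryptic parameter names explained only by the comments above them; the l standing for “time per step” is the one detail you’ll see referenced below):

# s - start value
# e - end value
# c - current step
# l - time per step
# d - total duration
def interpolate(s, e, c, l, d)
  progress = (c * l) / d.to_f
  s + (e - s) * progress
end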

As a first step, just rename your parameters to whatever understandable name was commented above them (also, how does l translate to “time per step”?). Afterwards, look for a bigger concept you might be missing and aggregate the needed data into it so you can trim down the number of parameters.

All in all, a WHAT style comment to my mind is a declaration of defeat – it’s an “I tried everything but I can’t make this code readable by itself”. You can be sure that before I get there I’ll first consult a colleague about it, and if we can’t come up with something better I’ll isolate the complexity and then be sad about my defeat.

With all of that about what comments, how about WHY comments?

They can help us with things that can hardly be expressed in code. Let’s take a little example from the great shoes project:
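
Paraphrased (this is not the actual shoes code, just its shape):

def eval_user_block(user_block)
  user_block.call
rescue => e
  # Whatever you do, do not remove this rescue: an uncaught exception from a
  # user supplied block would take the whole application down with it.
  # (The real comment also points to where to read up on the background.)
  puts "Exception in user provided block:"
  puts e.message
  puts e.backtrace
end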

While the puts statements communicate some of it, it is important to emphasize how dangerous not rescuing here is. The comment also helps establish context and points to where one could find more information about this.

This is an excellent use case for a comment and thankfully Kent Beck and Martin Fowler agree (again from the Refactoring book):

A comment is a good place to say why you did something. This kind of information helps future modifiers, especially forgetful ones.

There is an argument to be made that such information should be kept in the version control system and not in a comment. It is true: the commit message should definitely reflect this, ideally with an easy to produce link both to the ticket and pull request. However, a commit message alone is not enough to my mind. Tracking down the commit that introduced a change in an older code base can be quite hard (ever tried changing all strings from single quotes to double quotes? 😉 ) and you can’t expect everyone to always look at the history of every line of code they change. A comment acts as a warning sign in places like these.

In short: WHY comments “yay“! WHAT comments “nay“!

Context matters

Before we get to the final “verdict” there’s one more aspect I’d like to examine: the context of your application. That context might greatly influence the need for comments. Another CRUD application like the ones you built before? It probably doesn’t need many comments. That new machine learning micro service written in Python and deployed with docker, while no one on your team has done any of these things before? Yup, that probably needs a couple more comments.

New business domain, new framework, new language, something out of your comfort zone, the experience level of developers – all of these can justify writing more comments. Those can give context, link to resources, or be WHAT comments describing on a high level what’s going on, and so on. For instance, our route planning code has quite a few more comments explaining the used algorithms and data structures on a high level than the rest of the code base.

Yadda yadda – are comments a code smell or not?

As already established – it’s not as black and white as some people make it seem. To get back to the original twitter conversation that started all this:

For a shorter answer, I think Robert Martin also puts it quite well and succinctly in Clean Code:

The proper use of comments is to compensate for our failure to express ourself in code.

What about me? Well, if you asked me “Are comments a code smell?” on the street the answer would probably be “Yes”, the better answer would be “It depends.”, and the good answer, short of this whole blog post, would be something along the lines of:

There’s a difference between documentation, which is often good, and comments. WHY comments highlighting reasoning are valuable. WHAT comments explaining the code itself can often be replaced by more expressive code. Only when I admit defeat will I write a WHAT comment.

(these days this even fits in a single tweet 😉 )

edit: As friends happily pointed out, documentation is also a construct different from code comments in Clojure and Rust. Added that in.

Released: benchee 0.10, HTML, CSV and JSON plugins

It’s been a little while since the last benchee release – have we been lazy? Au contraire mes amis! We’ve been hard at work, greatly improving the internals, adding a full system of hooks (before_scenario, before_each, after_each, after_scenario) and some other great improvements thanks to many contributions. The releases are benchee 0.10.0 (CHANGELOG), benchee_csv 0.7.0 (CHANGELOG), benchee_html 0.4.0 (CHANGELOG) and benchee_json 0.4.0 (CHANGELOG).

Sooo… what’s up? Why did it take so long?

benchee

Before we take a look at the exciting new features, here’s a small summary of major things that happened in previous releases that I didn’t manage to blog about due to lack of time:

0.7.0 added mainly convenience features, but benchee_html 0.2.0 split up the HTML reports, which made it easier to find what you’re looking for and also alleviated problems with rendering huge data sets (the graphing library was reaching its limits with that many graphs and input values)

0.8.0 added type specs for the major public functions; configuration is now a struct, so it errors out on unrecognized options

0.9.0 is one of my favorite releases, as it now gathers and shows system data like the number of cores, operating system, memory and CPU speed. I love this, because normally when I benchmark and write about it I need to note all of that down in the blog post. Now with benchee I can just copy & paste the output and I get all the information that I need! This version also facilitates calling benchee from Erlang, so benchee:run is in the cards.

Now ahead, to the truly new stuff:

Scenarios

In benchee each processing step used to have its own main key in the main data structure (the suite): run_times, statistics, jobs etc. Philosophically, that was great. However, it got more cumbersome in the formatters, especially after the introduction of inputs, as access now required an additional level of indirection (namely, the input). As a result, to get all the data for the combination of job and input you want to format, you had to merge data from multiple different sources. Not exactly ideal. To make matters worse, we want to add memory measurements in the future… even more to merge.

Long story short, Devon and I sat down in person for 2 hours to discuss how to best deal with this, how to name it and all accompanying fields. We decided to keep all the data together from now on – for every entry of the result. That means each combination of a job you defined and an input. The data structure now keeps that along with its raw run times, statistics etc. After some research we settled on calling it a scenario.

This was a huge refactoring but we really like the improvements it yielded. Devon wrote about the refactoring process in more detail.

It took a long time, but it didn’t add any new features – so no reason for a release yet. Plus, of course all formatters also needed to get updated.

Hooks

Another huge chunk of work went into a hooks system that is pretty fully featured. It allows you to execute code before and after invoking the benchmark as well as setup code before a scenario starts running and teardown code for after a scenario stopped running.

That seems weird, as most of the time you won’t need hooks. We could have released with only part of the system ready, but I didn’t want to (potentially) break the API again so soon if we added arguments or found that it wasn’t quite working to our liking. So, we took some time to get everything in.

So what did we want to enable you to do?

  • Load a record from the database in before_each and pass it to the benchmarking function, to perform an operation with it without counting the time for loading the record towards the benchmarking results
  • Start up a process/service in before_scenario that you need for your scenario to run, and then…
  • …shut it down again in after_scenario, or bust a cache
  • Or if you want your benchmarks to run without a cache all the time, you can also bust it in before_each or after_each
  • after_each is also passed the return value of the benchmarking function so you can run assertions on it – for instance for all the jobs to see if they are truly doing the same thing
  • before_each could also be used to randomize the input a bit to benchmark a more diverse set of inputs without the randomizing counting towards the measured times

All of these hooks can be configured either globally, so that they run for all the benchmarking jobs, or on a per-job basis. The documentation for hooks over at the repo is a little blog post by itself and I won’t repeat it here 😉

As a little example, here is me benchmarking hound:

Hound needs to start before we can benchmark it. However, hound seems to remember the started process by the pid of self() at that time. That’s a problem, because each benchee scenario runs in its own process, so you couldn’t just start it before invoking Benchee.run. I found no way to make the benchmark work with good old benchee 0.9.0, which is also what finally brought me to implement this feature. Now in benchee 0.10.0, with before_scenario and after_scenario, it is perfectly feasible!

Why no 1.0?

With all the major improvements, one could easily call this a 1.0. Or 0.6.0 could have been a 1.0, then we’d be at 2.0 now – wow, that sounds mature!

Well, I see 1.0 as a promise – a promise to plugin developers and others that compatibility won’t be broken easily or soon. I can’t promise this when we just broke plugin compatibility in a major way. That said, I really feel good about the new structure, partly because we put so much time and thought into figuring it out, but also because it has greatly simplified some implementations and, thinking about some future features, it makes them a lot easier to implement as well.

Of course, we didn’t break compatibility for users. That has been stable since 0.6.0 and to a (quite big) extent beyond that.

So, 1.0 will of course come some time. We might get some more big features in that could break compatibility (although I don’t think they will, they’ll probably just add new fields):

  • Measuring memory consumption
  • recording and loading benchmarking results
  • … ?

Also, before a 1.0 release I probably want to extract more functionality that isn’t directly benchmarking related from benchee and provide it as general purpose libraries. We have some sub systems that we built for ourselves and that would provide value to other applications:

  • Unit: convert units (durations, counts, memory etc.), scale them to a “best fit” unit, format them accordingly, find a best fit unit for a collection of values
  • Statistics: All the statistics we provide including not so easy/standard ones like nth percentile and mode
  • System: gather system data like elixir/erlang version, CPU, Operating System, memory, number of cores

Thanks to the design of benchee these are all already fairly separate so extracting them is more a matter of when, not how. Meaning, that we have all the functionality in those libraries that we need so that we don’t have to make a coordinated release for new features across n libraries.

benchee_html

Selection_045.png

Especially due to many great community contributions (maybe because of Hacktoberfest?) there are a number of stellar improvements!

  • System information is now also available and you can toggle it with the link in the top right
  • unit scaling from benchee “core” is now also used, so results aren’t all in microseconds as before but in an appropriate unit
  • reports are automatically opened in your browser after the formatter is done (can of course be deactivated)
  • there is a default file name now so you don’t HAVE to supply it

What’s next?

Well, this release took long – I hope the next one won’t take as long. There’s a couple of improvements that didn’t quite make it into this release, so there might be a smaller new release relatively soon. Other than that, work on either serializing results or the often requested “measure memory consumption” will probably start some time. But first, we rest a bit 😉

Hope you enjoy benchmarking, and if you are missing a feature or getting hit by a bug, please open an issue!

Careful what you measure: 2.1 times slower to 4.2 times faster – MJIT versus TruffleRuby

Have you seen the MJIT benchmark results? Amazing, aren’t they? MJIT basically blows the other implementations out of the water! What were they doing all these years? That’s it, we’re done here right?

Well, not so fast, as you can infer from the title. But before we can get to what I take issue with in these particular benchmarks (you can of course jump ahead to the nice diagrams), we gotta get some introductions and some important benchmarking basics out of the way.

MJIT? Truffle Ruby? WTF is this?

MJIT is currently a branch of Ruby on GitHub by Vladimir Makarov, a GCC developer, that implements a JIT (Just In Time compilation) in the most commonly used Ruby interpreter, CRuby. It’s by no means final; in fact it’s at a very early stage. Very promising benchmarking results were published on the 15th of June 2017, which are in major parts the subject of this blog post.

TruffleRuby is an implementation of Ruby on the GraalVM by Oracle Labs. It posts impressive performance numbers, as you can see in my latest great “Ruby plays Go Rumble”. It also implements a JIT, is known to take a bit of warmup, but comes out being ~8 times faster than Ruby 2.0 in the previously mentioned benchmark.

Before we go further…

I have enormous respect for Vladimir and think that MJIT is an incredibly valuable project. Realistically, it might be one of our few shots at getting a JIT into mainstream Ruby. JRuby has had a JIT and great performance for years, but never got picked up by the masses (topic for another day).

I’m gonna critique the way the benchmarks were done, but there might be reasons for that which I’m missing (I’ll point out the ones I know). After all, Vladimir has been programming for way longer than I’ve even been alive and obviously knows more about language implementations than I do.

Plus, to repeat, this is not about the person or the project, just the way we do benchmarks. Vladimir, in case you are reading this 💚💚💚💚💚💚

What are we measuring?

When you see a benchmark in the wild, first you gotta ask “What was measured?” – the what here comes in two flavors: code and time.

What code are we benchmarking?

It is important to know what code is actually being benchmarked, to see if that code is actually relevant to us or a good representation of a real life Ruby program. This is especially important if we want to use benchmarks as an indication of the performance of a particular ruby implementation.

When you look at the list of benchmarks provided in the README (and scroll up to the list of what they mean, or look at them) you can see that basically the top half are extremely micro benchmarks:

Selection_041.png

What’s benchmarked here are writes to instance variables, reading constants, empty method calls, while loops and the like. This is extremely micro – maybe interesting from a language implementor’s point of view, but not very indicative of real world Ruby performance. The day looking up a constant is the performance bottleneck in Ruby will be a happy day. Also, how much of your code uses while loops?

A lot of the code there (omitting the super micro ones) isn’t exactly what I’d call typical Ruby code. A lot of it is more a mixture of a script and C code. Lots of the benchmarks don’t define classes, use a lot of while and for loops instead of the more typical Enumerable methods, and sometimes there are even bitmasks.

Some of those constructs might have originated in optimizations, as they are apparently used in the general language benchmarks. That’s dangerous as well, though: mostly they are optimized for one specific platform, in this case CRuby. The fastest Ruby code on one platform can be way slower on other platforms, as it relies on implementation details (for instance TruffleRuby uses a different String implementation). This puts the other implementations at an inherent disadvantage.

The problem here goes a bit deeper: whatever is in a popular benchmark will inevitably be what implementations optimize for, and that should be as close to reality as possible. Hence, I’m excited to see what benchmarks the Ruby 3×3 project comes up with, so that we have some new, more relevant benchmarks.

What time are we measuring?

This is truly my favorite part of this blog post and arguably the most important. For all that I know, the time measurements in the original benchmarks were done like this: /usr/bin/time -v ruby $script – which is one of my favorite benchmarking mistakes for programming languages commonly used for web applications. You can watch me go on about it for a bit here.

What’s the problem? Well, let’s analyze the times that make up the total time you measure when you just time the execution of a script: Startup, Warmup and Runtime.

Selection_043.png

  • Startup – the time until we get to do anything “useful” aka the Ruby Interpreter has started up and has parsed all the code. For reference, executing an empty ruby file with standard ruby takes 0.02 seconds for me, MJIT 0.17 seconds and for TruffleRuby it takes 2.5 seconds (there are plans to significantly reduce it though with the help of Substrate VM). This time is inherently present in every measured benchmark if you just time script execution.
  • Warmup – the time it takes until the program can operate at full speed. This is especially important for implementations with a JIT. On a high level what happens is they see which code gets called a lot and they try to optimize this code further. This process takes a lot of time and only after it is completed can we truly speak of “peak performance”. Warmup can be significantly slower than runtime. We’ll analyze the warmup times more further down.
  • Runtime – what I’d call “peak performance” – run times have stabilized. Most/all code has already been optimized by the runtime. This is the performance level that the code will run at for now and the future. Ideally, we want to measure this as 99.99%+ of the time our code will run in a warmed up already started state.

Interestingly, the startup/warmup times are acknowledged in the original benchmark but the way that they are dealt with simply lessens their effect but is far from getting rid of them: “MJIT has a very fast startup which is not true for JRuby and Graal Ruby. To give a better chance to JRuby and Graal Ruby the benchmarks were modified in a way that Ruby MRI v2.0 runs about 20s-70s on each benchmark”.

I argue that in the greater scheme of things, startup and warmup don’t really matter when we are talking about benchmarks when our purpose is to see how they perform in a long lived process.

Why is that, though? Web applications for instance are usually long lived, we start our web server once and then it runs for hours, days, weeks. We only pay the cost of startup and warmup once in the beginning, but run it for a much longer time until we shut the server down again. Normally servers should spend 99.99%+ of their time in the warmed up runtime “state”. This is a fact, that our benchmarks should reflect as we should look for what gives us the best performance for our hours/days/weeks of run time, not for the first seconds or minutes of starting up.

A little analogy here is a car. You wanna go 300 kilometers as fast as possible (straight line). Measuring as shown above is the equivalent of measuring maybe the first ~500 meters. Getting in the car, accelerating to top speed and maybe a bit of time on top speed. Is the car that’s fastest on the first 500 meters truly the best for going 300 kilometers at top speed? Probably not. (Note: I know little about cars)

What does this mean for our benchmark? Ideally we should eliminate startup and warmup time. We can do this by using a benchmarking library written in Ruby that also runs the benchmark a couple of times before actually taking measurements (warmup time). We’ll use my own little library, as it means no gem is required and it’s well equipped for the rather long run times.
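
As a rough sketch of the idea in plain Ruby (not the library’s actual API – the names and the way pent.rb is invoked here are just illustrative):

WARMUP_SECONDS  = 60
MEASURE_SECONDS = 60

# runs the given block over and over for `seconds`, returning individual run times
def run_for(seconds)
  run_times = []
  finish = Time.now + seconds
  while Time.now < finish
    start = Time.now
    yield
    run_times << Time.now - start
  end
  run_times
end

benchmark = -> { load "pent.rb" } # re-executes the benchmark script each time

run_for(WARMUP_SECONDS) { benchmark.call }               # discarded – lets the JIT warm up
run_times = run_for(MEASURE_SECONDS) { benchmark.call }  # the "warm" measurements we keep

average = run_times.inject(:+) / run_times.size
puts "#{run_times.size} iterations, average #{average.round(2)}s per iteration"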

But do startup and warmup truly never matter? They do matter. Most prominently they matter during development time – starting the server, reloading code, running tests. For all of those you gotta “pay” startup and warmup time. Also, if you develop a UI application or a CLI tool for end users, startup and warmup might be a bigger problem, as startup happens way more often. You can’t just warm it up before you take it into the load balancer. Also, tasks running periodically as a cronjob on your server will have to pay these costs.

So are there benefits to measuring with startup and warmup included? Yes. For one, for the use cases mentioned above it is important. Secondly, measuring with time -v gives you a lot more data:


tobi@speedy $ /usr/bin/time -v ~/dev/graalvm-0.25/bin/ruby pent.rb
Command being timed: "/home/tobi/dev/graalvm-0.25/bin/ruby pent.rb"
User time (seconds): 83.07
System time (seconds): 0.99
Percent of CPU this job got: 555%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.12
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1311768
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 57
Minor (reclaiming a frame) page faults: 72682
Voluntary context switches: 16718
Involuntary context switches: 13697
Swaps: 0
File system inputs: 25520
File system outputs: 312
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

You get lots of data, among which are memory usage, CPU usage, wall clock time and others, which are also important for evaluating language implementations – which is why they are also included in the original benchmarks.

Setup

Before we (finally!) get to the benchmarks, the obligatory “This is the system I’m running this on”:

The Ruby versions in use are MJIT as of this commit from the 25th of August compiled with no special settings, GraalVM 0.25 and 0.27 (more on that in a bit), as well as CRuby 2.0.0-p648 as a baseline.

All of this is run on my Desktop PC running Linux Mint 18.2 (based on Ubuntu 16.04 LTS) with 16 GB of memory and an i7-4790 (3.6 GHz, 4 GHz boost).


tobi@speedy ~ $ uname -a
Linux speedy 4.10.0-33-generic #37~16.04.1-Ubuntu SMP Fri Aug 11 14:07:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I feel it’s especially important to mention the setup here, as when I first did these benchmarks for Polyconf on my dual core notebook, TruffleRuby had significantly worse results. I think GraalVM benefits from the 2 extra cores for warmup etc., as the CPU usage across cores is also quite high.

You can check out the benchmarking script used etc. as part of this repo.

But… you promised benchmarks, where are they?

Sorry, I think the theory is more important than the benchmarks themselves, although they undoubtedly help illustrate the point. We’ll first get into why I chose the pent.rb benchmark as a subject and why I run it with a slightly older version of GraalVM (no worries, the current version comes in later on). Then, finally, graphs and numbers.

Why this benchmark?

First of all, the original benchmarks were done with graalvm-0.22. Attempting to reproduce the results with the (at the time current) graalvm-0.25 proved difficult as a lot of them had already been optimized (and 0.22 contained some genuine performance bugs).

One where I could still reproduce the performance problems was pent.rb, and it also seemed like a great candidate to show that something is flawed. In the original benchmarks it is noted down as 0.33 times the performance of Ruby 2.0 (or well, 3 times slower). All my experience with TruffleRuby told me that this is most likely wrong. So, I didn’t choose it because it was the fastest one on TruffleRuby, but rather the opposite – it was the slowest one.

Moreover, while a lot of it isn’t exactly idiomatic Ruby code to my mind (no classes, lots of global variables), it uses quite a lot of Enumerable methods such as each, collect, sort and uniq, while refraining from bitmasks and the like. So I also felt that it’d make a comparatively good candidate from that perspective.

The way the benchmark is run is basically the original benchmark put into a loop, so it is repeated a bunch of times and we can measure the times during warmup and later runtime to get an average of the runtime performance.

So, why am I running it on the old graalvm-0.25 version? Well, whatever is in a benchmark is gonna get optimized making the difference here less apparent.

We’ll run the new improved version later.

MJIT vs. graalvm-0.25

So on my machine the initial execution of the pent.rb benchmark (timing startup, warmup and runtime) took 15.05 seconds on TruffleRuby 0.25, while it took just 7.26 seconds with MJIT. That makes MJIT 2.1 times faster. Impressive!

What happens when we account for startup and warmup, though? If we benchmark from within Ruby, startup time already goes away, as we can only start measuring once the interpreter has started. As for warmup, we run the code to benchmark in a loop for 60 seconds of warmup time and then 60 seconds for measuring the actual runtime. I plotted the execution times of the first 15 iterations below (that’s about when TruffleRuby stabilizes):

2_warmup.png
Execution time of TruffleRuby and MJIT progressing over time – iteration by iteration.

As you can clearly see, TruffleRuby starts out a lot slower but picks up speed quickly, while MJIT stays more or less consistent. What’s interesting to see is that iterations 6 and 7 of TruffleRuby are slower again. Either it found a new optimization that took significant time to complete, or a deoptimization had to happen as the constraints of a previous optimization were no longer valid. TruffleRuby stabilizes from there and reaches peak performance.

Running the benchmarks we get an average (warm) time for TruffleRuby of 1.75 seconds and for MJIT we get 7.33 seconds. Which means that with this way of measuring, TruffleRuby is suddenly 4.2 times faster than MJIT.

We went from 2.1 times slower to 4.2 times faster and we only changed the measuring method.

I like to present benchmarking numbers in iterations per second/minute (ips/ipm), as there “higher is better”, so graphs are far more intuitive. Our execution times converted are 34.25 iterations per minute for TruffleRuby and 8.18 iterations per minute for MJIT. So now have a look at our numbers converted to iterations per minute, compared for the initial measuring method and our new measuring method:

2_comparison_before_after.png
Results of timing the whole script execution (initial time) versus the average execution time warmed up.

You can see the stark contrast for TruffleRuby caused by the hefty warmup/long execution time during the first couple of iterations. MJIT on the other hand, is very stable. The difference is well within the margin of error.

Ruby 2.0 vs MJIT vs. graalvm-0.25 vs. graalvm-0.27

Well, I promised you more data and here is more data! This data set also includes CRuby 2.0 as the baseline, as well as the new GraalVM.

Implementation    | initial time (s) | ipm of initial time | average (s) | ipm of average after warmup | Standard Deviation as part of total
CRuby 2.0         | 12.3             | 4.87                | 12.34       | 4.86                        | 0.43%
TruffleRuby 0.25  | 15.05            | 3.98                | 1.75        | 34.25                       | 0.21%
TruffleRuby 0.27  | 8.81             | 6.81                | 1.22        | 49.36                       | 0.44%
MJIT              | 7.26             | 8.26                | 7.33        | 8.18                        | 2.39%
4_warmup.png
Execution times by iteration, in seconds. CRuby stops appearing because those were already all the iterations I had.

We can see that TruffleRuby 0.27 is already faster than MJIT in the first iteration, which is quite impressive. It’s also lacking the weird “getting slower” around the 6th iteration and as such reaches peak performance much faster than TruffleRuby 0.25. It also gets faster overall as we can see if we compare the “warm” performance of all 4 competitors:

4_comparison.png
Iterations per Minute after warmup as an average of our 4 competitors.

So not only did the warmup get much faster in TruffleRuby 0.27, the overall performance also increased quite a bit. It is now more than 6 times faster than MJIT. Of course, some of it is probably the TruffleRuby team tuning for the existing benchmark, which reiterates my point that we do need better benchmarks.

As a last fancy graph for you I have the comparison of measuring the runtime through time versus giving it warmup time, then benchmarking multiple iterations:

4_comparison_before_after.png
Difference between measuring whole script execution versus letting implementations warmup.

CRuby 2.0 is quite consistent as expected, and TruffleRuby already manages respectable out-of-the-box performance but gets even faster. I hope this helps you see how the method of measuring can achieve drastically different results.

Conclusion

So, what can we take away? Startup time and warmup are a thing, and you should think hard about whether those times are important for you and whether you want to measure them. For web applications, most of the time startup and warmup aren’t that important, as 99.99%+ of the time you’ll run with warm “runtime” performance.

Not only is what time we measure important, but also what code we measure. Benchmarks should be as realistic as possible so that they are as significant as possible. What a benchmark on the Internet checks most likely isn’t directly related to what your application does.

ALWAYS RUN YOUR OWN BENCHMARKS AND QUESTION WHAT CODE IS BENCHMARKED, HOW IT IS BENCHMARKED AND WHAT TIMES ARE TAKEN

(I had this in my initial draft, but I ended up quite liking it so I kept it around)

edit1: Added CLI tool specifically to where startup & warmup counts as well as a reference to Substrate VM for how TruffleRuby tries to combat it 🙂

edit2: Just scroll down a little to read an interesting comment by Vladimir