A talk about AlphGo and techniques it used with no prior knowledge required. Second talk of the Codemotion Berlin series, mostly the same talk I gave at Full Stack Fest. Something was cut/adjusted. A full recording from the Full Stack Fest version is available here.
This year AlphaGo shocked the world by decisively beating the strongest human Go player, Lee Sedol. An accomplishment that wasn’t expected for years to come. How did AlphaGo do this? What algorithms did it use? What advances in AI made it possible? This talk will briefly introduce the game of Go, followed by the techniques and algorithms used by AlphaGo to answer these questions.
Implementing my own AI – rubykon – in ruby of course isn’t going to get me the fastest implementation ever. It forces you to really do less and therefore make nice performance optimization, though. This isn’t about that either. Here I want to take a look at another question: “How fast can Ruby go?” Ruby is a language with surprisingly many well maintained implementations. Most prominently CRuby, Rubinius, JRuby and the newcomer JRuby + Truffle. How do they perform in this task?
Rubykon is a relatively small project – right now the lib directory has less than 1200 lines of code (which includes a small benchmarking library… more on that later). It has no external runtime dependencies – not even the standard library. So it is very minimalistic and also tuned for performance.
The benchmarks were run pre the 0.3.0 rubykon version on the 8th of November (sorry writeups always take longer than you think!) with the following concrete ruby versions (versions slightly abbreviated in the rest of the post):
JRuby 184.108.40.206 run in server mode and with invoke dynamic enabled (denoted as + id)
JRuby + Truffle Graal with master from 2015-11-08 and commit hash fd2c179, running on graalvm-jdk1.8.0
You can find the raw data (performance numbers, concrete version outputs, benchmark results for different board sizes and historic benchmark results) in this file.
This was run on my pretty dated desktop PC (i7 870):
tobi@tobi-desktop ~ $ uname -a
Linux tobi-desktop 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
tobi@tobi-desktop ~ $ java -version
openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
tobi@tobi-desktop ~ $ lscpu
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
CPU MHz: 1200.000
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
First benchmark: Simulation + Scoring on 19×19
This benchmark uses benchmark-ips to see how many playouts (simulation + scoring) can be done per second. This is basically the “evaluation function” of the Monte Carlo Method. Here we start with an empty board and then play random valid moves until there are no valid moves anymore and then we score the game. The performance of a MCTS AI is hugely dependent on how fast that can happen.
Benchmarks were run with a warmup time of 60 seconds and a run time of 30 seconds. The small black bars in the graph denote standard deviation. Results:
Second benchmark: Full UCT Monte Carlo Tree Search with 1000 playouts
This benchmark does a full Monte Carlo Tree Search, meaning choosing a node to investigate, doing a full simulation and scoring there and then propagating the results back in the tree before starting over again. As the performance is mostly dependent on the playouts the graph looks a lot like the one above.
This uses benchmark-avg, which I wrote myself and (for now) still lives in the rubykon repository. Why a new benchmarking library? In short: I needed something for more “macro” benchmarks that gives nice output like benchmark-ips. Also, I wanted a benchmarking tool that plays nice with Truffle – which means doing warmup and run of a benchmark directly after one another, as detailed in this issue.
This uses a warmup time of 3 minutes and a run time of 2 minutes. Along with the iterations per minute, we have another graph depicting average run time.
iterations per minute
average time (s)
JRuby 220.127.116.11 + invoke dynamic
JRuby + Truffle
Results here pretty much mirror the previous benchmark, although standard deviation is smaller throughout which might be because more non random code execution is involved.
Otherwise the relative performance of the different implementations is more or less the same, with the notable exception of JRuby 1.7 performing better than 9.0 (without invoke dynamic). That could be an oddity, but it is also well within the margin of error for the first benchmark.
For the discussion below I’ll refer to this benchmark, as it ran on the same code for all implementations and has a lower standard deviation overall.
The most striking observation certainly is JRuby + Truffle/Graal sits atop in the benchmarks with a good margin. It’s not that surprising when you look at previous work done here suggesting speedups of 9x to 45x as compared to CRuby. Here the speedup relative to CRuby is “just” 3.5 which teaches us to always run your own benchmarks.
At the same time I gotta say that the warmup time it takes has got me worried a bit. This is a very small application with one very hot loop (generating the valid moves). It doesn’t even use the standard library. The warmup times are rather huge exactly for Truffle and I made sure to call no other code in benchmark/avg as this might deoptimize everything again. However, it is still in an early stage and I know they are working on it 🙂
Second, “normal” JRuby is faster than CRuby which is not much of a surprise to me – in most benchmarks I do JRuby comes up ~twice as fast CRuby. So when it was only ~30% faster I was actually a bit disappointed, but then remembered the --server -Xcompile.invokedynamic=true switches and enabled them. BOOM! Almost 2.6 times faster than CRuby! Almost 90% faster than JRuby without those switches.
Now you might ask: “Why isn’t this the default?” Well, it was the default. Optimizing takes time and that slows down the startup time, for say rails, significantly which is why it was deactivated by default.
If I’m missing any of these magic switches for any of the other implementations please let me know and I’ll add them.
I’m also a bit sad to see rubinius somewhere between 1.9 and 2.2 performance wise, I had higher hopes for its performance with some appropriate warmup time.
Also opal is notably missing, I couldn’t get it to run but will try again in a next version to see what V8 can optimize here.
An important word of warning to conclude the high level look at the different implementations: These benchmarks are most likely not true for your application! Especially not for rails! Benchmark yourself 🙂
Now for another question that you probably have on your mind: “How fast is this compared to other languages/implementations?” See, that’s hard to answer. No serious Go engine does pure random playouts, they all use some heuristics slowing them down significantly. But, they are still faster. Here’s some data from this computer go thread, they all refer to the 19×19 board size:
it is suggested than one should be able to do at least 100 000 playouts per second without heuristics
With light playouts Aya did 25 000 playouts in 2008
well known C engine pachi does 2000 heavy playouts per thread per second
Which leads us to the question…
Is this the end of the line for Ruby?
No, there are still a couple of improvements that I have in mind that can make it much faster. How much faster? I don’t know. I have this goal of 1000 playouts on 19×19 per second per thread in mind. It’s still way behind other languages, but hey we’re talking about Ruby here 😉
Some possible improvements:
Move generation can still be improved a lot, instead of always looking for a new valid random moves a list of valid moves could be kept around, but it’s tricky
Scoring can also be done faster by leveraging neighbouring cells, but it’s not the bottleneck (yet)
a very clever but less accurate data structure can be used for liberty counting
also, of course, actually parallelize it and run on multiple threads
I could also use an up to date CPU for a change 😉
Other than that, I’m also looking over to the ruby implementations to get better, optimize more and make it even faster. I have especially high hopes for JRuby and JRuby + Truffle here.
So in the future I’ll try to find out how fast this can actually get, which is a fun ride and has taught me a lot so far already! You should try playing the benchmark game for yourselves 🙂
Go is a board game that is more than 2,500 years old (yes, this is not about the programming language!) and it is fascinating from multiple viewpoints. For instance, go bots still can’t beat professional players, unlike in chess.
This talk will show you what is so special about Go that computers still can’t beat humans. We will take a look at the most popular underlying algorithm and show you how the Monte Carlo method, basically random simulation, plays a vital role in conquering Go’s complexity and creating the strong Go bots of today.