Monday, December 29, 2014

CR is fully stalled, new things to come in 2015

As New Year's resolutions go, I want to state publicly my interests and my willingness to work on my free/OSS projects.

I worked on two projects in the last year: CodeRefractor and, at the end of the year, a more interesting (details will follow later) Java game.

This other project is JHeroes 2, a port of FHeroes 2 to Java. FHeroes 2 is a C++ project on top of the SDL library, and the main advantage of using Java for JHeroes 2 was to make it run at high resolutions on laptops with weak CPUs (like mine :) ). For this I relied on Java's JavaFX technology, which is similar to WPF in the .Net world.

I succeeded in loading a level and the graphics from the original Heroes 2 game (by porting the FHeroes 2 core code to Java), and when it comes to displaying, the frame rates are visibly higher (FHeroes 2 at 1920x1080 gets 1-2 FPS on Windows, while JHeroes 2 gets around 20 FPS in my testing), but the CPU usage is too high! It uses more than 100% of one CPU even though I don't start any separate background thread. Profiling, I found that JavaFX creates a separate thread and all commands go to this thread, which I think generates OpenGL commands that are executed back on the main thread. The bad part is that CPU usage is really high, and there is no obvious way to add sleeps or anything else to reduce it.

So I did what every other developer would do if he doesn't want to spend another year of his free time (getting here already took two months of it) only to find at the end that it doesn't work as intended: I tested similar technologies using a VM, and the only two technologies that seem to work well enough (on my machine) are:
- OpenGL using OpenTK
- Mono + Cairo on Linux (on Windows it runs at around 10-20 FPS, while on Linux I get between 30-60 FPS, and I also have the possibility to add sleeps, as all code runs on the main thread)

So, based on these results, as I don't want to rewrite all my code to use OpenGL and to rewrite it from Java into C# (maybe in the future I will have the will, but for now not so much), I also suspended my work on JHeroes 2.

Also I want to thank all contributors to CR, and JetBrains for offering a free Resharper license in support of open source.

So, as a New Year's resolution, I hope to do something else: maybe I will look into small items to improve in CR (very unlikely), or I will continue a Mirah-like language implementation for .Net (named Calcium, but more on this later).

But what I can say about 2014 is that I found some reassuring things which make work on CR unnecessary:
- Microsoft open-sourced a lot of its stack
- it has a "native compiler" which will always be better supported than CR will ever be
- Mono also matured a lot in the last year. As for me, it still doesn't have full optimizations (CR still has, in my understanding, some optimizations that neither .Net nor Mono has: especially escape analysis, "pure function evaluation", and things like that), but using the LLVM backend, especially on Linux, it is really a great platform

Saturday, December 6, 2014

How to speed up your cold application startup?

This entry is only about practical solutions to improve startup, especially for code running on virtual machines.

One of the heaviest things that starts with your application is the virtual machine itself. Your VM will extract the JAR or the assemblies, will scan them for integrity, will compile the hot parts and so on. So the easiest thing to do is to put up a splash screen as fast as possible and give the background time to load all these files. Also, you can use NGen, and in the future Java's Jigsaw or Excelsior JET, or, for the very courageous, CodeRefractor.

But this entry is still not about that. Let's say your application starts (at least most of the code is in memory), but users still notice lag because you process XMLs, or you have some UI and processing all of it is really expensive, or sending data over the wire is expensive. So, in short: your application does not seem responsive.

So right now I want to stress what to do to make your application start faster.

I. "Index and merge everything"

I. 1. Merge XML data

In my experience you very often have data that repeats itself. Especially if you serialize a lot of items (like a database table that you send over the wire through a web service, but this is also true on disk), many items repeat.

Even in cases where you would not expect repeated lines and patterns, data repeats itself, for example in this file: http://upload.wikimedia.org/wikipedia/commons/f/fd/Ghostscript_Tiger.svg

This is a section of the SVG code (around the big green eye):

<g id="g242" stroke-width="0.5" stroke="#000" fill="#FFC"></g>
<g id="g246" stroke-width="0.5" stroke="#000" fill="#FFC"></g>
(...)
<g id="g262" fill="#e5e5b2"></g>
(...)
<g id="g290" fill="#FFF"></g>
<g id="g294" fill="#CCC"></g>
<g id="g298" fill="#000">
    <path id="path300" d="m94.401,0.6s-29.001,12.2-39.001,11.8c0,0-16.4-4.6-24.8,10,0,…0,0,4-11.2,4-13.2s21.2-7.6,22.801-8c1.6-0.4,8.2-4.6,7.8-7.8z"></path>
</g>
<g id="g302" fill="#99cc32">
    <path id="path304" d="m47,36.514c-6.872,0-15.245-3.865-15.245-10.114,0-6.248,8.373….446,5.065,12.446,11.313,0,6.249-5.572,11.314-12.446,11.314z"></path>
</g>

You can see that a lot of "g" tags, a lot of "id" attributes, and a lot of "fill" elements and attributes appear all over the XML. If you make your tree data structure keep a list of strings (and a dictionary from these strings to their index), a lot of the data becomes smaller.

So instead of a:

class XmlNode
{
    public string Name, InnerText;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
    public List<XmlNode> Children = new List<XmlNode>();
}

define:

class Texts
{
    public List<string> Data = new List<string>();
    public Dictionary<string, int> QuickIndex = new Dictionary<string, int>();
}

class XmlNode
{
    public int Name, InnerText;
    public Dictionary<int, int> Attributes = new Dictionary<int, int>(); // it can be just an array of integers, where every 2 items mean key/value
    public List<XmlNode> Children = new List<XmlNode>();
}

And to get any text, you will do something like:
string name = texts.Data[xmlNode.Name];

It probably sounds silly at first and overly verbose (of course it can be wrapped with a lot of design patterns; the point here is to show the principle at work!).
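For completeness, here is a small sketch (my own illustration, not from any existing patch) of the missing half, turning a string into an index while the tree is built; it assumes the Texts class defined above:

int GetOrAddText(Texts texts, string value)
{
    int index;
    if (texts.QuickIndex.TryGetValue(value, out index))
        return index;
    index = texts.Data.Count;
    texts.Data.Add(value);
    texts.QuickIndex.Add(value, index);
    return index; // every later occurrence of the same string reuses this index
}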

Someone may think: "it doesn't matter, "g" as a string is just 1 character long, so we waste 2 bytes because a string is Unicode", but in reality the memory usage of small strings is much, much bigger. A string contains the length (4 bytes), the hash (+ 4 bytes), and some implementations add an extra null terminating character. So the minimum 1-character string is in fact around 12 bytes. Even more, the object overhead is, if I recall right, 16 bytes. In Java memory is padded to 8 bytes, so even if you have 12 + 16 = 28 bytes, in fact you will use 32 bytes. If you use an index, for every duplicate you trade 32 bytes for an integer (4 bytes).
Look also at the EXI format and the OpenEXI implementation.

I. 2. Merge tree like data

This is in a way a similar construction, and you can use the same trick for a tree of data, but this is useful not so much at startup as for minimizing memory usage later.

The solution is to use reflection and a shared dictionary of strings, and every time you find a string that is already used, replace it with the previously seen instance. This patch hugely improves the memory used by a NRefactory parse tree, but the code can be changed minimally for any other data structure: https://github.com/icsharpcode/NRefactory/issues/448
(NB: the patch was not accepted upstream, but the idea remains true).
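A minimal sketch of the idea (my own illustration, not the actual NRefactory patch): walk the string fields of an object with reflection and replace duplicates with one shared instance per distinct value; recursing into child objects is left out for brevity:

using System.Collections.Generic;
using System.Reflection;

static class StringMerger
{
    // one shared instance per distinct string value
    static readonly Dictionary<string, string> SharedStrings = new Dictionary<string, string>();

    public static void MergeStringFields(object instance)
    {
        if (instance == null) return;
        var fields = instance.GetType().GetFields(
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
        foreach (var field in fields)
        {
            if (field.FieldType != typeof(string)) continue;
            var value = (string)field.GetValue(instance);
            if (value == null) continue;
            string shared;
            if (SharedStrings.TryGetValue(value, out shared))
                field.SetValue(instance, shared); // duplicate: point to the shared instance
            else
                SharedStrings[value] = value;     // first time we see this text
        }
    }
}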

I.3. Merge pixels using palettes

20 years ago pictures used a palette of 256 colors (sometimes just 16). This by default merges a lot of data; in fact, if you have more than 20x20 pixels (which is virtually every picture you may use), you will save bytes of memory and disk.
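A hedged sketch of building such a palette by hand for ARGB pixels (it assumes the picture really has at most 256 distinct colors):

using System.Collections.Generic;

static class PaletteBuilder
{
    // returns one byte per pixel; the distinct colors end up in "palette"
    public static byte[] BuildIndexedPixels(int[] argbPixels, List<int> palette)
    {
        var lookup = new Dictionary<int, byte>();
        var indices = new byte[argbPixels.Length];
        for (var i = 0; i < argbPixels.Length; i++)
        {
            var color = argbPixels[i];
            byte index;
            if (!lookup.TryGetValue(color, out index))
            {
                index = (byte)palette.Count; // valid only while there are at most 256 distinct colors
                lookup[color] = index;
                palette.Add(color);
            }
            indices[i] = index;
        }
        return indices; // 1 byte per pixel instead of 4 bytes of ARGB
    }
}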

II. Pack everything


Packing data is known for taking a long, long time and is used especially for big data, but in fact, when did you last measure it?

As I wrote some months ago, I'm trying to improve my Java skills by porting a C++ application to Java (an FHeroes 2 OSS clone).

Heroes 2 has two big indexed files, named "Heroes2.agg" and "heroes2x.agg", which contain all the assets. The original "Heroes2.agg" is around 40 MB on disk, and after I compressed it using Gzip (which is included in almost any virtual machine, and I found it included with Qt as well), it shrinks to 20 MB.

So, how long does it take to read the unpacked file from the SSD, and how long would it take to read the index and unpack all the individual items?

On my machine (a fairly old first-generation i5-M) it takes 140 ms to read a 40 MB file (so, at least in this "benchmark", the disk reads at 285 MB/s), while using the packed file and unpacking all the items takes 100 ms (the logic of the game allows unpacking items only when needed, so it may be even faster, but I wanted to compare apples with apples as much as possible). This is almost a 30% time reduction to load all the assets that may be needed, at least in this old game. Or, counted differently, it reads the original data at an effective speed of 399 MB/s.
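The measurement above is from the Java port (using java.util.zip), but the trick is runtime-neutral; a minimal C# sketch of the same pack-once, unpack-at-startup idea, using System.IO.Compression, could look like this:

using System.IO;
using System.IO.Compression;

static class PackedAssets
{
    // done once, at packaging time
    public static void Pack(string sourceFile, string packedFile)
    {
        using (var input = File.OpenRead(sourceFile))
        using (var output = new GZipStream(File.Create(packedFile), CompressionMode.Compress))
            input.CopyTo(output);
    }

    // done at startup: read the smaller file from disk and unpack it in memory
    public static byte[] Unpack(string packedFile)
    {
        using (var packed = new GZipStream(File.OpenRead(packedFile), CompressionMode.Decompress))
        using (var memory = new MemoryStream())
        {
            packed.CopyTo(memory);
            return memory.ToArray();
        }
    }
}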

III. Combine them

When you give an application to your customers, or in general when you want people to think of your applications as snappy, make sure you optimize startup by increasing the CPU load (by computing what data is common) and by reading this merged data plus a big chunk of indexes into the original data. If you have a big data set, compress it at save time and decompress it at startup.

The same principle applies to 3D objects (where most 3D coordinates are shared: vertices, normals, texture coordinates, etc.).

In fact, the more you merge the data and use binary formats, the more you will be surprised how fast your application goes. Mostly because, even with an SSD (and it is even worse on "traditional" hard drives), the disk is by far the slowest component, so optimizing for it, at the expense of CPU, is the best way to wait less when you run your code.

IV. Conclusions

These solutions are practical, but they should be used only when you measure, at least before and after, to see if there is a speedup. Maybe before optimizing your image loading algorithm, you should ask whether you need to load the picture at all! And if you do load it, maybe you don't care in absolute terms whether your picture uses just 256 colors.

It is also important not to forget about hashing if you have a huge repository of assets. This is a different talk altogether, but sometimes it is not the loading of data that is expensive but the disk seek for it. If this is the case, try to split your data using hashing and dictionaries. In simpler terms, if you have never done this kind of hashing, just do this: create three levels of directories based on the first three letters of the file name. So "asbest.pic" should sit under the folder a/s/b as a/s/b/asbest.pic. This can greatly reduce the lookup overhead of your application, as the OS has to search much less to see whether a file/folder really exists.
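A small sketch of that folder-splitting scheme (the three-letter prefix described above; padding for very short names is my own arbitrary choice):

using System.IO;

static class AssetStore
{
    public static string GetHashedPath(string rootFolder, string fileName)
    {
        var prefix = fileName.Length >= 3 ? fileName.Substring(0, 3) : fileName.PadRight(3, '_');
        var path = Path.Combine(rootFolder, prefix[0].ToString());
        path = Path.Combine(path, prefix[1].ToString());
        path = Path.Combine(path, prefix[2].ToString());
        return Path.Combine(path, fileName); // e.g. "asbest.pic" -> "<root>/a/s/b/asbest.pic"
    }
}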

Saturday, November 29, 2014

Why Static Code Analysis is important... and why we don't care

Static code analysis is crucial today, as software is much bigger than ever. It is even more important because you have many people considering code "better" based on their previous experience, but they don't agree on what "better" means.

So, based on this, it is very likely that a static code analysis tool will help you a lot to see these errors, and to find them early, sometimes in a live session (like the Resharper tool - also written as R# - a great C# extension from JetBrains).

But what is the main reason companies don't use them (or not at a big scale)? I found very often (in my usage of R#, but I think it is similar for many other companies and tools) more systemic causes:
- (some) developers don't like to be nagged with a lot of errors/warnings when they want to finish a task fast. Variables not named consistently, or a namespace that doesn't follow the location on disk, are less of a concern. Even more, some libraries come with deprecated fields, and tools like R# will warn you that the code is broken (or at least uses obsolete fields), which is a distraction given that users will not change libraries which are "legacy"
- these tools offer solutions which are sometimes at best strange or even hard to understand for a user who doesn't understand/accept the technology. For example, Linq syntax in C# is weird for users who don't know it, and even though applying such best coding practices makes your code less prone to side-effect errors, the developers need to know Linq
- there is an upfront cost to make these tools stop complaining (much), but some companies focus on: let's implement this feature, then the next feature, and when you say "I want to clean the code based on a tool", it is a very hard sell. It is more likely a question of: "will this code work better/faster after this?", and answering honestly (like: "no, it will just make the code more readable and less likely to have bugs in the future") will not change the decision
- they are fairly costly - price is always relative, but if you take a license for every developer, things get expensive really fast, and there is no visible advantage. Sometimes the tool is an easier sell if you say: "we need a better refactoring tool", and management may understand that (as R# is a very good refactoring tool), but otherwise it is a hard sell to management
- small codebases very rarely need these tools; the bigger the codebase, the more likely you are to need these kinds of checks. The reason is that in a small codebase you can keep in your head everything the tool checks, and even if you forget a null pointer check, it will be found early. If you need a tool that, for example, processes a batch of pictures and produces a document, you may need very few refactorings, and most likely the default Visual Studio facilities are enough

As for me, when I was a lone freelance developer and was paid (mostly) per finished feature, a license for R# or Visual Studio was very well justified: Resharper's full navigation, and not wanting to hunt for all the weird bugs in my code, was an understandable price I had to justify (to myself).

But later, when I worked in bigger companies, I did not see people using R#. The closest thing to R# is NRefactory, which brings a similar (but much buggier) experience with many, many limitations. So if you don't mind working with the much buggier experience of Xamarin Studio or SharpDevelop, please go for it, at least to find out what this static code analysis is about!

What I suggest to you anyway is to use static code analysis, not least because it is easier to let a tool check whether you forgot a null pointer check than to do it yourself, or, even worse, have the user notice it on their machine.

Monday, November 3, 2014

Poor Man's Parser in C#

I am sure that anyone dreaming about writing a compiler will look at least once into parser generators or similar tools. But some may find them daunting and, worst of all, some of them are very hard to debug and set up.

So how do you write your own parser which is easy to maintain and write?

Ok, so first of all, please don't blame me if you care about a performant parser engine or you want to do it the "correct way". Also, I found this strategy to work with somewhat simple languages (let's say Pascal, MQL (MetaTrader's trading language), C, and things like that).

The parser has the duty to make a tree based on your tokens.

Steps:
- write a lexer that gives you a list of tokens. You can start from this implementation. It should be very easy to change it, for example to make some IMatcher-s accept specific words
- create a Node in a tree with arbitrary number of children as following:
    public class BlockAstNode {
        public List<BlockAstNode> Items = new List<BlockAstNode>();

        public TokenDef InnerToken { get; set; }
        public bool IsTerminal { get; set; }
        public BlockAstNode()
        {
        }
      
        public BlockAstNode(TokenDef token)
        {
            InnerToken = token;
            IsTerminal = true;
        }

        public BlockAstNode BuildNonTerminal(int startRange, int endRange)
        {
            var result = new BlockAstNode ();
            for (var i = startRange; i <= endRange; i++)
            {
                var node = Items[i];
                result.Items.Add(node);
            }
            Items.RemoveRange(startRange, endRange- startRange+1);
            Items.Insert (startRange, result);
            return result;
        }
  }
- fill a "tree" with one root and as children create AstNode children for every token (using the constructor with TokenDef, where TokenDef is your token class)
- make a bracing folding. Most major languages support a folding strategy. In Pascal is: "begin .. end", in C is { to } or for Ruby/VB: <class> to <end> or <def>..<end>. Every time when you find a block that starts with your begin brace token, you fold it using BuildNonTerminal method, and the result node is placed instead of the full node. Try for simplicity to find the smallest block that can be folded, and repeat until you cannot fold anymore.
- add more folding strategies for common cases: "fold all comments tokens"
- add fold for reserved words tokens: class, while, if, (interface/implementation) and they are separated iterations that visit the tree recursively and simplify the major instructions
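To make the brace-folding step concrete, here is a hedged sketch of one folding pass over a node's children; TokenDef.Kind (a string here) is my assumption and is not part of the snippet above:

static bool FoldInnermostBlock(BlockAstNode parent, string beginKind, string endKind)
{
    var lastBegin = -1;
    for (var i = 0; i < parent.Items.Count; i++)
    {
        var child = parent.Items[i];
        if (!child.IsTerminal)
            continue;
        if (child.InnerToken.Kind == beginKind)
        {
            lastBegin = i; // remember the deepest (innermost) "begin" seen so far
        }
        else if (child.InnerToken.Kind == endKind && lastBegin >= 0)
        {
            parent.BuildNonTerminal(lastBegin, i); // fold "begin .. end" into one node
            return true; // tell the caller another pass may be needed
        }
    }
    return false;
}

// repeat until nothing can be folded anymore:
// while (FoldInnermostBlock(root, "begin", "end")) { }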

Why does this work better than regular parser generators?
- every strategy is very easy to debug: you add multiple iterations, and if you have a crash in any of them, you will most likely have a mismatched index (off by one, but with no ugly generated code)

Enjoy writing your new language!

Monday, October 13, 2014

Java Optimizer - a primer

After I posted the previous blog entry about Java running fast (at least for a mini benchmark: rotating a picture), great feedback appeared in the comments from a user named V.

He exposed one problem with my methodology, the fact that GCC behaves better on Linux: he gets around 151 ms on Windows (also my testing platform) and a great time of 114 ms (on 32 bit) on Linux. I have to say: this is great news (at least for GCC users on Linux)! Java executed the same task in around 130 ms on his machine (regardless of the OS).

So, as an introduction: why does Java get such good numbers even though it compiles the code every time it reads the "bytecode" file and it "interprets" all the code?

In a very simplified form, Java has this diagram:

Don't worry about the many steps; the main thing to know is that when you "execute" Java, there is in fact a program, named the Java Virtual Machine, which reads all the instructions your program has (named Java bytecode), typically packed as a Jar file (or many .class files).

The diagram above also shows a part which in fact doesn't run anything but makes sure your program is correctly formatted.

So what happens when you try to write something like Hello world in Java?
1. you write your program.java,
2. you compile it using javac; as a result you will get one .class file (which you can pack into a .jar if you want) which describes your original program as stack operations (so instead of a = 2+3, it will be rewritten as: "push int 2", "push int 3", "add int", "store int into a")
3. you tell java.exe (on Windows; just java on Linux/OS X) to "run" your .class file
4. java.exe will first validate that your .class is well formatted
5. Java will interpret your program (meaning it will run the .class instructions one by one) until a threshold is met (typically 10,000 "loop edges", meaning 10,000 times that the same sequence of instructions is interpreted)
6. if your hot code reaches this threshold, it will be compiled by what are named the C1 and C2 compilers (also known as the client and server compilers). The main differences (which I found presented publicly) between C1 and C2 are that C2 does more aggressive inlining and C2 also uses a slower technique to allocate registers. On 64-bit Java the C2 compiler is enabled by default, and from Java 8 the "tiered mode" is enabled by default (it is not clear to me yet whether this is for both 32 and 64 bit or just 32 bit, I never tested it), which means that "hot-ish" code is first compiled with C1 and then with C2.

So the "tricks" to get fast code in Java are:
- make as many repetitions as possible of hot code: don't run many branches that get "warm" but not "hot enough" - they will not be compiled
- write code that doesn't do useless work in that loop
- if you really need better performance, use C2 (-server) or tiered mode. Both are enabled by default on JVMs from Java 8, so just by updating your JVM you get good performance defaults
- don't execute a lot of method dispatches (also known as virtual calls): try to make your hot loop use just one class; C2 will likely inline these calls correctly. Sometimes C2 inlines them speculatively, so you can get great performance even if you don't care about virtual calls.

Some big speedups that Java gets (which can be even bigger than what C++ gets) are "over the object file boundaries", meaning that in some cases C/C++ code runs slower because C++ cannot inline from one object file to another (right now it is possible to some extent), and it is certainly impossible to inline "hot code" from a DLL/libSO. Even more important, Java keeps at runtime the list of loaded classes, and even if you use virtual methods everywhere, if no loaded class overrides your class, the calls will always be dispatched as early-bound (non-virtual) calls.

One very interesting talk which discusses the optimizations that Java does is this one by Gil Tene: it is a great way to understand what most optimizing compilers do, so it is a great talk even if you are using C++ (at least the first half shows typical optimizations that CR, .Net or any C++ compiler of the last 10 years will do for you!).

Last but not least, Java has some performance pitfalls, some of which happen to be C++ pitfalls too:
- CPU stalls: when your data does not fit nicely into the CPU cache, your code will typically run 10 times slower (or even more on CPUs with lower memory bandwidth, like ARMs). Yes, you read correctly: not 10% slower, but 10 times slower. Making your algorithm read and write consecutive addresses in memory will make it run much faster. So when you write your program, try to keep the data of the hottest of the hot code within about 256 KB, for both reads and writes. Java makes this a bit harder as it doesn't have the equivalent of "value types", so an array of 3D coordinates is in fact an array of individual pointers to separate memory structures. For this reason, I would recommend packing coordinates into primitive arrays and keeping an index in a separate class to do your processing. But watch out for this problem: it can hurt your C++ performance also!
- Java has a startup lag and you still have to wait for Java to warm up: this means it is not ideal for games, but there are tricks, without tricking Java, to run well enough for games: start your game engine as the very first thing. To my knowledge this is what Minecraft does (I don't play it, sorry :( ). This lets the JVM learn in the first seconds what your code does and gives it a chance to compile most of what your game will do
- some Java APIs (mostly web services) tend to be "fat", meaning you have to process XML or JSON to read your settings, to communicate and so on. I recommend using faster replacements for fetching remote data (without optimizing the dataset yourself): instead of sending raw XML (or even saving it on disk), use OpenEXI, and for sending data over the wire, if you have a more complex data set, I found KryoNet to run very fast. This is also true for C++ coding, but it is a more common problem in the "Java mindset", as far as I've noticed
- Java gives you mostly the same performance as fairly good C++ code (maybe a bit slower, especially against C++ on Linux, but not by as much as you should worry about, as V showed me), but you have to improve the other slow parts: don't be smart and write your own threading solution (it is fairly common in the C++ world to rewrite everything); use existing solutions as much as possible: JavaFX has accelerated graphics for the UI, for example, and is excellent for most UIs. If you find it slow because it parses a large XML for the UI, create just the small expensive UI parts by hand, but don't go overboard! I found, for example, in jHeroes2 that the original game file named Heroes2.agg is around 40 MB, but it is faster to read it if I compress every individual item (this big file is a collection of small files) into something like Heroes2.opt (a 20 MB file): reading + unpacking is faster (at least on my machine) than reading the 40 MB. This is because there is such a huge difference between CPU and disk speeds
- last but not least, the GC: Java's allocator is much faster than the C/C++ heap allocator, but it is called for almost every object. C/C++ and C# can allocate objects on the stack (so they are "freed" automatically when the function exits). So try not to allocate objects for trivial data, and also don't keep objects too long in memory. The "long pause" (named full GC) in Java is always proportional to how much data is alive at the moment the GC triggers. If you find that you have too many pauses, try to double or triple the memory of your application and the GC should be called at most half as often, or use G1, an adaptive memory collector. If you write games, I would expect you will not allocate much (excluding a very trivial amount of very short-lived data such as key strokes), so the full GC is never called.

I hope this blog entry was useful, and I hope (as I may not add many things to CR in the near future) to add some other entries like this. Would you, the reader, find it useful to see some YouTube "how to" programming tutorials? Would you follow them?

Sunday, October 12, 2014

A case for Java on Desktop (with a benchmark)

When I started CodeRefractor, Windows 8 had just been released, and .Net looked like an unstable platform to invest in (for reasons like: the Xaml/WPF team was dissolved, the Metro runtime was a new way to get locked into Microsoft's garden, or Xamarin, but don't get me started there either).

In the last year many of the shortcomings were fixed, mostly by Xamarin's team and the related projects: MonoGame runs nicely on most platforms, Xamarin added some optimizations (which were part of CR :) ) and made the performance story better, but also Microsoft's own story is much better in the .Net world (RyuJIT, an ahead-of-time strategy which, when finished, would give a similar performance profile to CR).

So why run Java on the desktop? Here is my list of today's top reasons why you should consider Java over C++ or C# on the desktop:
- all major IDEs of Java are free of charge (IntelliJ Idea, Eclipse and Netbeans)
- the UI framework (JavaFX) is decent and has all the features you would expect of a top-notch UI framework, including CSS styling and GPU acceleration
- Java 8 fixes two of the biggest coding/behavior issues: lambdas make working with UI events, threads etc. sane! The language is too repetitive for my taste, but that makes it very readable; there is also the removal of PermGen (a nifty bug-by-design in my view)
- Java 8 added the Stream API, whose biggest feature for me is a simple way to write parallel code.

So fine: if you grant me that Java as a language/framework works like almost all languages of the 2010 era, why not still pick any of the other languages? As for me: cross-platform support and top-notch performance are really huge features. Shipping one Jar that is supported on all platforms is a huge deal.

So I made claims about performance; isn't Java slow, especially on the desktop? I would say: "it depends" and "if you know how Java works, Java runs fast".

So this weekend I made a simple Qt application, and I wanted to do something that C++ compilers know how to optimize: rotating a picture by 90 degrees (reference hardware: Core i5-540M, Win64, GCC MinGW 4.8.1 32-bit, Java 8.0u20 64-bit).

This is the "hottest of the hottest" loop that would rotate a picture in C++ (without using assembly):
void MainWindow::rotatePixelData(int* __restrict src, int* __restrict dest, int width, int height){
    int newWidth = height;
    int newHeight = width;
    for(int x=0; x<width; x++) {
        for(int y=0; y<height; y++) {
            int newX = height-1-y;
            int newY = x;
            dest[newY*newWidth+newX] = src[y*width+ x];
        }
    }
}

The best time for rotating a 4K picture is 144 ms (the first run takes around 161 ms).

The Java code is the same, just without "__restrict":
static void rotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        int newHeight = width;
        for (int x = 0; x < width; x++) {
            for (int y = 0; y < height; y++) {
                int newX = height - 1 - y;
                int newY = x;
                dest[newY * newWidth + newX] = src[y * width + x];
            }
        }
    }

Typical time: 126 ms (the worst run, though not the first, was 146 ms).

Of course, not wanting to prove to myself that I show bad benchmarks, or that I didn't tune the compiler enough, I wanted to make the difference in Java's favor significant:

public class RotatorParallel {
    static List<Integer> buildSteps(int steps){
        List<Integer> result= new ArrayList<>();
        for(int i = 0;i<steps;i++){
            result.add(i);
        }
        return result;
    }
    public static void quickerRotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        List<Integer> steps = buildSteps(height);
        steps.parallelStream().forEach((Integer yBoxed)->{
            int y = yBoxed;
            for (int x = 0; x < width; x++) {
                int newX = height - 1 - y;
                int newY = x;
                dest[newY * newWidth + newX] = src[y * width + x];
            }
        }
        );
    }
}

This code is clearly longer and more contrived; what it does is split the computation per row, and by using the parallel Stream API I mentioned earlier it runs on all cores.

Most times are 62-64 ms and the slowest of the slow are 70 ms (so it is exactly a 2x speedup).

Some things to consider:
- the C++ code is 32-bit (yet I optimized it as much as I could without using assembly), while Java uses the same Jar and by default I was running it on 64 bit. This is a case where, with no setup, Java gives better numbers overall. If you compile the C++ for 64 bit, your code will not run on machines with 32-bit CPUs; I would expect it to then run close to Java, but not more than 5 ms faster (and it is also possible that it runs slower)
- Java optimizes based on hot code, and rotating a picture is hot code
- using the Stream API has an overhead, so split the processing into bigger chunks if you make it parallel. Most of the overhead comes from merging data (if you code in a map-reduce style) and from some "invisible" boxing; even in the shown sample, where the chunk is a whole row of the image, the boxing of a simple "int" is there, written in an explicit form
- the code is not CPU-cache friendly: if you split the picture into chunks (like squares of 128x128) and rotate those, I would expect both Java and C++ to get better by a factor of 2, but that would not change the Java/C++ relation
- having multiple cores makes the code faster in proportion to the number of cores. If you use C++, you can criticize me with: why didn't you use OpenMP? And I can say: yes, except that OpenMP does not work on all of today's compilers (I'm talking here mostly about it missing in Clang/Mac OS X), so you would likely have to make your own mini multi-core framework cross-platform.

Still, I think that Java is not optimal to deploy (meaning: you have to ask users to install the Java runtime) and the developer experience has to improve (hopefully it will improve with the next Java, named 9, via "Project Jigsaw" and "sjavac"). As for me, Jigsaw (which splits the Java runtime into modules) is what has to be added in the next release; at least in the public presentations they predicted that the startup of Java applications will improve further. And this matters a lot for the impression that Java is slow.

Friday, October 10, 2014

Don't trust (VM) benchmarks and how to spot bad benchmarks (II)

Here is a somewhat more realistic yet still very bad benchmark:
http://youtu.be/C3-p-Txnr4Q

In fact this is in a way the best benchmark setup you can get:
- it presents the same source code
- it runs on the same device

This benchmark is simply bad because it doesn't explain nicely what it is comparing.

So how can you run the same code in two environments (like two compilers) and get a 3x slowdown? When you don't compare the same thing. And here is why:
- the first contender is the Xamarin "VM", which is in fact IKVM running on top of Mono, which runs on top of LLVM
- the second contender is RoboVM, which runs on top of LLVM

So why such a great difference? There are several reasons: IKVM translates Java bytecode into CIL (MSIL) bytecode, Mono's optimizer runs this bytecode through a few passes, and after this it is handed to LLVM. These translations add overhead, as they were not designed for each other. The biggest hit, I would predict, is in the IKVM-to-CIL layer.

How RoboVM works: RoboVM takes the Java bytecode and optimizes it using an academic framework (named Soot), which at the end outputs a faster version of the original Java bytecode and emits the form that LLVM wants/expects (SSA form).

In simpler terms: Xamarin is a "quick interpreter" of the Java bytecode, while RoboVM is an optimizing compiler of the Java bytecode. So in the end, it is to be expected that Xamarin runs slower. Not because Mono produces slow code (even though it runs a bit slower than .Net, it produces fairly decent code), but because it is very likely that IKVM does not produce 1:1 code for the Mono platform.

Understanding your compiler/platform is critical to making your code run fast, not at the assembly level but in terms of what the code expects, and then you can get really nice numbers.

To be clear: RoboVM is the best way to run Java on iOS by a wide margin, but I would say that if your decision is, for example, between RoboVM and Xamarin (and the licensing costs of Xamarin are not scaring you :) ), and you want to pick RoboVM because it runs faster, you will likely be wrong. Write C# on Xamarin and you may well run (a bit) faster than on RoboVM.

I will make a future post about why I think Java has a great opportunity in the (near) future on the desktop, performance-wise, and I will also write a post about how to write fast Java code and things like this.

NB. Is any reader running a Pi at home? Or any similar card-sized ARM-based desktop? Can you run a small benchmark if I write some code for you? Contact me on Google Plus if yes. I am really curious how it performs, for when I write the next post.

Thursday, October 9, 2014

Announcing my next F/OSS project

And it is gonna be written in Java.

Please, if you are interested in either Heroes 2 or Java, follow the updates of this new F/OSS project I've started to contribute to.

Tuesday, October 7, 2014

Don't trust (VM) benchmarks and how to spot bad benchmarks

Introduction

Around the year 2000 computing changed drastically, and even more so with the launch of the multi-core era around 2005 and after. Right now we are in another "revolution" of "computing". And every time you will see new benchmarks where a generation is 25% faster (like the A8 is faster than the A7) or 100% (like the A7 is "up to" 100% faster than the A6), and so on!

But, as with many things in real life, benchmarks are critical to show whether users can have a good experience:
- you want your image effect to finish fast. Even more when there is a movie to be processed, and a movie is a huge collection of photos: you care about the language used and the time your animation takes to finish
- you want all cores to be used, especially when 4 cores look to be the norm today (or at least 2 cores with hyperthreading), so you want all your hardware to be used
- you want to save battery for your mobile devices
- and many other reasons where, when things go fast, you will feel happier than when things go slow

What you should care about

But when you as a user want all of these, sometimes you miss the mark. I simply mean that when you do your tasks, you may look in the wrong direction:
- the disk is many, many times slower than memory. So let's say you have to process, as fast as possible, a picture you've just received on Skype from a friend, and he/she asks you to apply Sepia tones. Starting Photoshop (or Gimp) can take literally many seconds, if not a full minute (if you have many plugins loading at startup). So using Paint.Net will open the picture faster, you can select the Sepia tone faster, and eventually you can send the picture back faster. This is not to say that, if you had 100 pictures to load, Photoshop (or any other program for that matter) might not theoretically finish them faster, but it is often the case that starting a smaller program makes your task finish sooner
- on a similar note: do you have enough RAM (system memory)? When your system feels slow, it is very often one of two factors: you have too little memory in your system (and memory is really one of the most disproportionately cheap components) or your CPU is really outdated (like, say, you want to process big pictures with a Pentium 4 at 1.7 GHz from 2001). But even in the second case, where the CPU is what makes your system slow, it is still likely that adding memory will make it faster and usable even 10 years later!
- if your application doesn't care much about RAM and disk, it is (very) probably a game, so the next question is: do you have the best video card within your budget? It can be an integrated video card, but it should be the fastest you can afford. Very few games are CPU-limited, and even if the CPU is too slow and you get just 30 frames per second (so you can "feel" gaming lag), you can increase the game detail and you will lose very little performance (if you have a fast enough video card); it will all decrease to, say, 28 FPS. So there is nothing to lose here!

What Internet says

Ok, that being said, I've noticed some very bad benchmarks on the Internet lately, and they are made basically to show shocking numbers: like Android L running basically slower than Android 4.4, or the reverse (these are Google's numbers) in the attached picture, or the latest of them, a bit old, but with a big surprise.


What is funny in these 3 sources are basically these conclusions:

From source (1) you will "understand" that performance decreases slightly between Android K and L (the criticism being basically that it is JavaScript code), source (2) states that this is not true and that Android L runs a bit faster than Android K, and source (3) states that JavaScript is in fact the fastest, even faster than Java.

So why this mess? As a (somewhat) compiler writer, I understand the following: all compilers have tradeoffs, and sometimes (like in preview builds) debug information is enabled to get better diagnostics. Also, when you compare, you have to make sure you compare the same thing - that is the item that matters most.

Investigating the claims

So, is JavaScript faster or slower than Java? For people following this blog, I can say that CR is faster than .Net, but I have to add some qualifiers:
- CR doesn't handle exceptions at all, including bounds checking and NPEs, so if you have a lot of iterations you can get really awesome generated code, but it compares minimalist "C-like" code with .Net's full code
- CR anyway uses the C++ allocator, which is roughly an order of magnitude slower than .Net's GC, so if you allocate a lot of objects .Net can be faster even though .Net has exceptions (the same story goes for Mono). CR has optimizations to help with this (mostly escape analysis, but users have to write code in a compiler-friendly way).

Similarly: JavaScript as of today is not faster than Java, unless you write Asm.js (in which case I would expect Asm.js to be slightly faster in very few cases, at the same abstraction level as Java), but overall you should expect C/C++ code to run at least 50% faster than JavaScript with Asm.js, and even faster compared to plain JavaScript (like 2-3x faster). This happens on a desktop-class CPU only; on tablets/phones the difference is wider, because the CPUs are weaker, so the VMs there do fewer optimizations. Here is the source I am using, which is known for its impartiality.

Java most of the time (in low-level math) with hot loops gets around 90% of C++ performance. In my experience, CodeRefractor was a bit faster (11%) in a very tight loop with no special hand-tuning of the code (but still tuning the compiler flags as much as possible), so I can confirm this is a consistent experience.

How can it still happen that Java is slower than JavaScript in the 3rd source? Simple: Dalvik is not a Java VM (in the strict sense), just as CodeRefractor is not .Net (or Mono). Dalvik makes many design choices which are strange in the JVM space: it uses a register-based VM (like CodeRefractor), which means that its interpreter is faster (even with no compilation you should expect around 60% faster interpretation; there are scientific papers which found that register-based interpreters work faster than stack-based ones; the Google team claimed they are around 2x faster than the JVM, which is consistent, even if a bit on the high side, with the scientific papers), and it uses a JIT (dynamic compilation) just for small subsections of the program. I would expect that they get up to 2x faster also because Google documented that they pre-optimize the bytecode: the interpreter runs not the bytecode you gave to clients, but a slightly simplified Dalvik bytecode.

The JavaScript VM uses a different technique on Android: it is the V8 JavaScript virtual machine, which uses a quick "direct to native" JIT and after that a more advanced compiler for the hot code. This means that all JS code (even if JS is intrinsically slower) gets compiled.

For the sake of correctness, V8 was not always faster than the JIT of Dalvik, the Android "Java" implementation, but Dalvik was not improved JIT-wise after 2010; most improvements were GC-related, and Android applications started using the GPU, so they use much less CPU to draw the screen, making applications feel faster without Dalvik becoming more CPU-efficient. Do you remember how I started the post? When applications need to feel faster, the best way is to improve all the components, not only the CPU: this is what the Google teams did, and I think it was the most sensible thing to do.

As a conclusion, Android L also introduces ART, an AOT compilation technique ("similar" to CR), meaning that (most of) the compilation happens at install time, making at least the compilation program-wide, and I would expect that as the Google team improves the ART runtime it will very likely become faster (if it is not already) than V8.

So, how to spot bad benchmarks?
Any benchmark which is not made for the domain of the user (typically your application) you should consider a bad benchmark upfront. Even more when it compares "runtimes" or VMs of different classes.

This is not to say that benchmarks are not useful, but you can get tricked so easily (and even experts can make mistakes, like in this amazing talk of Gil Tene about GC latency and "almost"!) into asking your compiler vendor (or JS vendor) to optimize for SunSpider, instead of optimizing for your (web) application (like Facebook, GMail, or whatever you really use from the web).

Soon I will likely announce my next (not CR-related) FOSS project, and it will be in the world of Java. The reason is that I understood why Java can give so much performance if written properly, and the multitude of tools, but this is for another post. Also, I want to make clear that CR will likely not be updated for a while (unless someone is interested in some feature and needs assistance).

Sources:
1. http://www.reddit.com/r/Android/comments/2987ny/just_finished_doing_444_dalvik_444_art_and_l/
2.  http://www.cnet.de/88132773/android-l-und-android-4-4-4-im-benchmark-test/
3. http://www.stefankrause.net/wp/?p=144

Monday, August 18, 2014

Some future plans

A few great contributions arrived in CR, mostly making the compiler work with more string operations (the single limitation is that they ignore the CultureInfo part), but other than this, it is done.

Also, feature-wise, even though it is buggy, this is the most complete CR version ever. You have supported cases of virtual methods, a much more cleanly defined architecture that is incremental, and optimizations that use various scopes, from block level to multi-function dead-code elimination.

So what I would expect to be done in the future:
- expect no further improvements besides bug fixes, unless someone has time to maintain the project
- I will refactor it slowly to support other runtimes (an old wish of mine), especially cases where CR can work with simple use-cases for your particular need. Let's say you want to write a runtime so CR can work with embedded C++ compilers, or you want to define a new class specially for your use-case
- I will support people who want to contribute, and I want to take this occasion to thank the people who have contributed already: Ronald Adonyo and Bogdan Mustiata, two great developers. Ronald made testing the compiler much easier and more transparent and made fixes all around the runtime; Bogdan added support for most string operations.

Saturday, July 26, 2014

Why you cannot beat the compiler...

In the last weeks there was a big rewrite inside CodeRefractor. The main change is basically that we now build, in a multi-pass algorithm, the tree of code reachable from the Main method. The changes were propagated from the frontend deep into the virtual-methods backend (where the code is generated), and things got simplified.

using System;
class A
{
    public void F() { Console.WriteLine("A.F"); }
    public virtual void G() { Console.WriteLine("A.G"); }
}
class B : A
{
    public void F() { Console.WriteLine("B.F"); }
    public override void G() { Console.WriteLine("B.G"); }
}
class Test
{
    static void Main()
    {
        B b = new B();
        A a = b;
        a.F();
        a.G();
        b.G();
    }
}

Let's start with this code. What is CR's output (C++ code) for the Main method?

System_Void _Test_Main()
{

_B_F();
_B_G();
_B_G();
return;
}
 
CR doesn't have full inlining (yet!), but it has analysis steps that check whether variables are used over the method boundaries, and the most important part: it does this safely!

Many people may say that this code is contrived and unrealistic, and I have to agree; it also has to be a "lucky case" where many things happen:
- the methods devirtualize correctly
- the members of the method calls are not used
- the instance of B has no members that are used, so the compiler can remove it safely

But on the other hand, this is also the strength of today's compilers: one month ago CR would not have been able to produce this code, but today it can, so all you have to do to get a bit better performance is update the compiler (or wait for the next release) and voila, you're good to go. Even more, every time the compiler becomes able to devirtualize something in the future, it will devirtualize it, over the whole program, and the most important bit: with no work from your side!

So, every time you write code that targets a compiler (any compiler, not only CR), if the compiler project is improving, all you have to do is wait, or report bugs if the compiler doesn't compile your code, or update your code minimally to avoid the compiler issues, and the code will improve by itself.

Friday, July 18, 2014

Some String support

A simple program as this one:

 static void Main()
 {
        var s1 = "Hello";
        var s2 = " world";
        var s3 = s1 + s2;
        Console.Write(s3);
 }

would not work, and even though it seems to be a simple program, it is in fact a bit harder: CR didn't have a nice way to map types and to implement a mix of CR's internal string and the real string, so there was the need to duplicate the entire string code and make it work.

Right now it works as follows:

[ExtensionsImplementation(typeof (string))]
    public static class StringImpl
    {
        [MapMethod]
        public static string Substring(string _this, int startIndex)
        {(...)}
        [MapMethod(IsStatic = true)]
        public static string Concat(string s1, string s2)
        {(...)}
        [MapMethod]
        public static char[] ToCharArray(CrString _this)
        {(...)}
 }
Adding an implementation of the mapped CrString for the System.String type makes it possible to write implementations naturally, and as the library grows (please spend a little time to support a simplified version of any class you think you can implement in .Net), many programs which were missing just the part that displays some combined strings now work.


Thursday, July 10, 2014

Improving debugging experience by improving internal structures

CodeRefractor has some clear advantages over other static compilers that target LLVM, by targeting C++ instead, but also some disadvantages.

One great advantage is that it is written in .Net and it bootstraps everything needed for output, and the output is a somewhat human-readable result. This also allows simple optimizations to be done, as CR rewrites everything into an intermediate representation (IR) and this representation is simplified with many high-level optimization steps.

In the past an instruction was stored as a pair of values: the instruction kind and the instruction data, and many casts were made based on this instruction kind. That worked, but right now the code is represented as a specific instruction class. So if in the past, for storing a label, you would have something like a pair of (Label, 23), where 23 was the label's name, right now there is a Label class which stores a JumpTo property set to the value 23.

Similar changes were made around methods: in the past there was a similar pair of the method kind and a MethodInterpreter, and sadly the MethodInterpreter type would store, depending on the kind of method, various fields, some of them redundant, some of them unused, and so on. As the code is now split per kind, the experience of debugging CR is better as well.

Last but not least, even though it is not yet fully implemented, it is important to say that types are now resolved using a construction named ClosureEntities, and there are steps (similar to optimization steps) that find the unused types. This will also likely improve the definition of types in the future and make it possible to define the runtime methods more easily. This new algorithm has one problem for now: it doesn't resolve abstract/virtual methods, and fixing that will be necessary in the future.

Tuesday, July 1, 2014

Let's get Visual

About a month ago, Ciprian asked for developers to join CR, and I thought that since I love the .Net ecosystem, I should be able to help a little here and there. After I looked at the code, having never really understood compilers (or transpilers in this case), I needed an easier way to visualize what was going on under the hood, so I could get up to speed with CR.

I then decided to mock up a "Visual Compiler" to both test CR and get a simple call graph.
The Visual Compiler (VCR), once set up, enables seamless development and testing of CR. A lot of improvements can still be made, such as full debugging support or clicking errors to take you to the line of code, but for now it's a pretty useful tool. With it I have been able to catch up quite a bit and add/fix a few runtime features such as virtual methods and interfaces in CR.

VCR has the following features:
- opening files
- enabling / disabling specific optimizations
- running the C# / CPP code and viewing output
- testing the output of C# vs that of CPP
- multiple compiler support (although there is no GUI to set that, yet)

With the knowledge gained, I am hoping to complete a pet project (a C# editor, compiler and interpreter for iOS) that has been in the attic for about 2 years now.

I am currently writing a few more tests and looking into delegates to have CR almost feature complete with the C# language.

Of course, even with this tool, I needed a lot of guidance, which Ciprian has been kind enough to provide. I am pretty optimistic about where this project is heading :)

Oops, almost forgot, this is what VCR looks like:

VCR Running on Windows Server 2012 R2

My next blog post will be on the architecture of CR itself, stay tuned.


Enjoy CR,
Ronald Adonyo.

Wednesday, June 18, 2014

Let's box this and that

Ronald, a great open-source developer, joined CR with his expertise, and I'm excited to see the new features that will appear in CR's future. He also exposed some bugs in the implementation, and I was happy to fix them!

Some features:
- a simple implementation of auto boxing/unboxing is done
- optimizations are defined by category (no command line flag yet), so right now it is possible to run just the "propagation" optimizations, or just the "constant folding" ones, and so on
- some old optimizations were not using use-def information, and right now they do. This means, in short, that they are a bit faster than in the past.

Monday, June 2, 2014

Status: Updates from time to time

This last month I did not have any commits as I was busy, but I will try to give CR some time. What I will try to address is mostly the maintenance of the codebase, as my (free) time is not always something I can guarantee.

So, in short what to expect:
- I want to clean up some constructions: I am thinking mostly of making labels use their hex notation. I will also check whether CR still works, because I made a bigger program and it looks like CR fails on it (so be aware of this :( )
- I want to make a separate (example) backend/frontend. The idea is this: if anyone has time, most likely CR could work with a rich C++ runtime (like Qt) to support some common classes. But it will be hard without knowing what the front/backend's job is, or how to tweak the compiler
- I want to integrate in the future with a C# compiler (most likely gmcs/csc), so you can run something like:
[mono] crcs.exe <...>
and this will run the compiler frontend.
- in the longer term I want to add some unit tests

Please add me on Google+, and I will be glad to help anyone who wants to look into CR and is blocked in understanding how to work with the project. I will be glad to help and assist anyone working with the codebase.

Friday, May 9, 2014

Errors are mostly fixed

By mostly I mean that it is somewhat good to use, and there may be a bug here and there, but the type closure bug is fixed. The documentation is also updated with the new architectural changes.

So, you may take the tip of the Git repo now and play with it again. It is not fully stable, but it generates most of the code where it should be.

Tuesday, April 29, 2014

Small fixes - still hope for external support

In the last days I had a little time to look over the code. In short, I want to explain the main change. In the past the CR compiler worked with a singleton CrRuntimeLibrary logic, meaning there could not be two runtimes at the same time. Of course, this is the typical case, but in the (long-term) future it would be nice to support a minimal runtime for some methods (I am thinking mostly about OpenCL) that gives a separate output.

Also, this code is split this way for ideological reasons: it is easier to debug and maintain code without singletons. The bugs also showed some small limitations in how the code was handled in the past, and right now the runtime can resolve methods, giving a cleaner implementation.

There are bugs in the TypeClosure logic (maybe I will explain what it means when I have time ;) ), so don't take the top of the Git tree for now if you want working code!

Saturday, April 5, 2014

Development will stall for some time

I encourage you to take the CR sources, play with them, and try to advance them. Anyway, for the next months I will work very little on improving CR: I have little time and I'm split between work, family and other time-consuming issues.

What I've learned is, fairly deeply, how CIL instructions work, and that it is technically possible to move most instructions (if not all) to a C++ implementation, but there will always be bugs and workarounds, and they require time which very few people want to invest (none!?).

Based on this, I can say that I will be glad to support people who write tools or to help them fix bugs, but for now at least I will not work on improving CR myself. If you have some time, please try to add a feature or fix something. Some parts are partially implemented and just need "moving code around" to support more cases: delegates, generics and virtual calls all have at least some code to start from. If they are fixed, a good chunk of .Net 2.0 code will basically work.

That is it for now. I really hope some people will try to clone the repository and improve the code; in that case I will be glad to improve the parts that are needed. Otherwise, for now, the incentive to make it work looks too small.

Wednesday, March 5, 2014

RyuJIT, Mono 3.2.7 and Java 9 (a possible future of stack virtual machines)

First of all, I want to note that a lot of development is happening in the area of managed runtimes, so I want to give my personal take on it. This has nothing to do with the CR project, except that working on CR in my free time gives me more insight into the internals of these runtimes.

So what is known as of today:
- Microsoft is working on a next-gen unified (no more 64/32-bit differences) .Net optimization framework that will ship in an unspecified future version of .Net. From what we know, it has a much better Regex engine (more than 10x faster than the previous one) and around 20% faster JIT startup times.
- Mono improves steadily (it finally has Loop Invariant Code Motion) and it will rely more on Mono/LLVM integration for mobile phones.
- Java has a modular future with Java 9 (very similar to the GAC in .Net), plus possible stack values and a "compact" layout.


So what is the best runtime to invest in today, knowing the next two years of proposed upgrades? As always, the answer is: it depends. But I can try to split them by usage:
- you want to implement a game: I would go with Mono. Mono has SIMD support, while .Net has no announced support for now; the workaround on the MS side is Mono. Even more, Mono supports LLVM as a backend and can compile ahead of time (at least for phone platforms). If you have a weird use case, you can patch Mono and the Mono runtime.
- you want to implement an application: if you want to use native standard controls, use Mono. Mono has native controls on all platforms. You can develop it (fast) with the Visual Studio tools and "migrate" it using, for example, XWT or Eto.Forms controls, so you don't have to redo the interface multiple times
- server side: if all your clients will be in Mono or .Net, I would pick .Net, assuming you can afford the price of Windows Server and/or the pricier Visual Studio editions. If you are budget constrained I would pick Java all the way; Mono is a horrible choice here. The reason is not how fast Mono runs (Mono runs "just" about two times slower) but that Java has a mature development platform for complex problems. Also, Mono tends to be a one-language (maybe two-language) platform, while Java is polyglot: Ruby (and Mirah), Clojure, Kotlin and Scala are top-notch implementations with free IDEs. Even if .Net is theoretically better, the language ecosystem in the Java world today is far more vibrant
- fast-response servers: today Java is the de facto big daddy of fast-response servers. Twitter moved from Ruby to Java. There are also more vendor-neutral resources documenting most of Java's behaviors (DZone, InfoQ, Parleys, etc.).
- native application feel (Windows only): "native" is a word everyone defines differently, so I really dislike the term, but playing by its rules, if we define it as a Win32-looking application, the only solution is .Net with Windows Forms (or maybe Xwt). If we mean Windows with animations, WPF is the only game in town. The DevExpress controls really stand out, and Resharper (on Windows) can help guarantee great code quality for your application too. Open-source solutions do not provide anything that matches WPF for now, but who knows? Maybe a designer for XWT will appear in the future for something like this.

So, given this preamble, here is what I would want to see, per platform:
I. Java
- make a low-overhead PInvoke-like solution. The current overhead of calling native code makes mixed native/Java applications slow, so people tend either to write them in pure Java or to skip Java altogether when a simple native API is involved. The pieces will likely be available around the Java 9 release, as it proposes stack-based types and a compact representation. Even more, packed objects will occupy less memory, which addresses another common criticism of Java.
- make tiered compilation the default (even on 64-bit machines) and ship Jigsaw: Java has run profiles such as Client (fast startup, weak optimizations) and Server (slow startup, strong optimizations). Tiered compilation starts programs in Client mode and later recompiles hot methods in Server mode for more optimization. Even better would be for the Server-mode compilation to happen on a separate thread. This can improve the responsiveness of applications
II. .Net
In fact, it looks to me that .Net is a bit in "maintenance mode"; it seems .Net wants to be what VB used to be: a convenience runtime. This is maybe because .Net conflicted somewhat with C++ development, and Microsoft wants two development modes: the "native-fast" C++ one and a managed one with 80% of C++'s performance. RyuJIT seems to me an effort to simplify the code, which is great, don't get me wrong, but other features are missing and some users miss them, for example a performance mode (there are Server and Desktop modes, but the differences are not in the generated code, only in the GC configuration and in which assemblies are exposed to the application). For me, the possibility of annotating functions to use SIMD, or at least support for Mono.SIMD, would be awesome. Lastly, it would be nice to support a form of full ahead-of-time compilation, I mean an executable that depends on .Net but does no JIT at all. This could enable optimizations that would make .Net viable in some real-world usages (mostly games) where it is mostly avoided today.
III. Mono
For me, Mono as a runtime is the most advanced, even more advanced than Java, as it supports multiple compilation profiles, ahead-of-time compilation, LLVM, intrinsics, and many platforms (phone OSes, NaCl). Even if it sounds picky, for me Mono is missing user-friendliness (even if it is miles ahead of what it was, let's say, 5 years ago). MonoDevelop (Xamarin Studio) + NRefactory (the framework that gives semantic understanding of your code) is really good for the open-source world, yet it is still a bit ignored. Most people use Mono as a backup runtime, not as their primary platform to develop on. Code completion is still a bit buggy as of today.

As for the parts of Mono and .Net to improve, I can name the items I find quite important: escape analysis, plus purity and read-only analysis. Some objects can be proven never to "bleed" their references. This sounds like an unimportant optimization, but it reduces the pressure on the GC, and that is no small deal: if, let's say, 10% of objects are allocated on the stack instead of the heap, that is a speedup with no runtime work needed. Even more, these variables are guaranteed to need no synchronization (which again is a big deal). I would also love an interpreter integrated with adaptive compilation, like Java has: simple methods that are not run often enough would not be compiled at all. This can mean a lot more optimization (maybe compilation happening on a separate thread) but also a fast(er) startup time.
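
To make the escape analysis point concrete, here is a minimal C# sketch (hypothetical code of mine, not from any runtime's sources) of an object that never leaves the method that creates it, so such an analysis could allocate it on the stack and skip the GC entirely:

class Accumulator { public double Sum; public int Count; }

static double Average(double[] values)
{
    // 'acc' is not returned, not stored in a field and not captured by
    // anything that outlives this call, so it never "bleeds" its reference.
    var acc = new Accumulator();
    foreach (var v in values)
    {
        acc.Sum += v;
        acc.Count++;
    }
    return acc.Count == 0 ? 0 : acc.Sum / acc.Count;
}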

Lastly (you can skip this if you are not interested in CodeRefractor per se), as much as I think dynamic compilation is the best approach for managed runtimes, I also think full static compilation is great for some usages. When your application occupies, let's say, 10 MB of memory (I know that is very little given the many GB of RAM today) and you want a fast start time because you just want to do some processing, you will not be happy to see that your application uses double the memory you expect, because the GC memory area adds an extra 5-8 MB and the runtime itself occupies another 5 MB. Using a solution like CodeRefractor you can reduce the memory requirements of your application (if it can compile your application), because no runtime runs alongside your application and there is no GC either. Memory matters less today, and we have been told for many years that memory doesn't matter, but if you want a faster application, you should care about the memory compactness of your data.

Wednesday, February 26, 2014

Technical: Which optimizations to expect from CodeRefractor?

This entry is not a status update, but a look at what CR can do today: which optimizations you should expect CR to do for you, and which it will not do for you.

I will take the most important ones in terms of performance and usage, and I will also try to point out where the problems are.

GC and Escape Analysis

If you look at CR's generated code, when you declare a reference it will typically use an std::shared_ptr (also known as a smart pointer, or reference-counted pointer). This way of freeing unreferenced objects is very common and works fine even for very simple operations, but every time you assign a smart pointer, the CPU has to increment (and later decrement) the reference count.

Escape analysis will remove a lot of unnecessary increments/decrements, but it has some caveats: it works really well for local variables, but it does not do well across function boundaries, especially for return values.

So how do you make your reference-counted program work fast?

First of all, remove unnecessary assignments of references.

Let's take this simple code:
class Point2D { ... public float X, Y; }
class Vector2D { ... public float X, Y; }

Vector2D ComputeVector(Point2D p1, Point2D p2)
{
    return new Vector2D { X = p1.X - p2.X, Y = p1.Y - p2.Y };
}

var v = ComputeVector(p1, p2);

The easiest way to fix this code is simply to define Point2D and Vector2D as structs (value types). Another solution is to recycle the object, as in the following code:
void ComputeVector(Vector2D result, Point2D p1, Point2D p2)
{
   result.X = p1.X-p2.X;
   result.Y = p1.Y-p2.Y;
}

This is good because the usage code can be something like:
var result = new Vector2D();
ComputeVector(result, p1, p2);
Console.WriteLine(result.X);
Console.WriteLine(result.Y);

If this were your final code (or a version like it) and result were not used anywhere else, the result variable would be declared on the stack, avoiding not only smart pointers but also the memory allocation overhead.

Constants and propagation of values

CR will extensively evaluate every constant and simplify everything as much as it can (barring bugs, of course). This means that if you use constants in your program, they will be propagated all over the body of the function. So, try to parameterize your code using constants where possible. Also, if you use simple types (like int, float, etc.), I recommend creating as many intermediate variables as you want: CR will happily remove them, at no cost to the programmer. So, as long as you are not working with references to heap objects (reference types), use as many variables as you like. The compiler will also propagate values as deep as possible into the code (there are still small caveats), so if you keep a value just for debugging purposes somewhere, keep it; it makes your code readable, and that is more important.
Constants will not only simplify formulas, but will also remove branches in if or switch statements (if their conditions can be proven constant). This is good to know for another reason: if you have code like if(Debug) ... in your program, you can have the peace of mind that this if statement will not be evaluated anywhere in the generated code (see the sketch below).
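
As a small sketch of the pattern (hypothetical code, the names are mine), assuming CR's constant folding and branch removal kick in as described:

const bool Debug = false;            // a compile-time constant

static int Compute(int x)
{
    int doubled = x * 2;             // intermediate locals of simple types
    int answer = doubled + 1;        // are folded away for free
    if (Debug)                       // provably false, so the branch
        Console.WriteLine(answer);   // (and the call inside it) is removed
    return answer;
}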

Purity and other analyses

Without turning this into a functional programming talk, the compiler checks for some function properties; the most important of them are: Purity, ReadOnly, IsGetter, IsSetter.
IsGetter and IsSetter are important so those methods can be inlined every time you use an auto-property.
A pure function is one that takes just simple types as input (no strings for now, sorry), returns a simple value, and in between does not change anything global. A read-only function can depend on anything (not only simple types) and also does not change anything global, but it returns a simple type.
If you write your code using pure functions everywhere, you get the luxury that the compiler will do a lot of optimizations with them: when you call them with constants, it computes the result at compile time.
So when you write Math.Cos(0), it will be replaced with 1.0; even more, it will merge many common expressions, even function calls.
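
A short sketch of what purity buys you in practice (hypothetical code of mine; the detection is done by the compiler's own analysis, not by any annotation):

// Pure: simple-typed inputs, a simple result, and no global state touched.
static double Hypotenuse(double a, double b)
{
    return Math.Sqrt(a * a + b * b);
}

// Called with constants only, so a compiler that proves Hypotenuse pure
// may fold the call to 5.0 at compile time, just like Math.Cos(0) -> 1.0.
double d = Hypotenuse(3.0, 4.0);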

Unused arguments (experimental)

When you write your code, you want to make it configurable, so you may add many arguments, and some of them will not be used at all. The latest (Git) code removes the unused arguments. Remember the part about using constants everywhere? If you call a function with a constant, and this is the only call, the constant is propagated into the call and then the argument is removed.

This is as efficient as an #if directive, but done by the compiler with no work on your side (apart from setting things up to enable or disable debug, or whatever flags you have in your program).
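
A hypothetical sketch of that flow (the names are mine): the flag is passed as a constant at the only call site, the constant is propagated into the body, the dead branch disappears, and then the now-unused parameter itself can be dropped:

static int Sum(int[] values, bool verbose)
{
    int total = 0;
    foreach (var v in values)
        total += v;
    if (verbose)                     // with 'false' propagated in, this
        Console.WriteLine(total);    // branch becomes dead code
    return total;
}

// The single call site passes a constant, so after propagation the
// 'verbose' parameter is unused and can be removed from the signature.
int s = Sum(new[] { 1, 2, 3 }, false);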

Conclusions

There are other optimizations, but many of them are low-level and not so practical to rely on. In short, here is a small list of things you should do to improve the performance of CodeRefractor-generated code:
- when you need to change a variable and that variable is a result, pass it as an external parameter
- use constants everywhere for configuring your runtime
- use local variables of simple types as often as you want
- try to make functions that don't change too much state; make them small and with one purpose (like a computation). They will be optimized really well, and entire calls may be removed if you write small functions
- make branches configurable with simple-type parameters. The compiler will speed up the code in some cases if it can prove that your configuration variables are set to specific values.

Thursday, February 20, 2014

Target 0.0.3: Status 3

There is a reason you are seeing many status updates in this period: I had a fairly light workload at work and had time to think about some fine things to add to CR. In the future I cannot guarantee the same frequency or speed. The best way to make CR develop fast is to contribute ;)

So, what is done and noteworthy:
- type-id support and a trivial (still crashing) virtual method call implementation. People wanting virtual methods can go into the implementation and fix the remaining things (or keep waiting): making virtual functions with more than one parameter work, taking escape analysis into account, and things like that
- type closure is more robust: types are sorted by dependencies, and methods and types are discovered at the same time. It is now possible to have multiple entry methods, at least at the code level. This can make it possible in the future to support DLL/lib code generation, or to make sure that all the needed methods are added statically
- added an "unused" escape state for variables; in the past variables could only be: stack, pointer and smart pointer. This is very important as it can (in the future) detect that a function argument is not used, so the compiler can remove the corresponding usages and make the call fast(er). More work will definitely happen in this area.
- made a unified way of keying/identifying methods: this is as important as type-id for me. Every time a function was called, some strange code checked whether the method is a runtime method, a generic method, etc. Even worse was the comparison between overloads. With this change I expect that generics will be generated from the original prototype (a clone for every static specialization, with no C++ templates used).

Basically, this status means there is a big base to improve in CR, and the more fixes there are, the better you can expect your program to work. So try to look into the code and make simple fixes.

What I expect to work on in the future:
- try to add a program-wide kind of optimization. Until now, all of CR's optimization passes take just one function as input, but some cases are handled better if the optimization has global visibility. Such an optimization could, for example, remove a virtual call if it knows there is just one implementation globally.
- look into Qt's Core runtime. I worked with Qt 4.x (x being a small number) in the past and my experience was very good. Qt could be a great base to provide a good implementation of some operations like directory navigation, string encodings, and things like that. It would be really ugly to implement them from scratch in C++: there is no standard multi-platform implementation I am aware of that handles them so well