Wednesday, February 26, 2014

Technical: Which optimizations to expect from CodeRefractor?

This entry is not a regular status update, but a look at the state of CR today. So: which optimizations should you expect CR to do for you, and which will it not do?

I will cover the most important ones in terms of performance and usage, and I will also point out where the problems are.

GC and Escape Analysis

If you look at the code CR generates, declaring a reference typically produces an std::shared_ptr (also known as a smart pointer, or reference-counted pointer). This way of freeing unreferenced objects is very common and works fine even for very simple operations, but every time you assign a smart pointer, the CPU has to increment (and later decrement) the reference count.

Escape analysis removes a lot of these unnecessary increments/decrements, but it has some caveats: it works really well for local variables, but it does not do well across function boundaries, and especially not for return values.

So how do you make your reference-counted program run fast?

First of all, remove unnecessary assignments of references.

Let's take this simple code:
class Point2D { ... public float X, Y; }
class Vector2D { ... public float X, Y; }

Vector2D ComputeVector(Point2D p1, Point2D p2)
{
    return new Vector2D { X = p1.X - p2.X, Y = p1.Y - p2.Y };
}

var v = ComputeVector(p1, p2);

Nothing looks wrong here, but the new Vector2D escapes through the return value, so every call allocates a heap object behind a smart pointer. The easiest fix is simply to define Point2D and Vector2D as struct (value types); a sketch of that variant closes this section. The other solution is to recycle the object, as in the following code:
void ComputeVector(Vector2D result, Point2D p1, Point2D p2)
{
    result.X = p1.X - p2.X;
    result.Y = p1.Y - p2.Y;
}

This is good because the usage code can be something like:
var result = new Vector2D();
ComputeVector(result, p1, p2);
Console.WriteLine(result.X);
Console.WriteLine(result.Y);

If this is your final code (or a version like it) and result is not used afterwards, the result variable will be allocated on the stack, avoiding not only smart pointers but also the memory allocation overhead.
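
And here is the struct variant mentioned above, as a minimal sketch (the only change from the first snippet is the struct keyword; with value types, ComputeVector returns by value, so no heap allocation or reference counting is involved):

struct Point2D { public float X, Y; }
struct Vector2D { public float X, Y; }

Vector2D ComputeVector(Point2D p1, Point2D p2)
{
    // Vector2D is a value type now: this "new" does not touch the heap,
    // the result is simply copied back to the caller.
    return new Vector2D { X = p1.X - p2.X, Y = p1.Y - p2.Y };
}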

Constants and propagation of values

CR extensively evaluates every constant and simplifies everything as far as it can (barring bugs, of course). This means that if you use constants in your program, they will be propagated all over the body of the function. So try to parameterize your code using constants where possible. Also, if you use simple types (like int, float, etc.), I recommend creating as many intermediate variables as you want: CR will remove them at no cost to the programmer. So, as long as you are not working with references to heap objects (reference types), use as many variables as you like. The compiler will also propagate values as deep as possible into the code (there are still small caveats), so if you keep a value somewhere just for debugging purposes, keep it; it makes your code readable, and that is more important.
Constants will not only simplify formulas; they will also remove branches in if or switch statements (when the condition can be proven constant). This is good to know for another reason: if you have code like if(Debug) ... in your development build, you can have peace of mind that this if statement will not be checked anywhere in the whole program.
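
A small illustration of this (Debug here is just a hypothetical configuration constant):

const bool Debug = false;

static int Compute(int x)
{
    if (Debug)                          // provably always false...
        Console.WriteLine("x = " + x);
    return x * 2;                       // ...so only this line survives optimization
}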

Purity and other analyses

Without turning this into a functional programming talk: the compiler checks functions for some properties, the most important of which are Purity, ReadOnly, IsGetter and IsSetter.
IsGetter and IsSetter are important for inlining every access to an auto-property.
A pure function is a function that takes just simple types as input (no strings for now, sorry), returns a simple value, and in between changes nothing global. A read-only function may depend on anything (not only simple types) and also changes nothing global, but returns a simple type.
If you write your code using pure functions everywhere, you get the luxury that the compiler can do a lot of optimizations with them: when you call them with constants, it computes the results at compile time.
So when you write Math.Cos(0), it will be replaced with 1.0; even more, the compiler will merge many common subexpressions, even across function calls.
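
For example, a function like this qualifies as pure (DistanceSquared is a hypothetical example, not CR runtime code):

static float DistanceSquared(float dx, float dy)
{
    // Pure: simple-type inputs, a simple-type result, no global state touched.
    return dx * dx + dy * dy;
}

// Called with constants, so CR may fold the whole call at compile time:
float d = DistanceSquared(3.0f, 4.0f);   // can be replaced by the constant 25.0f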

Unused arguments (experimental)

When you write your code, you want to make it configurable, so you may add many arguments, and some of them will end up not being used at all. The latest (git) code removes the unused arguments. Remember the advice about using constants everywhere? If you call a function with a constant, and that is the single call site, the constant is propagated into the callee, and after that the argument is removed.

This is as efficient as an #if directive, but done by the compiler with no work on your side (excluding the part where you set up the environment to enable or disable debug, or whatever flags you have in your program).
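
A hypothetical before/after sketch under the assumptions above (a single call site, passing a constant):

// Before: 'verbose' is passed only here, and always as false.
static int Sum(int a, int b, bool verbose)
{
    if (verbose)
        Console.WriteLine("summing");
    return a + b;
}
// ... int s = Sum(2, 3, false);

// After constant propagation and dead-branch removal, CR can compile
// it as if you had written:
static int Sum(int a, int b)
{
    return a + b;
}
// ... int s = Sum(2, 3);   // constant folding may reduce this further to 5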

Conclusions

There are other optimizations, but many of them are low-level and not so practical to target directly. In short, here is a small list of things you should do to improve the performance of Code-Refractor-generated code:
- when you need a function to change a variable and that variable is its result, pass it in as an external parameter
- use constants everywhere for configuring your runtime.
- use local variables of simple types as often as you want.
- try to make functions that do not change too much state; keep them small and with one target (like a single computation). They will be optimized really well, and entire calls may even be removed if you write small functions
- make branches configurable with simple-type parameters. The compiler will speed up the code in some cases if it can prove that your configuration variables are set to specific values.

Thursday, February 20, 2014

Target 0.0.3: Status 3

It seems you are seeing many status updates in this period, for a reason: I had a fairly light workload at work and had time to think about some nice things to add to CR. I cannot guarantee the same frequency and speed in the future. The best way to make CR develop fast is to contribute ;)

So, what is done and note-worthy:
- type-id support and a trivial (still crashing) implementation of virtual method calls. People who want a working virtual method implementation can go into the code and fix the remaining things (or keep waiting): making it possible to call virtual functions with more than one parameter, taking escape analysis into account, and things like this
- type closure is more robust: types are sorted by dependencies, and methods and types are discovered at the same time. It is now possible to have multiple entry methods, at least at the code level. In the future this can make it possible to support DLL/lib code generation, or to make sure that all needed methods are added statically
- added an "unused" escape state for variables; in the past a variable could only be stack, pointer or smart-pointer (see the sketch after this list). This is very important because it can detect (in the future) that a function argument is not used, so the compiler can remove the usages at the call site and make the call fast(er). More work will definitely happen in this area
- a unified way of building keys to identify methods: this is as important as type-id for me. Every time a function was called, some strange code checked whether the method is a runtime method, a generic method, etc. Even worse was the comparison between functions as overloads. With this change I expect generics to be resolved from the original prototype (a clone for every static specialization, with no C++ templates used).
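
As a sketch, the escape states now look roughly like this (the enum name and members are illustrative, not necessarily the actual CR code):

enum EscapingMode
{
    Unused,       // never read: its usages can be dropped entirely
    Stack,        // does not escape: plain stack allocation is enough
    Pointer,      // escapes, but ownership does not change: a raw pointer suffices
    SmartPointer  // ownership is shared: full reference counting is needed
}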

Basically, this status means that there is a big base in CR to improve on. And the more fixes there are, the better you can expect your program to work. So try to look into the code and make simple fixes.

What I expect to work on in the future:
- try a program-wide kind of optimization. So far all optimization passes of CR take just one function as input, but some cases are handled better when the optimization has global visibility. Such an optimization could remove a virtual call if it knows that there is just one implementation globally.
- look into Qt's Core runtime. I worked in the past with Qt 4.x (x being a small number) and my experience was very good. Qt could be a great base for a good implementation of operations like directory navigation, string encodings, and things like that. It would be really ugly to implement them from scratch in C++: I am not aware of any standard multiplatform implementation that handles them so well

Saturday, February 15, 2014

Target 0.0.3: Status 2

The CR backend's C++ code generator has been extracted into its own component. This reduces the logical dependencies and simplifies some sections of the code. There are also improvements to make generics generate templates.

Anyway, this cannot be done in all cases, so in the future I will have to find a workaround. In short, the problem is the following: say we want to implement a class like MyData&lt;T&gt;, where T can be either a struct or a class (reference type). If T is a class type, it needs different logic than a struct type. The reason is that, in the future, reference types have to carry extra information to identify them, a "typeId" integer. MyData&lt;int&gt; needs no typeId, while MyData&lt;T&gt; for a reference type T may need to keep the typeId of the specific type. Even if we share the typeId field and make it a no-op in the struct case, that still does not give a nice solution when MyData&lt;T&gt; has the new() constraint.
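
To make the conflict concrete, here is the situation in C# terms (MyData is the hypothetical container from above):

class MyData<T>
{
    T _value;   // for T = int: plain storage, no type identification needed;
                // for T = string (a reference type), the generated C++ object
                // also has to carry a typeId so is/as and virtual calls work
    public void Set(T value) { _value = value; }
    public T Get() { return _value; }
}

// A single C++ template cannot easily serve both instantiations:
var ints = new MyData<int>();      // no typeId wanted here
var strs = new MyData<string>();   // typeId (and reference counting) required here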

In short:
- there will be a future improvement that makes generics work with templates, but today's generics code will generate incomplete types (so you will get C++ compiler errors). In fact, many of these errors are of medium complexity, and fixes are always welcome
- the code is separated to make it easier to understand how the C++ code is generated; this component is now a smaller DLL

As a future improvement I expect to (incompletely) implement the TypeId system: all classes defined as reference types will get an int that uniquely identifies their type. This implementation detail is critical for making some instructions work: is/as, boxing/unboxing and virtual/interface calls.

Update: CodeRefractor works with Linux/Mono (I tested with openSUSE, but other distros work nicely too). If you use Linux you basically have to do the following:
- make a clone of the Git repo
- Enable this repository on Suse, Fedora, or any supported distro: http://software.opensuse.org/download/package?project=home:tpokorra:mono&package=monodevelop-opt (or build a recent Mono/MonoDevelop by yourself, at least 4.0.x)
- open CodeRefractor.sln and run it under "Debug | Mixed platforms"
If you prefer the command line, after the initial build is done you can use it with:
mono cr.exe <any Mono/.Net exe>
I am really impressed by the quality of the Mono tools. And as for you: if you wanted to work on CR but never knew that it runs on Linux, now you know!

Saturday, February 8, 2014

Target 0.0.3: Status 1

One, two, three, done!



The compilation step is much faster. Optimizations can even run on multiple cores (by uncommenting some lines of code). For example, generating the code for the OpenGL application I presented earlier used to take some 600-700 ms on an Intel i5-540M. Right now CR generates the code in around 170-220 ms (the first run is slower, as spinning disks are slow on first access). By all measures it is much faster.

Most optimization steps that CR performs do the following: look for a code pattern that matches a property; if it matches, perform the optimization by rewriting some instructions and notify the compiler that changes were made. CR then re-runs all optimizations up to the point where no more optimization can be done. In a typical case CR will apply (using the default codebase optimizations) 35+ optimization steps for every function, every instruction, etc.

In a typical case, say there is just one optimization that can be done, for example: a variable is used nowhere, so it can be removed. Before noticing that the variable can be removed, the compiler performs the other optimizations, and right after that it runs another round of pattern matching. Finally, when the unused-variable optimization matches, CR tracks, over every instruction, all declared variables and compares them with all used variables. Whatever remains at the end can be removed.
But as one can notice, the step that makes that one optimization succeed is based on some common knowledge:
- variable usages per instructions
- the instruction kind: an optimization that removes a declared-but-never-used variable has to take into account whether the instruction is a call to a function or a simple math operation
- jumps: some optimizations only work on a sequence of instructions with no branches or jumps, so jumps and labels are important delimiters for these instruction ranges

Based on this, every time an optimization is performed, the optimization framework recalculates this information so it can be reused. So if the first optimization step needs to check variable usages and performs no change, the second optimization can use the same usage data, which stays correct as long as nothing changed.
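
A minimal sketch of such a pass, over a hypothetical instruction model (none of these type or member names are the actual CR API):

using System.Collections.Generic;

// Hypothetical model: each instruction knows what it defines and what it uses.
class Instruction
{
    public string DefinedVariable;                          // null if nothing is defined
    public List<string> UsedVariables = new List<string>();
    public bool IsCall;                                     // calls may have side effects
}

static bool RemoveUnusedVariables(List<Instruction> body)
{
    var used = new HashSet<string>();
    foreach (var instr in body)
        foreach (var v in instr.UsedVariables)
            used.Add(v);

    // Drop definitions that nobody reads, but never drop calls (side effects).
    int removed = body.RemoveAll(i =>
        i.DefinedVariable != null && !i.IsCall && !used.Contains(i.DefinedVariable));

    return removed > 0;   // tells the framework whether usage data must be recomputed
}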

Some Q&A:
- even 200 ms (for a small application) is fast, and most time will still be spent inside the C++ compiler, so why bother? Because it is not always so clear-cut. Also, with a lower optimization level on GCC, or a fast(er) C++ compiler, it starts to matter
- 200 ms is still a lot (for a 40K application); can it not be reduced to 50 ms (CR does not seem to do that much)? In short, CR does more than optimizations: it tries to understand the code and map it to an intermediate representation, and after the optimizations are done (which is what this post is about), CR writes the code to disk. CR also has some inefficiencies in its design: for example, most CIL instructions are compared using string operations (for clarity), not instruction IDs. Using instruction IDs would speed things up by some (few) milliseconds
- compilation speed was never an issue for me, so why bother? CR is intended to be an optimizing compiler/VM. And even though most people have fast computers, some people do not; many people develop on an Atom CPU. Also, a fast compilation step means those 500 saved ms can be spent on more optimizations in the future

Tuesday, February 4, 2014

Skipping release 0.0.2 - jumping to 0.0.3

Looking at the Git commits, you will see quite some activity in the last week. The key features proposed for 0.0.2 are all reached (to some extent), if you follow the Git repository:
- better field matching
- auto-properties do work
- better delegates and generics (the top of the Git tree is broken right now; see why in the 0.0.3 feature set)
- fixes in some cases like math with various sizes
- most optimizations (especially escape analysis) work over more code patterns

As many programmers know, fixing code takes time, and I took this time. But I will not pack an installer for 0.0.2: I found that some features are important to implement in a different way, and there is no support (or request) yet for a release package, so why do it?

The changes of 0.0.3 will be:
- visibly faster compilation: use-def information will be computed once and updated on changes, not recomputed on every optimization. This code is already in the Git repository
- more fixes (of course expect a breakage in the first time)
- the Dynamic-Invoke mechanism should be more fleshed out
- generics will fully use templates <T> instead of specializations. This translates into a single compilation of each function (on the CR side). If you define a class List<T>, the old code would create List_Int32, List_String, etc.
- expected: a full inter-procedural optimization that tracks function calls over the whole program. This will allow some very powerful optimizations of the kind used today in GCC's LTO. This will only be done if it is guaranteed that the compilation speed stays improved.

As for this release I would expect a "release" time around May/June 2014.

If you want to get more features, you are encouraged to add every small fix. They really add up and can make users enjoy using CR.

So, in short, if you think that CR can match your needs in the future, and even if you find it a bit overwhelming, try to add some fixes or a new feature, like:
- improve command line flags: to support more compilers/OSes. Many fixes are really one or two days of work (maybe less)
- try to take an application from simple to complex and see where CR crashes, then try to add support for that specific instruction. It is much easier to follow one instruction than to look at entire cases. Try disabling all optimizations first (by commenting them out, before looking at how to set a lower optimization level)
- add support for a .Net class using the Dynamic-Invoke support. For example, add File.ReadAllText(...) and make it work at least for some basic code. After that, add File.ReadAllLines(...)
- try to make an OpenGL application and see what is not supported (you have a sample application in the Git repository)
- try to add an optimization step: write your own class that checks for a condition over the instruction set. For example, if the code divides by 0.5, replace it with a multiplication by 2.0 (see the sketch after this list). Try to add more optimizations like this.
- automate builds and make yourself a packaged installer
- make a MonoDevelop/VS plugin to automate launching of CR against a project
- anything else (which is really welcome)
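
As a sketch of the divide-by-0.5 step, here is one way it could look over a hypothetical representation (the type and member names below are illustrative, not the actual CR instruction model):

// Hypothetical representation of "dest = left / right" where right is constant.
class BinaryOp
{
    public string Operator;        // "div", "mul", ...
    public object RightConstant;   // non-null when the right operand is a constant
}

static bool DivideByHalfToMultiply(BinaryOp op)
{
    // x / 0.5 is the same as x * 2.0, and multiplication is usually cheaper.
    if (op.Operator == "div"
        && op.RightConstant is double
        && (double)op.RightConstant == 0.5)
    {
        op.Operator = "mul";
        op.RightConstant = 2.0;
        return true;   // report the change so the framework re-runs the passes
    }
    return false;
}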

CR is free for use (free as in price and freedom) and you are free to customize it for your future needs.