Tuesday, December 24, 2013

Target 0.0.2: Status 4

The previous post mentioned that there were many lurking errors in the optimizations. I'm happy to say that these errors were fixed (there are still occasional optimization failures, but the "canonical" NBody program is working again with the tip revision on GitHub).

The optimizations were also profiled, so compilation should be 10-15% faster than before (the CR part, that is; most of the time is still spent inside the GCC compilation).

CodeRefractor.Compiler.exe was renamed to cr.exe so it is easier to use from the command line.

For future development I will look into two areas:
- fix the last known remaining bugs (the inliner generates invalid code for some properties, and it doesn't optimize out empty methods)
- I would love to extend the way the generated code is backed: it would be great to provide a "resolver" that helps the linker with custom code.
The idea is that implementing methods can be delegated to a solver module (see the sketch after this list):
- when a method is found by the compiler, before reflecting on it, the compiler asks the solver module whether it has a specific implementation, which can be either C++ code or a CIL method that CR will reflect against
- if it is C++ code, it will be inserted as-is as the method body
- if it is CIL code, it will be scanned and used later
- similar logic will work for types.
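A hypothetical sketch of what such a solver module could look like; the interface name and method signatures below are my illustration, not CR's actual API:

    public interface IMethodSolver
    {
        // Returns C++ source to insert as-is as the method body, or null if
        // this solver has no special implementation for the method.
        string GetCppImplementation(System.Reflection.MethodBase method);

        // Returns a CIL method that CR should reflect against instead,
        // or null to fall back to the normal reflection path.
        System.Reflection.MethodBase GetCilReplacement(System.Reflection.MethodBase method);
    }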

Why this bootstrapping, and who will implement it?
Initially OpenRuntime (the default runtime), but it can be attached to any assembly you code against. So every time an assembly is loaded, CR will scan it for this bootstrap mapper.

Advantages? Right now the bootstrapping to C++ is fully static. This means that if you want to implement a big subsystem (say, a replacement for System.Console or Windows.Forms) you cannot point to an existing implementation; you have to take one source and fully annotate it with the [MapType] attribute (CR's attribute of choice for annotating custom implementations). With this new mechanism, plus some short reflection code (which can be isolated into a helper class shipped with these APIs), it should theoretically be possible to reuse an existing implementation without annotations.

Also, it could make it possible to parametrize some C++ code generator options. Right now the generated code is static, so you cannot write a Max function independent of its parameter types (like a generic Max), and I hope that with these changes the code will be better formatted and generated.

The idea is not my invention; it is based on brief descriptions of the invokedynamic instruction in the Java world. I deliberately did not read the specification (or implementation details) in order to keep this a clean-room implementation.

Friday, December 20, 2013

Target 0.0.2: Status 3

As I've said previously, some optimizations interact badly in ways that are hard to track down. If you face problems generating code, you should disable optimizations or try to reduce the failing case and file a bug report.

Given this, I fixed one for which I could reduce the case. In the end, I still recommend that users reduce their failures to small cases and report them as bugs.

The MSIL and C++ code can be seen side by side, even when the compilation fails.

A snippet of the output code of a prime-number program:
// IL_0049: ldc.i4.1
// IL_004a: stloc.0
// IL_004b: nop
// IL_004c: ldloc.1
vreg_32 = local_1;
// IL_004d: ldc.i4.1
// IL_004e: add
vreg_34 = vreg_32+1;
// IL_004f: stloc.1
local_1 = vreg_34;
label_80:
// IL_0050: ldloc.1
vreg_35 = local_1;
// IL_0051: conv.i8
vreg_36 = (System_Int64)vreg_35;
// IL_0052: ldarg.1
// IL_0053: cgt
vreg_38 = (vreg_36 > vreg_37)?1:0;

This way you can see whether any optimization interacts badly with your CIL. Also, with fewer optimizations the C++ code reflects the CIL more closely; the more optimizations are enabled, the further the CIL and the C++ code drift apart.

The Int64 (aka long) type is also supported now, both in instructions (conv.i8, ldc.i8) and in optimizations (constant folding).
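A minimal example (my own, not taken from CR's tests) of the kind of long arithmetic that the new constant folding should reduce at compile time:

    static long SecondsPerWeek()
    {
        long secondsPerDay = 60L * 60L * 24L; // foldable to 86400
        return secondsPerDay * 7;             // foldable to 604800
    }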

Tuesday, December 17, 2013

Opinion: Talking ARM

AnandTech (a technology site) invited people to submit questions about ARM's Cortex-A53 core, the 64-bit successor of the A7 core (which is 32-bit and widely used in mobile phones today).

The lead designer was kind enough to answer them yesterday.

What is the best part for me?
They talked about the fact that ARM scales both down and up. This means that a lot of software will need to stretch across many kinds of devices; I expect it to show up in microservers and set-top boxes.

The high-end ARM CPUs can have an L3 cache in designs of up to 32 cores (AMD's 8-16 core CPUs were considered high end, and Intel ships similar core counts). It also looks like ARM pushes optimized GCC libraries and is a contributor to GCC.

What does this mean for CodeRefractor? Or for C#?
In the short run, not much. The good part of CR, in my view, is that optimized C++ runs correctly on most platforms, from low-end ARM to Asm.js, and it runs fast everywhere.

In the long term this means that most things written in C# could work on more hardware than the CLR runs on; and even where the CLR does run, you would have to deal with many different VMs. With CodeRefractor's C++ output as a target, you can make a single simple build that works everywhere.

Saturday, December 7, 2013

Target 0.0.2: Status 2

This is a small update (and the following ones will also be somewhat small as the New Year holidays approach) but an important one:
- after the latest changes I noticed that some optimizations interact badly, and I did not have time to debug them all. So for now a smaller subset of optimizations is enabled, and I would be really glad if someone had time to look into why the others are failing. Optimizations are really important for making programs fast, but the hardest part is getting every simplification and analysis right
- after the bad news, there is also good news: for derived classes whose data is spread across several classes (say a class CoordinateXyz and a class Vector : CoordinateXyz), the fields are now laid out in a compatible binary layout (see the sketch below). In the past, the fields of the base class were placed after the fields of the derived class. The field-analysis logic was extracted into a class named TypeDescription
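A minimal illustration of the kind of hierarchy this refers to (the field names are made up for the example); the point is that the base-class fields now come first in the emitted layout:

    class CoordinateXyz
    {
        public double X, Y, Z;   // base-class fields: now laid out first
    }

    class Vector : CoordinateXyz
    {
        public double Length;    // derived-class field: laid out after X, Y, Z
    }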

Thursday, November 21, 2013

Target 0.0.2: Status 1

Three small albeit important updates were made in the GitHub repository:
- auto-properties:
    class Test
    {
        public int Value { get; set; }
    }
 Properties were working in previous releases, but auto-properties generated invalid names for C++; this is now fixed.

- static constructors were skipped:
    class Test
    {
        public static int Value { get; set; }
        static Test()
        {
            Value = 2;
        }
    }
In the past the static constructor was skipped. Right now the code is generated.

- constant fields are not defined as a part of the final code:
    class Test
    {
        public const int Total = 2;
    }
In CR 0.0.1 the Total field was part of the memory footprint of a Test class instance. Now constants are removed, reducing memory usage and making it consistent with .Net behavior.

Monday, November 11, 2013

Roadmap for 0.0.2

As all development so far has been done by me, and as far as I can see CR still needs improvement, I will go over the planned features. I see several parts that need work, but as always, the way I see to improve CR is to improve the overall quality of the project.

First of all, let's go over what CR misses (and what is a target for 0.0.2):
- generics are very limited. I would like to see more generics cases and make them work. Some code with generic classes is there, but not all combinations work;
- some commits were made just after 0.0.1 in the Git repository, and I think there will be primitive delegate support;
- the current optimizer is inter-procedural but there is no global variable pool. I hope to add support for one and try simple optimizations around it. In short, if you declare a global static int/double, even if you don't mark it as const, CR will try to infer that. If a global variable is not used, CR should remove it (this is important as CR doesn't support reflection or anything of that sort);
- handle fields better: if you use a class where the base class names a field the same as the derived class, CR will "merge" them incorrectly (see the example after this list);
- it would be great for instance objects that are created but never used to be removed (this is in fact a trivial optimization, but it has to take extra care in cases where static constructors initialize state);
- try to compile a target application: an SDL/OpenGL application, and fix all the blockers found along the way.
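A minimal illustration of the field-merging bug mentioned above (class and field names are made up for the example):

    class Base
    {
        protected int value;     // base-class field
    }

    class Derived : Base
    {
        private int value;       // shadows the base field; CR currently
                                 // "merges" the two into a single field
    }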

What is not yet a target, and where I encourage anyone interested to pick up tasks that are useful but not interesting for me right now:
- better command line handling: CR supports switching the runtime by changing the assembly, switching the C++ compiler, etc. It would be really great if someone made consistent and nice command line handling
- integrate CR with the C# (or VB.Net) compiler: create a small tool that invokes the C# compiler first (CSC.exe on Windows or MCS.exe on Linux/OS X) and then CR, transparently for the user
- support Linux/OS X, 32/64-bit differences, ARM/MIPS CPUs: "just" checking whether CR works on other platforms, including various compilers, takes time that I basically don't have. If you feel inclined to support a platform and need my (minimal) support to set it up, I will be glad to give that minimal help to get it running. I will gladly include patches supporting various compilers or configurations; for example, if you want to use boost::shared_ptr instead of std::shared_ptr and you write the patches on your end, I will be glad to include them upstream, but I'm not interested in supporting anything other than my own machine myself
- better support for the mapped runtime: add a complete String class, List<T>, Dictionary<K, T>, etc.
- VTable support, exceptions, Reflection support, Linq and lambdas, and the "if you implement this I would use CR..." kind of stuff. The reason is simple: I either don't have the time or the interest to support these (most likely both), and I've noticed many times that fixing a few small items at a time makes some parts work by themselves; support for properties works now, for example, and even without a VTable, better static code analysis can remove the need for VTables for at least some usages. Would it be great to have them? Yes, as long as a developer (aka "you" - the community) adds support for them.

As a timeline, I hope 0.0.2 will be released around March or April 2014, but it may be earlier. From time to time (in the past roughly bi-monthly) I will write status reports prefixed "Target 0.0.2: Status ..." describing the development tracked in Git. They are worth following if you are technically inclined and want an "internals" digest.

Friday, November 8, 2013

Why not NGen?

A legitimate question, and I want to answer it at least from my point of view:
An honest question: what is the value of your tool over NGEN?
First of all, as I wrote just one blog entry earlier, I don't believe in "Native vs Managed", because people take these words to mean different things every time:
- if people take managed to mean safe, then the C++ STL, or Qt's Tulip containers and QObject, form a "managed" subset that keeps most of what .Net offers (basically bounds checking, strict typing, generics, somewhat consistent automatic freeing of memory, etc.);
- if people take native to mean "close to the metal", then even .Net compiles all methods to native code as they are executed, so excluding the first seconds of startup, all .Net applications are native.

Given that, I still see a tool like CR as useful: it has a compilation profile similar to C# (and there is a lot of C# tooling), but an execution profile of C++ with LTO (link-time optimization). As far as I can tell, CR will always support a subset of what .Net offers, so for people thinking of removing .Net (or Mono) even as CR's toolset, I can't see that happening any time soon.

At the same time, even though I see CR as "never catching up with" .Net, it will still be a great tool, and in some use cases I can see it going beyond what .Net can offer.

Let's take a medium-sized program that I think would be a great case for CR (at a version like 0.5 or 1.0): a developer writes a C#/SDL/OpenGL game and wants to run it on a slower device (let's say a Raspberry Pi, but the exact device doesn't matter much). First of all, he or she will improve the OpenGL calls; second, they will try to improve the execution profile using Mono.

With Mono, they will first notice that the application starts a bit slowly. Also, some math routines are suboptimal for Mono's JIT. There are two options: run Mono in AOT mode or use the LLVM JIT. With the LLVM JIT, startup is even slower. AOT mode loses a lot of performance (Mono's --aot mode reserves a CPU register for generating PIC code). In the end, they will notice that the game has small hiccups because the SGen GC occasionally makes it skip a frame.

With CR, things are a bit different: there is no need to set up anything other than the optimization level. Since the developer may be willing to wait even, say, half a minute for the final executable, they will pick the highest optimization level. This means that many global operations happen: devirtualization, inlining and removal of dead global code, constant merging over the entire program, etc. The code will not be PIC, it will use all registers, and the optimizer can be as good as C++ compilers are at that moment. Because the code uses reference counting, pauses are much smaller (no "stop the world" needed), and there are optimizations (already in CR's codebase today) to mitigate the reference-count updates (CR uses escape analysis).

Some problems remain for the user: since CR uses reference counting, the developer has to watch for memory cycles. On the other hand, these cycles are easier to find, not because CR does anything special, but because today's C++ tools find memory leaks really easily (and CR names the functions the same as their original C# names). In fact, it is easier to find a leak with ref-counting than a GC leak: start Visual Studio with the Debug configuration of the generated C++ code, and at program exit all leaks are shown in the console.

Finally, CR can add as many things as the developer community contributes, because CR is written in C#; it is easier to handle high-level optimizations there than it would be to hack them into the Mono runtime (which is C) or into .Net (which is impossible, as the components are not open for modification). One class of optimization that can be made explicit and requires much less coding work is marshaling and PInvoke (an area I would really love CR to improve). When you call a method in a DLL/.so, in .Net (or in Java for that matter), there is some extra "magic": pointers are pinned, and conversions between pointers occur. In contrast, it should be possible for this marshaling (and certainly the pinning) to be removed altogether in some cases, for example if the developer knows the program uses OpenGL32.dll (or libGL.so) and links directly with -lOpenGL32 via a special compiler flag. This is not a big win for some libraries but is big for others, because it avoids an indirect call.
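For illustration, this is the kind of PInvoke declaration the paragraph above is about; in a hypothetical direct-link mode, CR could emit a plain call to glClear (resolved by the linker via -lOpenGL32) instead of going through LoadLibrary/GetProcAddress. The flag and the behavior are my assumption of how such a feature might look, not something CR does today:

    using System.Runtime.InteropServices;

    static class Gl
    {
        // Standard .Net PInvoke: normally resolved at runtime and called indirectly.
        [DllImport("opengl32.dll")]
        public static extern void glClear(uint mask);
    }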

So in short, think of CR as a C#-only VM that takes its time compiling CIL. In the end it outputs optimized C++ that can be optimized further by modern compilers. It is easy to hack on (according to Ohloh it is only 17K lines for now, yet it supports more than 100 IL instructions, includes a lot of reflection code, has more than 20 optimization passes, etc.).

Wednesday, November 6, 2013

Code Refractor 0.0.1 - First release is available

After close to 7 months of development (I started the project at the beginning of April, before making it public), the first release of CodeRefractor is here.

What does it contain compared with the very first introduction? Many CIL instructions and some generics support are there. Compared with the introduction of CodeRefractor, which was basically only able to set fields, work with arrays and do math, I will try to summarize what happened in between:
- most of the development is documented; look into the Documentation folder (.docx format) for the main subsystems
- compiler optimizations are no longer naive implementations: they are more powerful and use usage/definition information, so the optimizations understand the code more precisely; similarly, the compiler's branch handling is based on usages and definitions, so many intermediate variables are removed naturally; an inliner implementation works intelligently
- the runtime is mostly written in C#, and more methods can be added by writing C# with C++ annotations
- many CIL instructions are implemented, making C#-only programs (that do not depend on the System runtime too much) work. The biggest missing piece is delegates. Partial implementations of generics (very limited) and unsafe code are done
- a primitive class hierarchy analysis is done, and as the implementation matures, expect many devirtualizations to be done safely and correctly by the compiler
- unique (to my knowledge) among CIL implementations, the purity and escape analyses allow the optimizations to be really aggressive: calling a pure function with constants is the same as using the constant result, so Math.Sin(0) is always evaluated to 0 (zero); and for a program written with CR's escape analysis in mind, objects are allocated on the stack or the smart pointers are (safely) converted to raw pointers, improving runtime performance. This can generate final code that is faster than .Net programs.
- the optimizer works as a global optimizer, which makes some inter-procedural (whole-program) operations possible: program-wide merging of array data, strings and PInvoke methods makes your final program smaller

Even though many things work, this release cuts many corners, and some parts were written in a bug-prone way, so expect the resulting C++ code not to compile; and as the runtime has few classes of its own, also expect that non-trivial programs will not compile. If it compiles, it should run fast.

After you extract this release, which is just a .zip file, you should copy in a GCC distribution. For simplicity I'm using the great Orwell Dev-C++, and I copy C:\Dev-Cpp\MinGW64 to <CodeRefractorPath>\Lib\Gcc. Note that you have to rename the folder at the end, but other than that, it should work just fine.

Anyway, many things are missing, and everyone is encouraged to test it and to implement small pieces starting from the GitHub project: every small piece in place brings your program closer to working, or, if it already works, makes it work better and more stably.

For questions and feedback you can use the Google Groups page.

Friday, November 1, 2013

Opinion: Native and Managed, what do they really mean?

Microsoft and the virtual-machine world use many definitions, and depending on emphasis they can say things that to some extent make no sense: "native performance", "performance per watt"; in my view it is all based on terms that are not clearly defined.

I notice that this emphasis changed even more with phones, and I cannot clarify it without definitions, which would again defeat the purpose of this entry, so I will stick to the technology side:
- native is in many people's minds associated with ahead-of-time compilers, meaning that you write code in your language of choice and end up with an executable that runs directly on the CPU
- managed/virtual machine means applications are compiled to something intermediate, and before execution a runtime reads this "bytecode" and compiles it on the fly

Because of how compilers work, compilation is expensive, which means most virtual machines make compromises to keep interactive applications possible. This is why virtual machines are somewhat lightweight when compiling code, reducing the analysis steps they perform. That means two things: big applications typically start slower than their "native" counterparts, and in many cases the compiled code quality is a bit weaker, so the code runs anywhere from a few percent slower to several times slower (more on this later).

Given this view, is it true that the managed application world is that much slower than the native world? Of course, but as with many answers in life, it depends:
- most of the things you see on your screen depend on the GPU (video card) to be drawn, so even if the slowness of a virtual machine is there, if the drawing is done on a separate thread the animations may run independently
- most virtual machines compile the hottest of the hot code (including JavaScript VMs, which today tend to use a separate thread/process to compile JS), so for simple/interactive applications you get good (enough) performance
- some VMs have parts of the code compiled into "native" code, for example using NGen; and even if NGen is not a high-quality code generator, it is good enough, and it also makes your application start fast
- VMs allow calling native code directly, so if a loop is not well optimized by the virtual machine, the developer can drop to native code for it
- VMs tend to have a fast memory allocator, so an allocation-heavy application may run faster than a native one, if the native application doesn't use memory pools or other caches to speed things up

In this hybrid world, "performance" is less meaningful than it was when we talked about full Java applications 10 years ago. It is even less meaningful now that GPUs and GPU computation matter so much.

This is why Microsoft's "Going Native" campaign puzzled me... the easiest way to achieve this (in the "managed" world) is to compile the bytecode up front using NGen. They use this in Windows Phone 8, where your MSIL code is compiled in the cloud.

People were using C# not because performance was bad, but because it was good enough. C++ started being used again because Microsoft did not invest in improving the quality of the .Net generated code for a long, long time; C++, by comparison, kept improving, at least through the work on Visual Studio's Phoenix backend, by the GCC team, and of course the Clang/LLVM team.

The last issue with Managed vs Native is that people use it as a marketing pitch, like here: https://www.youtube.com/watch?v=3vGV4fF4KCM (minute 34:40), where web technologies like JavaScript are contrasted with "True Native". Even if we disregard the word "Scripted", what is the performance profile of a JS application? If you use Canvas, it is hardware accelerated today; if you load code, most of it will be treated as dead code, and the rest runs at maybe one fifth the speed of fully optimized code, which, if it runs just once to sum all the items in a column, is effectively instant.


Code Refractor borrows decisions from other open-source projects: from Vala or Objective-C (smart pointers/reference counting instead of a GC), from Java (escape analysis and class hierarchy analysis), from GCC (pure function annotation), and from Asm.js (take a subset of the language, optimize it properly, then add another feature and optimize that one properly), because sometimes performance matters and is achieved by design. The long-term importance of CR (right now it is just a very limited pre-alpha) is in my view the approach: the "native" step basically means spending a long time optimizing up front.

What CR can bring is not always performance but more capabilities: using Emscripten (a C++ to JS compiler) or Duetto you can start from C#. For my part, I hope it will at least be used to migrate some Mono applications like Pinta or Tomboy, instead of something like GNote (a by-hand C++ translation of Tomboy) being written where Mono is not desired (Fedora Linux, anyone!?).

Tuesday, October 22, 2013

Status update: Part 8

This post is fairly light in content, as I noticed some bugs introduced in the newest bits, but there are also some improvements:
- the generated code no longer uses namespaces for generated types; some name mangling was fixed because of this
- the duplicate runtime methods do not appear anymore
- documentation is vastly improved: as I am using CodeRefractor for my bachelor thesis, I described all the optimizations and the various components of CR extensively. There are 40 written pages of content that may be interesting. If you are curious (even just as reading material) about the various internals or how the code is made, I recommend reading the Documentation folder.
- generic class specialization works (at least partially) and name mangling handles generics better
- there are also (small) advances in delegate support, but it is still not finished. The good part is that the delegate code is isolated, and when I have time I will be able to finish it

The main area of focus for now remains bug fixing and code cleanup.

With this post I want to thank JetBrains for offering a (free) ReSharper license to help advance CodeRefractor. I think R# is a great tool in itself, and if you are not using it I hope you will try it and see how it improves the quality of your code base.

Monday, October 7, 2013

NBody: can C# be *very* fast?

As you read in the previous blog post, Java and .Net can be as fast as (or even faster than) C++-compiled code. This was a reason for me to look into the performance discrepancy and to implement three "remaining" optimizations that were making the performance gap really big:
- escape analysis: allocations for objects that do not escape are made on the stack
- common subexpression elimination
- loop invariant code motion (LICM)

So what does this mean in performance terms? I will not give exact numbers, but compared with the best C++ time, you now get a bit under 1300 ms (on Linux) and around 1400 ms on Windows (on both VC++ and MinGW).

Why these optimizations were so important:
- escape analysis removes, for function parameters among other cases, the need to increment/decrement reference counts
- common subexpression elimination: when many small expressions repeat, they are computed only once. This also works across function calls (if the functions are evaluated as pure). So if you compute a rotation matrix using cos(alpha) and sin(alpha), you don't have to cache the sine and cosine yourself; the compiler does it for you automatically.
- LICM (see the Wikipedia article) works like common subexpression elimination, but for expressions that do not change across loop iterations: they are executed once before the loop instead of at every iteration. This optimization also works with pure functions, so a call to a pure function can be moved outside the loop as well (see the sketch below).
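A minimal sketch (my own example, assuming Math.Sin/Math.Cos are recognized as pure) of the kind of loop these two optimizations target, and what the hoisting is conceptually equivalent to:

    // Before: the pure sin/cos calls are recomputed at every iteration.
    static void Rotate(double[] xs, double[] ys, double alpha)
    {
        for (int i = 0; i < xs.Length; i++)
        {
            double x = xs[i] * Math.Cos(alpha) - ys[i] * Math.Sin(alpha);
            double y = xs[i] * Math.Sin(alpha) + ys[i] * Math.Cos(alpha);
            xs[i] = x;
            ys[i] = y;
        }
    }

    // Conceptually after CSE + LICM: the calls are computed once, before the loop.
    static void RotateHoisted(double[] xs, double[] ys, double alpha)
    {
        double cosA = Math.Cos(alpha);
        double sinA = Math.Sin(alpha);
        for (int i = 0; i < xs.Length; i++)
        {
            double x = xs[i] * cosA - ys[i] * sinA;
            double y = xs[i] * sinA + ys[i] * cosA;
            xs[i] = x;
            ys[i] = y;
        }
    }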


This also means that I will not work on optimizations for some time (unless there are bugs), but you can try to generate code and the result "should scream".

Tuesday, September 24, 2013

NBody: can C# be fast?

One question for blog readers is: "how fast would my C# code be if it were written in C++?" In many cases people come with their own "numbers", like:
- Java JIT is faster than C (or C++)
- C# is faster than Java because it has structs
- C++ is faster because everything in this world is written in C++ for performance, for example games, etc.

Of course this blog comes with its own spin, so if you have a strong opinion about who you expect to win, please ignore what is written next.

A benchmark I looked at optimizing, since it has many operations common to a computing kernel, is the NBody benchmark, because it includes:
- mathematical operations
- math intrinsics
- somewhat complex array accesses and iterations
- it is written in a cache-friendly way (the reads are mostly sequential), so many memory indirections would hurt performance
- it doesn't depend on complex library functionality (like Regex would)

Given this, the NBody source code, hardcoded to 5,000,000 iterations, runs in the times below.

(Update: there was a last-minute fix; the .Net times had been taken from non-release builds, and the best .Net time is now 1550 ms.)
Runtime                      Time (ms)
.Net 4.5 64 bit              1550
MinGW 4.7 32 bit             2860
MinGW 4.7 64 bit             2840
Win JDK 6 -server 32 bit     1500
Linux JDK 7 -server 64 bit   1444
Linux G++ 4.7 64 bit (-O3)   1494
Linux G++ 4.7 64 bit (PGO)   1378

Some people will notice that I didn't list MS VC++ (on Windows) or Clang on Linux. In fact I did test them, but maybe my setups were wrong: MS VC++ was slower than MinGW on Windows, and Clang++ was also slower than GCC, running in about 2 seconds (on my up-to-date Mint Linux). The point is mostly to compare the (best) managed compilers against the (best) native compilers. Also, the test does not compare C# against hand-written C++, but C# against a C++ translation of this C# code made by CodeRefractor, which readers can interpret any way they want.

So what I found in this testing:
- if you do low-level math, Java may save the day: it is by far the safest choice, it is very easy to set up, and converting the C# code is easy
- if you know what you're doing, you can get much better performance by using the right OS/compiler in the C++ world. Even though I didn't use the Intel compiler and I only have access to MS VC++ 2010 for now, using MinGW and Linux can get you somewhat faster code (at least given that the code is written fairly low-level but optimization-neutral)
- MinGW gives virtually the same NBody performance on 32 and 64 bit (a big surprise to me), at least on this code. Maybe it is a problem in my setup, but I picked the best time for this test, and in general I sometimes got slower times on 64 bit
- using PGO, 64-bit GCC with -O3 on Linux gave the best performance, at least on my machine.

For people wanting to reproduce these tests, my machine is:
i5 540M 2core @ 2.4 Ghz
6 G RAM
The source is at this revision (the output C++ file), which is the NBody benchmark as produced by CodeRefractor
Windows 7 64bit
GCC is 4.7.2 under MinGW (part of DevC++ distribution)
VS 2010
.Net 4.5
For Linux: Mint 15 (with updates)/GCC 4.7.3
Best running GCC arguments:  -std=c++11 -ldl -O3 -flto -mtune=native -march=native -fprofile-use 
where:
-O3 = level 3 of optimizations,
-flto = link-time (whole-program) optimizer
-mtune=native = optimize for my machine (I think it may not matter)
-march=native = use instructions of my machine (I think it may not matter)
-fprofile-use = use the PGO profiles collected from a previous run
In fact, without mtune and march (on 64 bit) the performance is basically the same, but I listed these parameters to make sure that readers who try to reproduce the results get a similar level of performance.

Sunday, September 22, 2013

Status update: Part 7

New instructions are implemented, though somewhat partially:
- the ldftn instruction is implemented (a bit hackishly) for now; it is a crucial part of making delegates work
- similarly, the ldelema instruction is implemented, which is very important for making unsafe code (with pointers) work

At the same time, I reorganized the code around the type system, which means that triggering these paths will produce code that (basically) works for these instructions, but the program as a whole may fail to build because partial code gets included.

In time they will work, but they are not in a proper state for now.

From the experience of these implementations, I found that CR needs a lot of small fixes and testing, and as I add support for more instructions (or bigger features), development will likely slow down because I will have to make sure nothing regresses.

If you have time and want to invest it as a user, try the following:
- take the sources from https://github.com/ciplogic/CodeRefractor (there is a "Download ZIP" option if you don't know how to clone, or don't want to clone it)
- make a small program inside the SimpleAdditions solution
- you will likely (if you do it right away) get some compiler errors
- report them either here (on the blog) or on the GitHub issues page: https://github.com/ciplogic/CodeRefractor/issues

It is very important for me to know which (small) programs you are running, so I can focus testing on them. If you know C#, even if you don't understand how a compiler works, you can help this project in small or large ways.

I am asking for help because CR is not financially backed by any entity (the only backing is my free time), so only through contributions can it become a project useful to users (for example, a developer writing OpenGL games in C# and later recompiling them for a platform with no C# support, or avoiding paying for a Xamarin Studio license).

Tuesday, September 10, 2013

Status update: Part 6

This update is critical for the performance of math code and also brings fixes in several areas, so I recommend playing with the Git source on GitHub:

- the logic of mapped types (bootstrap types) is in better shape. Strings are now implemented directly in .Net code;
- a simple class hierarchy analysis with basic devirtualization support is done (CR doesn't support real virtual calls at all for now, but if .Net reports a virtual call and CR can detect the concrete type, it will remove the virtual call and make it a direct call). This is done per instance;
This makes the following program work:

            var charData = new[] {'H','e','l','l','o',' ','w','o','r','l','d'};
            var s = new string(charData);
            Console.WriteLine(s);

Sure, not impressive, but it requires many parts to be in place;
- all objects are constructed using make_shared, which means a speedup in loops where you allocate multiple objects
As more parts fall into place, I expect that in 2-3 months from now I will make a first release milestone (maybe for the New Year!?). I hope to get the StringBuilder, String and File classes basically working for common operations.

A mini-roadmap for the next release:
- StringBuilder.Append(String) will work
- bug fixes will be directed at getting higher optimization levels working with programs that use StringBuilder

Thursday, August 29, 2013

Status update: Part 5

This is a slightly early update, but a noteworthy one:
- optimizations were rewritten using usages & definitions information (although it is not a full use-def chain), which makes the optimization step more precise and removes duplicated code
- the speed is back on par with the simple optimizations that existed before
- boolean operations and all simple expressions that are found are evaluated at compile time
- similarly, fixes were added to pure function evaluation (functions with mathematical expressions and simple operations) so that calls with constant arguments are evaluated

So all the things that were planned were done ahead of time.

There are two areas I am planning to work on further out; one is a regression and one is a new feature:
- strings have not worked since I removed the backing C library support; I will need to work around some C specifics using CR's OpenRuntime design
- Delegates, and Delegates from PInvoke

If you find areas that you think are more important to work on, write to me privately at ciprian (dot) mustiata (at) gmail com.

There is still no release (no 0.0.1), and if you know how to make installers and have time to package a release, that would be great; I will publish it. For now, fixing bugs is more important to me than a release (not that a release is unimportant, but with critical features still missing, some people would complain about its state, so I don't think it is releasable, unless someone wants a release in order to contribute back).

Thursday, August 22, 2013

Status Update: Part 4

This month was hectic for me, but either way, you can see some updates in CR.

* first and foremost, there was a bug fix so that PInvoke calls with conventions other than StdCall work. This fix is critical for having a working PInvoke. With PInvoke fixed, the other big missing part is delegates and a way to invoke a callback through a delegate (a fairly common pattern in wrapper libraries like OpenTK)
* a simple purity checker for functions was written, so if a function contains only simple math operations, you can expect that a call to it with constant arguments will be evaluated at compile time
* a somewhat improved framework for optimization analysis was added; the generated code is not yet as good as before, but it will catch up. The noteworthy optimization "remove dead store" is implemented, which makes dead code elimination more aggressive
* some bugs were discovered and they will be addressed in the next iteration
* I will make the optimization part a thesis project (the thesis can be read inside the Git repo, for those interested), which means that over the next year the optimizations should become more robust and more powerful

What I plan to do next:
* bug fixing
* catch up on optimization front
* make sure that (almost) all simple math expressions are evaluated at compile time. Right now there are small misses here and there, for example:
x = (double)2; 
is not rewritten as:
x = 2.0;
which in turn disables other optimizations in some code

Sunday, August 4, 2013

Opinion: C/C++ is today's assembly as far as performance is concerned


The world relies on the C language, so anyone who cares about interoperating with external components has to consume C. This makes C an easy output target. But targeting C (or C++) in many ways gives you better performance than writing your own assembly: today's C compiler optimization pipelines include many advanced optimizations, including but not limited to:
- very good register allocation
- very good local and dataflow optimizations
- very good math performance
- fairly weak (unless you enable link time optimizations) inter-method optimizations

So writing a compiler that targets C (or, like CR, C++) means it can focus on other important work before the code is fed to the C++ compiler:
- reduce redundant smart-pointer operations (smart-pointer assignment is expensive, and the C++ compiler will not simply do this for you; the unoptimized C++ output for NBody is about 8 times slower than the equivalent .Net time, but after removing the redundancies the C++ code gets faster)
- simplify some logic to remove variables from the code, so the C++ compiler has less to think about
- do some basic dead-code elimination and dataflow analysis, so that even if the target C++ compiler is not that good, the C++ code it compiles is not completely inefficient

There are cases where assembly was used for a performance benefit, where you didn't want to wait for the compiler to add support for the instructions you were missing, or worse, where the compiler would never generate code using those instructions. I'm talking about SSEx or AVX. But an up-to-date compiler gives you this: Visual Studio 2012 supports AVX (and SSE2), GCC does too, LLVM does too. For free, for the loops you made compiler-friendly. In fact, not writing them in assembly is really a great thing, because the compiler can inline your method, and most compilers will not inline assembly-only methods.

Similarly, writing things in C++ up front makes your code work on platforms it was maybe never intended for in the first place, like ARM64, or at most with very small changes.

The final blow, in my view, is that today's CPUs are very different from those of, say, 20 years ago; the main difference is that processors today are out-of-order, not in-order. This means instructions are mostly executed speculatively, and most of the time spent in code is "waiting": for memory to be written, for memory to be read, etc. As a result, optimizations like "shift left by 2" and similar tricks give no performance benefit, while optimizing your data to fit into the L1, L2 and L3 caches can sometimes give a much bigger speedup than writing SSEx code (see this very interesting talk).

This is also why CodeRefractor, at least for the coming months, will try to improve its capabilities with only a small focus on optimizations, and those will certainly be at the high level. The feature I'm working on now is merging strings across the whole project, to give better cache locality. Will it speed up the code greatly? I'm not sure, but the performance that C++ gives from the get-go is good enough to start with.

Wednesday, July 31, 2013

Status Updates - Part 3

As I was on vacation, I only did smaller tasks in my free time, but there are some noteworthy updates, mostly on the optimization front:
- there is a [PureMethod] attribute you can mark functions with. If this attribute is found, the function is considered pure and, as a consequence, if you call it with constants the call is evaluated at compile time (see the sketch after this list). It would be great if in the future purity were computed automatically, but that is a longer-term plan (it is possible, but there are many cases to handle)
- inlining is possible (at least for simple functions), but the optimization is disabled as it requires a lot of testing. Still, it opens a lot of performance possibilities: if you call a function with a constant and that method is inlined, more optimizations can successfully kick in. The medium-term plan is to bug-fix and test the inliner so it works for most small cases
- the compiler optimizer is split into parallel and serial optimizations. The good part is that, as more functions are defined, all cores are used to compile the functions in parallel. The inliner (and a future purity computer) are serial optimizations. This reduces the C++ generation time for NBody (on my first-gen i5) from 200 ms to 150 ms; the C++ compilation itself still takes the longest
- function bodies are represented as a sequence of simple operations, so optimizations that delete one item at a time were rewritten to be much faster by doing the deletes in batches
- unit tests are a bit weaker right now, but they compile/run much faster. They test the compiler's ability to produce output, not the execution output. They now run properly, so unit testing is working again
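A minimal sketch of how the [PureMethod] attribute is meant to be used (the method body is my own example); a call such as CircleArea(2.0), with a constant argument, should then be folded at compile time:

    public class Geometry
    {
        [PureMethod]
        public static double CircleArea(double radius)
        {
            return 3.14159265358979 * radius * radius;
        }

        public static void Demo()
        {
            // With the purity information, CR can evaluate this call at
            // compile time and emit the constant result instead of a call.
            double area = CircleArea(2.0);
            Console.WriteLine(area);
        }
    }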

So in short, you will get the same code (if you don't mark things with [PureMethod] everywhere), just faster.

I added code to reflect over APIs; it will be needed to generate stubs for OpenRuntime code. This code needs some love, and if any readers are interested, it would be great if someone could look into generating some empty library wrappers.

Future plans:
- enums support
- string type improvements (partly dependent on enums)
- string merging - all occurrences of the same string should be defined just once in a string table
- (maybe) fixes to the inliner - at the very least, the function-call overhead should disappear in cases that can be detected: empty functions, simple setters/getters
- (maybe) a purity checker - computing purity gives extensive speedups when the developer uses constants. If functions can have their purity computed (without [PureMethod] annotations), then wherever they are called with constants they add zero overhead at execution time

Tuesday, July 9, 2013

Status Updates - Part 2

This entry will be brief (as I will soon be on vacation):
- the switch CIL instruction works, which basically means you can write switch statements
- the code was moved to GitHub, because of its ubiquity: https://github.com/ciplogic/CodeRefractor
- I looked into the delegates code and will postpone it, as it looks like a big feature, so it is unlikely I will have an implementation early enough to be useful in the next two to three months
- I removed the need for a static library written in C++ that had to be linked alongside the generated code. All the code is now taken from the OpenRuntime assembly

Future (planned) developments:
- a month from now I will try to fix the unit tests (with the many refactorings, the tests were left pending fixes). Automatic testing is critical, at least going forward, as some components (like the optimizations) interact and it is critical that they work correctly
- add optimizer-friendly annotations to functions (for people who know what they are about). I am thinking of marking functions without side effects. This is critical for evaluating calls with constant arguments and inlining their value. Say you call Math.Cos(0): you want that call not to be executed at runtime but to be replaced by the value 1.0. Similarly, string functions can be computed if their parameters are constant. It is also important to allow annotating your own code with a "no side effects" flag; the runtime will trust you and evaluate parts of your code at compile time
- look into string merging (as was done for ldtoken buffers): if the same string is used in several places, it should be loaded from a string table, not replicated over and over again

Longer term:
- I would love a "hello world" SDL (maybe OpenGL) application to work
- a PInvoke Pinta plugin to work (without delegates)
- extract resources
- enums
- structs code
- make a small "IDE" integrating AvalonEdit (on Windows) or something similar, to enable a fast testing cycle - or some other way to make it easy for developers to shorten the cycle of testing their small applications against CodeRefractor

Monday, July 1, 2013

Status Updates (I)

More work has been done to advance the compiler and the runtime, but for the first time there are some user-visible changes inside the repository (there is still no downloadable package, but this will be improved soon).

Bug fixes:
- bug fix: there were cases where using the CROpenRuntime C# and C++ methods in the same source (like Console.WriteLine(double) and Console.WriteLine(float)) made the final code emit WriteLine(double) twice; now WriteLine(double) is added just once
- PInvoke calls, for very simple cases, are executed correctly. Loading the native code uses a Windows-only implementation (LoadLibrary, GetProcAddress) that works on both 32 and 64 bit; the Linux/OS X equivalent (dlopen, dlsym) is not done yet. Also, there is no marshaling yet
- ldtoken: this instruction appears because when you initialize a long array using an array initializer, .Net uses a memory copy from an address inside the assembly. This requires a constant array table. Because we are compiling ahead of time, the implementation does one thing the .Net implementation doesn't: it merges this data, and the executing code points to an index in this array table (see the sketch after this list)
- in the past, instruction dispatch was done by the opcode's string name. Now at least some instructions use the numeric opcode value directly, which translates into a slightly faster lookup. I will move more instructions over in the next iteration
- the compiler starts with Mono and MonoDevelop from the latest Linux Mint version (Olivia): small version changes were made in the solution. Linux support is not complete, but for people wanting to try, test and tweak it, it is a small step in the right direction
- there is an Inno Setup script, which means there will be an installer fairly soon
- there is a compiler launcher UI (screenshots not reproduced here: a picker for the input assembly and output exe, and the compiler options)

- the command line logic is done, meaning you can now pick the input/output assembly and an assembly for the runtime (the default will most likely be fine)
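For illustration, this is the kind of array initializer that makes the C# compiler emit ldtoken (through RuntimeHelpers.InitializeArray); the merging into a single constant table is how CR handles it ahead of time, as described above:

    static readonly int[] Primes = { 2, 3, 5, 7, 11, 13, 17, 19 };
    // .Net fills this array with a raw memory copy from a data blob stored in
    // the assembly (ldtoken + RuntimeHelpers.InitializeArray); CR instead merges
    // such blobs into one constant array table, and the generated C++ code
    // points to an index inside that table.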


As I will be on holiday, July will most likely be a fairly empty month in terms of progress, but I will probably do some bug fixing and try to smooth the way for the first installable package (Windows only).
The longer-term roadmap would be:
- support for unsafe code
- try, as a target, to compile a Pinta effect (one that is not multi-threaded)
- support for switch keyword (opcode)
- (later) add support for Delegate and MulticastDelegate. Without them, most of the goodness of lambdas, Linq, or just calling callbacks is not possible.

Sunday, June 16, 2013

CR: Status updates - bimonthly updates and a roadmap

CodeRefractor is an enabling technology that, in some cases, lets developers run C# (or any .Net language) applications on a C++ runtime (with no other .Net or Mono dependencies). This blog entry shows a roadmap and future plans. Also, for people interested in its progress, I will try to give a bimonthly update.

As a mini legal note, this roadmap is subject to change, and the information presented is meant to show the parts I found interesting.

So what was done in the last 10 days?
- the optimization setup is program-wide: it lives in the part of the logic that will handle the command line parameters. In the past, the coder had to specify at every function compilation step which optimizations were wanted. In the future, the command line will set all optimizations (based on the optimization level)
- methods no longer reside in the class header: this makes for saner-looking code, as classes store only their state. There is a clearer name-mangling scheme.
- types (System.Type and System.Assembly) are mapped to classes that will be serializable in the future: this is critical for the plan of not compiling everything at every compiler invocation
- added the CodeRefractor.OpenRuntime assembly: this part is critical for defining runtime functions in C#.
The sample class below maps the System.Math class logic. Not all methods are mapped, but it is a clear way to write code that doesn't require users to write and tweak a C++ library (as was done before).

    [MapType(typeof(Math))]
    public class CrMath
    {
        [CppMethodBody(Header="math.h", Code="return sin(d);")]
        public static double Sin(double d)
        {
            return 0;
        }

        [CppMethodBody(Header = "math.h", Code = "return cos(d);")]
        public static double Cos(double d)
        {
            return 0;
        }
        [CppMethodBody(Header = "math.h", Code = "return sqrt(d);")]
        public static double Sqrt(double d)
        {
            return 0;
        }
    }
 
The generator code was cleaned up a lot. The mapping of types made things a bit more complex, but in the future I expect the code to be simplified further.
    [MapType(typeof (Console))]
    public class CrConsole
    {
        [CppMethodBody(Header = "stdio.h", Code = "printf(\"%lf\\n\", value);")]
        public static void WriteLine(double value)
        {
        }
        [CppMethodBody(Header = "stdio.h", Code = "printf(\"%d\\n\", value);")]
        public static void WriteLine(int value)
        {
        }
        [CilMethod]
        public static void WriteLine(float value)
        {
            WriteLine((double)value);
        }
    }
The other improvement is that some methods can be mapped as C# methods. So Console.WriteLine(float value) is in fact mapped to CrConsole.WriteLine(float value), which in turn calls CrConsole.WriteLine(double value), which itself is a C++ method. If the code calls both methods - WriteLine(double) and WriteLine(float) - the C++ body is linked only once.

Roadmap
The code as it is seems to support some .Net programs (1.1-era ones), so what is the next important thing to do?
- make it possible to save/restore a compilation result. The plan is that if you compile a huge project (even though right now that doesn't work, as CIL support is not that extensive on the CodeRefractor side), the assemblies that are already compiled are reused. This saves time and makes cases easier to debug. As CR represents the data in an intermediate format, this is somewhat easier to debug and track
- make PInvoke work: PInvoke is the next logical step toward making "real programs" work. The next item often goes together with PInvoke, so it will also be a high priority
- make unsafe code work: this makes under-the-cover operations possible, like an optimized Array.Copy. It also goes very often with PInvoke, as PInvoke frequently uses pointers to various structures
- make the ldtoken instruction work: CIL support is not complete, and one instruction that appears fairly often is ldtoken. It is used to initialize arrays from raw data. This requires marshaling support, unsafe support (the previous item) and a better type representation inside CR (this last part was started and partially done in the previous days)

What is not yet planned (where you can help):
- sane command line parsing: right now the input assembly and output .exe are hardcoded. The same goes for optimizations, the output folder, and picking the compiler (it is hardcoded to a GCC that has to be inside the folder C:\Oss\DevCpp)
- generics: they are very useful (and they will be implemented at some undefined time in the future), but generics are almost useless without a generic collections library, and that will take time
- Linux/OS X support: it should work on Linux (in fact the changes are so minimal that I keep asking myself when I will add support for it), and MonoDevelop/Xamarin Studio does load the solution, but when a crash happens the debugger does not behave so gracefully. Setting up all the parts takes a lot of time configuring and tracking down issues. If you have time and want to try it, I am ready to support you; this task should not require any advanced C# experience
- String and Stream support: there is minimal support for strings, and PInvoke support will simplify and enable some string handling, but String is an extensive class, used in very many places
- Reflection: like generics, it will certainly be done (at some point), but System.Type reflection requires many systems to be in place, like a type table and many runtime functions to make it happen
- an installer: this depends on command line parsing (at least), but is fairly straightforward to do
- many more...

Friday, June 7, 2013

Introduction to Code Refractor CIL to Native AOT

Welcome to CodeRefractor, an experimental project that compiles MSIL/CIL opcodes to C++ and then automatically generates a native (binary) version of the code.

The idea of this converter is to make it possible for MSIL code to run on machines that have neither Mono nor .Net, using C++ paradigms as much as possible.

Some questions I think an interested reader may have, and their answers:

Is CodeRefractor a code-translator?
By a code-translator I mean something that translates the opcodes to C++ directly. The exact answer is no: it is a full compiler that uses an intermediate representation.

In very brief steps, CodeRefractor does the following:
- reads a .Net assembly using reflection
- starts from the entry point (the Main method) and reads the CIL bytecodes/operations
- these operations are rewritten into an intermediate representation
- the intermediate representation is (optionally) optimized, which simplifies simple expressions and removes redundancies
- the intermediate representation is written out as C++
- a C++ compiler is invoked, the result is linked with a static library that implements some simple operations, and the executable is generated

This native executable should behave as much as possible like the original .Net application.

The most similar projects with a similar design are mostly in the Java world:
- Excelsior JET
- GCJ
- RoboVM
- Mono --full-aot mode

This project will not be able to generate a C++ version of arbitrary C# (or Boo, F#, etc.) code, and some specifics will make it behave differently (for example, many times slower), but a program written with CodeRefractor's design in mind should run similarly to C++ code (even though it is written in C#).

How will CR handle licenses to not break them?
CR will not scan any GAC assemblies (or system ones) for instructions, but it lets the user add their own replacement implementations. We recommend that users do not read code whose license does not allow them to scan it. We plan for this project to use Mono's BCL implementation (the CIL bytecodes) as a "backup" when something is missing.

C++ has no GC, so how is memory management done?
The code uses smart pointers and requires a C++11 compiler (to make sure the compiler supports smart pointers). This also means that users have to take care of memory cycles. For now, an easy way to break cycles is to set references to null. Weak pointers are planned for the future.
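A minimal sketch of what breaking a cycle by hand might look like in user code (class and field names are mine, for illustration only):

    class Node
    {
        public Node Next;
    }

    static void Demo()
    {
        var a = new Node();
        var b = new Node();
        a.Next = b;
        b.Next = a;     // reference cycle: with pure reference counting this would leak

        b.Next = null;  // break the cycle by hand before the objects go out of scope
        a.Next = null;
    }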

Does it support generics, Linq, Reflection, the Math.Cos method, ... (insert your feature here)?
No, it doesn't, and some of these limitations will be fixed as the software evolves. We hope that feedback and help from the community will address as many of these features as early as possible.

What does work?
Hello-world kinds of samples; the most complex one is the NBody benchmark. In short this means: classes, static and non-static (non-virtual) methods, many primitive-type operations, while/for loops, array types, fields (but not properties yet), constructors.

Where is the project?
You can study, read and contribute to it here. The license file is not written yet, but the licensing is: GPL2+ for the compiler and MIT/X11 for the libraries. Rephrased (for companies): you can use it commercially, but if you make changes to the compiler, you have to contribute them back. If you leave the compiler as-is, you can make whatever changes you need in the libraries. We are using basically the same licensing as the Mono project.

Legal note: we are neither endorsed by nor related to the Mono project or Microsoft.