Thursday, August 29, 2013

Status update: Part 5

This update comes a bit early, but it is noteworthy:
- the optimizations were rewritten on top of a Usages&Definitions analysis (though it is not a full use-def chain), which makes the optimization step more precise and removes duplicated code
- compilation speed is back on par with the simpler optimizations that existed before
- boolean operations and all detected simple expressions are evaluated at compile time
- similar fixes were added for pure-function evaluation (functions containing only mathematical expressions and simple operations), so calls to them with constant arguments are folded as well

So everything that was planned is done ahead of schedule.

There are two areas I plan to work on over the longer term, one a regression and one a new feature:
- strings have not worked since the moment I removed the backing C library support; I will need to work around some C specifics using CR's OpenRuntime design
- Delegates, and Delegates from PInvoke

If you find areas that you think are more important to work on, write to me privately at ciprian (dot) mustiata (at) gmail com.

There is still no release (not even a 0.01). If you know how to build installers and have time to package a release, that would be great, and I will publish it. For now, fixing bugs is more important to me than a release: not that a release is unimportant, but with critical features still missing, some people would complain about its state, so I don't think it is releasable, unless someone wants a release in order to contribute back.

Thursday, August 22, 2013

Status Update: Part 4

This month was hectic for me, but even so, you can see some updates in CR.

* first and foremost, a bug was fixed so that PInvoke calls using conventions other than StdCall work. This fix is critical for a working PInvoke. With PInvoke fixed, the other big missing piece is delegates, and a way to invoke a callback through a delegate (a fairly common pattern in wrapper libraries like OpenTK)
* a simple purity checker for functions was written, so if a function contains only simple math operations and is called with constants, you can expect the call to be evaluated at compile time
* a somewhat improved framework for optimization analysis was added; the generated code is not yet as good as before, but it will catch up. One noteworthy optimization, "remove dead store", is implemented, which makes dead-code elimination more aggressive
* some bugs were discovered and they will be addressed in the next iteration
* I will define the optimization work as a thesis project (the thesis can be read inside the Git repo, for those interested), which means that in many cases the optimizations will become more robust and more powerful over the next year

What I plan to do next:
* bug fixing
* catch up on optimization front
* make sure that (almost) all simple math expressions are evaluated at compile time. Right now there are small misses here and there, for example:
x = (double)2;
is not rewritten as:
x = 2.0;
which in turn disables other optimizations in some code

Sunday, August 4, 2013

Opinion: C/C++ is today's assembly when it comes to performance


The world relies on the C language, so anyone who wants to interoperate with external components has to consume C. This makes C an easy output to target. But targeting C (or C++) in many ways gives you better performance than writing your own assembly: today's C compiler optimization pipelines include many advanced optimizations, including but not limited to:
- very good register allocation
- very good local and dataflow optimizations
- very good math performance
- fairly weak inter-method optimizations (unless you enable link-time optimization)

So writing a compiler that targets C (or, like CR, C++) means it can focus on other important transformations before the code is fed to the C++ compiler:
- reducing smart-pointer redundancies (smart-pointer assignment is expensive, and the C++ compiler will not simply remove it for you; the unoptimized C++ output for NBody is about 8 times slower than the equivalent .NET time, but after removing the redundancies the C++ code becomes faster)
- simplifying some logic to remove variables from the code, so the C++ compiler has less to think about
- doing some basic dead-code elimination and dataflow analysis, so that even if the target C++ compiler is not that good, the generated C++ is not entirely inefficient

There are cases where assembly used to be a performance win: you didn't want to wait for the compiler to add support for the instructions you were missing, or worse, the compiler would never generate code using those instructions. I'm talking here about SSEx or AVX. But an up-to-date compiler gives you this: Visual Studio 2012 supports AVX (and SSE2), and so do GCC and LLVM. For free, for the loops you made compiler-friendly. In fact, not writing them in assembly is a really great thing, because the compiler can inline your method, and most compilers will not inline assembly-only methods.

Similarly, writing things in C++ up front will make your code work on platforms it was perhaps never intended for in the first place, like Arm64, or with only very small changes needed.

The final blow, in my view, is that today's CPUs are very different from those of, say, 20 years ago: the main difference is that today's processors are out-of-order, not in-order. Instructions are mostly executed speculatively, and most of the time your code is "waiting": for memory to be written, for memory to be read, and so on. As a result, optimizations like "shift left by 2" give no real performance benefit, whereas arranging your data to fit into the L1, L2 and L3 caches can sometimes give a much bigger speedup than writing SSEx code (look up this very interesting talk).

This is also why CodeRefractor, at least for the coming months, will try to improve its capabilities with only a small focus on optimizations, and those will certainly be high-level ones. The feature I'm working on now is merging strings across the whole project, so they give better cache locality. Will it speed up the code greatly? I'm not sure, but the performance C++ gives you from the get-go is good enough to start with.