Code Refractor - Virtual Machines/Compiler performance musings: 2016

Sunday, November 13, 2016

Fastest small web-server in Java, .Net and Node

Some services are small and return very little data: imagine a service that returns only the number of users, or a very simple text. This case matters in the world of microservices.

So let's say you have decided to build such a service and you have ruled out the non-HTTP options (which excludes, for example, a binary protocol over WebSocket). Great, what next?

Over the last few days I ran some tests (with some help) to see which platform achieves what speed.

I used ApacheBench against the loopback device, with Spring Boot as the baseline. Against Java/Spring Boot (on Windows 10), ApacheBench measured around 5,000 requests per second.

With a very low-level server written in C#, I got around 10,000 requests per second. With Java and a raw HTTP handler, the number was around 11,000.

With JS/Node (the latest Node, using the raw low-level HTTP API), I got around 6,000 req/s.

Sometimes things are not so clear-cut, so we also tried running on Linux (on a similar CPU, but not on my machine). There Node returned 12,000 req/s (but that machine had no Java numbers to compare with).

This made me suspicious, so I looked into ApacheBench and found that ab had higher CPU usage than the server. I rewrote the client as a simpler C# program that is truly multithreaded (ApacheBench is single-threaded) and the numbers got higher: even though the C# and Java servers were simple raw-HTTP handlers, on the same machine they achieved around 30,000 RPS on .Net and 36,000 RPS on Java.

So what did I learn? For a simple server, for example a microservice, I would discard Spring Boot, as Spring seems to be limited to around the speed of Node with JavaScript. Use raw-HTTP handlers written in .Net or Java. At least on Windows, a framework like Rapidoid behaved very badly; there could be many reasons, but it was not faster than Spring by a wide enough margin, and .Net was always faster.

What should you choose between .Net and Java? The line in the sand is easy to draw: if you strongly prefer Java, or .Net, there is only one choice. But if you like both (or you simply want the best technology), pick .Net if you have a swarm of microservices: a simple .Net microservice uses less memory, so you can put more services on the same VM/machine. In 32-bit mode, the memory usage was around 10 MB per process (you read that right: a web server using just 10 MB of RAM, and during the benchmark, under many requests, it never went over 25 MB), compared with around 240 MB for Java. I would say that 20% higher performance for around 10x the memory is not a nice tradeoff.

Keep in mind that the code was written to have very few dependencies. For a "hello world" microservice, whether it uses around 30 MB at startup may not matter so much if the service later has to cache around 1 GB of data to respond very quickly. Still, be prepared that a Java implementation will run a bit faster but will at the same time be more taxing on memory.
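For readers who have not used one, here is a minimal sketch of what a "raw HTTP handler" can look like in Java, using the JDK's built-in com.sun.net.httpserver package. It is not the server that was benchmarked above: the class name, the /users/count path and the "42" payload are made up for illustration, and fetchOnce is only a smoke check, not a load-testing client (it needs Java 9+ for readAllBytes).

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

class RawHttpCounterService {
    // Hypothetical payload standing in for "the number of users".
    static final byte[] BODY = "42".getBytes(StandardCharsets.UTF_8);

    /** Starts the server on loopback; port 0 lets the OS pick a free port. */
    static HttpServer start(int port) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
            server.createContext("/users/count", exchange -> {
                exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
                exchange.sendResponseHeaders(200, BODY.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(BODY);
                }
            });
            // Daemon worker threads, so the pool never keeps the JVM alive on its own.
            int workers = Runtime.getRuntime().availableProcessors();
            server.setExecutor(Executors.newFixedThreadPool(workers, task -> {
                Thread thread = new Thread(task, "http-worker");
                thread.setDaemon(true);
                return thread;
            }));
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fetches the endpoint once over loopback and returns the body as text. */
    static String fetchOnce(int port) {
        try (InputStream in = new URL("http://127.0.0.1:" + port + "/users/count").openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = start(8080);
        System.out.println("Listening on port " + server.getAddress().getPort());
    }
}
```

The whole service is one handler and a thread pool, with no framework between the socket and the response bytes, which is the kind of setup the numbers above refer to.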

Sunday, October 30, 2016

Delphi 10.1 - an unbiased review

If you are a developer, you may see targeted ads recommending Delphi or C++ Builder. Note that if you install C++ Builder (the "Starter Edition"), you cannot also install Delphi Starter, and vice versa.

As for my history with Delphi: I liked the Pascal language, which I learned in high school, and I still kind of like it today. I also used many free editions of Delphi (initially it was bundled with Chip Magazine), like Delphi 6 PE (Personal Edition) and Delphi Turbo (a slimmed-down version of Delphi 2006). I have also used Delphi 2009 and XE3 (licensed to my previous company).

In short, I noticed poor quality in the 2009 toolset, and I have been told that Delphi 2006 was similar, though I did not notice it myself. XE3 also left me a bit disappointed, being overpriced and so on.

Let's get back to the review. First of all, let's talk about the performance of the compiler and of the generated code. Frankly, it is "good enough", but expect it to be on the level of .Net (or GCC at -O1), not on the level of Java (which tends to be around GCC's second optimization level). I am not sure whether publishing numbers would break the license (as far as such things are concerned), but in general I found the code to be a bit slower in math-related work. Even if the compiler does not generate great code (honestly, it is not that bad), it is very quick. In fact, it is so quick that I was really surprised. Of course, I have an SSD and so on, but really, I cannot notice the build times (maybe because of the size of my project).

I also noticed that applications use very little memory, starting with Delphi itself (which is written in a mix of Delphi and C++). At startup, with a "hello world"-like desktop application open, Delphi uses 100 MB! 100! Lazarus, which is less featured, uses even less: 42 MB. Maybe these numbers do not mean much to you, so let's compare with the Visual Studio IDE, which uses 150 MB on my machine. The bigger the project, the more the memory grows. I am not saying that you should evaluate a tool by its memory use, but it looks like the toolset has received a lot of love lately.

I did not notice bugs, at least in my limited usage. I did notice that some features I am used to from Java IDEs or Visual Studio are missing, but it felt snappy and cool. This is especially true of refactorings: there are none in the Starter Edition (or if there are, I did not find them). Lazarus has them!

Is Delphi a good tool? Definitely! Would I recommend it to someone? Yes, especially to companies that cannot afford to install runtimes on customer machines, do not want a buggy IDE (I am thinking here of Lazarus, even though Lazarus has improved there too), and want to buy some commercial components and build an application quickly.

Would I use it? Definitely not. I will list all the gripes I noticed, but many of them are tied to my own experience, so do not take them as coming from a Delphi hater:
- memory management in Delphi is a lost cause in my view: it is overly verbose (only the phone applications have reference counting).
- there is no sane way to do something like Linq. To me, Delphi looks very similar to Ruby (or Visual Basic), but they part ways once lambda expressions are involved. I would love to write something like: lines := FileExtensions.ToLines("input.txt").Where(line=>length(line)>0).ToList();
- there is Lazarus. Even though I personally do not think that Lazarus is a clear replacement for Delphi, it is a damn good tool using a very similar language. Unless you are locked in to a vendor component that has no replacement and cannot be recompiled for Lazarus, I see very few reasons not to report bugs to Lazarus instead of paying.
- even on the IDE side, I found many things nail-biting: in JavaFX, WPF or Gtk#, I know that components align themselves within various containers.
- there is .Net or Java: for people who can afford to ship a runtime with their application (or at least to target .Net 4.0, which is supported from Windows 7 until today), even though the memory consumption of the IDE is definitely bigger, there is no reason not to use these tools, which are very low cost (or even free); memory is really not a limiting factor for most development.

In conclusion: if you want a tool with a very readable language, compiled, native, with low memory requirements, whose output can be deployed on customer machines as a single .exe, Delphi should be considered. If you just want a readable syntax and you want your code to run on everything from Windows 7 until today, I would recommend looking into Visual Basic .Net, bundled with Visual Studio Community Edition (which is free of charge). If you want 90% of what Delphi offers but do not want to pay for a license, look into the Lazarus IDE.

Sunday, July 17, 2016

Make a JDK9 minimalist distribution for your application

Java or .Net applications are small, but sometimes you want to distribute the runtime and the application together in one bundled package.

So, let's say you want to make a JavaFX desktop application. Here are the steps to get there:

Phase 1 - Install the necessary applications/frameworks

- Download JDK 8 (from here)
- Download and install Netbeans for JDK 9 (from here) - it will work without the JDK9 installed
- Download JDK 9 from the Early Access packages for your architecture. They bundle the JRE and the JDK (from here).
- Create a new project in NetBeans (pick a simple non-Maven application)
- go to the project preferences and make sure you add JDK 9 as a secondary JDK
- set both the language level and the JDK of the project to Java 9/JDK 9

Phase 2 - Build the application against Java 9/JDK 9

- write the application as follows:

package javaapplication1;

public class Main {
    public static void main(String[] args) {
        System.out.println("Hello JDK9");
    }

}
- add the file module-info.java to the default package (the root of the src folder) with this code inside:
module javaapplication1 {
    requires java.desktop;
}

Build and run. It should run nicely.

Phase 3 - make a JDK distribution

- change directory to your project path:
cd
- jlink your Java distribution:
jlink --modulepath "C:\Program Files\Java\jdk-9\jmods" --modulepath "C:\Users\<YOUR_USERNAME>\Documents\NetBeansProjects\JavaApplication1\dist\JavaApplication1.jar" --addmods javaapplication1 javaapplication1.Main --output AppRuntime

By default, the Java runtime required to run a Java desktop application is 188 MB (on the Windows x64 architecture).

The AppRuntime directory (which, as far as I can tell, does not include my original application, but leaves out most of the runtime that a JavaFX application does not need) is only 63.4 MB.

- run your application using the minimalist distribution:
"appruntime/bin/java" -cp "dist\JavaApplication1.jar" javaapplication1.Main

Bonus:
If you really want to save space, you can compress your distribution:

jlink --modulepath "C:\Program Files\Java\jdk-9\jmods" --modulepath "C:\Users\<YOUR_USERNAME>\Documents\NetBeansProjects\JavaApplication1\dist\JavaApplication1.jar" --addmods javaapplication1 javaapplication1.Main --compress=2 --output AppRuntime

Now your Java distribution is less than 40 MB!

If you don't use JavaFX (that is, for very simple applications which need just java.base), your distribution can slim down a lot: to be more precise, the Java distribution is around 22 MB as of the latest build (with compression). This includes the java.exe launcher and the server environment.

Why is it important?

The most important reason I see for building a minimalist distribution this way is the many deployment scenarios where installing a runtime is a big NO-NO: for example server environments, or locked-down Windows environments where running local applications is still allowed, and so on.

Even more helpful, you can eventually put everything in a zip that needs no complicated configuration.

Saturday, June 25, 2016

Performance in Java vs .Net - 2016 edition

The motivation for writing this blog entry came when I commented about my experience with Java and performance in a .Net forum (on LinkedIn), and some people thought that I had missed some small or not so small points. Please go to the Nota Bene section to see this entry's assumptions.

I have found that Java is very often faster than .Net. How can this be, given that I have worked with C#/.Net for at least 6 years and have less experience with Java (though I know it well enough)?

Some reasons why I think this is true:
- .Net has a much weaker compiler. For example, given two identical functions, the JVM will likely generate better code. I wrote identical "optimized" versions of a prime number calculation, making sure there is no math which Java can auto-parallelize. The algorithm computes the 150,000th prime number 100 times. (I ran it 100 times to get a clear average.)

My computer times are:
.Net: 12719 ms
Java: 11184 ms

The code for reference is at the end of this post (copy-paste it into your IDE; the changes needed for C# are minimal: use Environment.TickCount to take milliseconds).

This is very likely because Java has a tiered compiler with a better register allocator. This code is mostly math and array accesses.

- the most common runtime classes seem to be better tuned. At work I have a formatter which is used very often. Replacing the "String.format" call with a "hard-coded" formatter sped it up over 50x compared with the Java library code. I ported this code to C# (with virtually no modification):
.Net: 3703 ms
Java: 1491 ms

The formatter has less code, necessarily, but that code has more conditions to optimize, some casts to remove and so on. Even more, allocating a few common objects such as String shows that the GC in Java, and working with String types, is a bit faster.

- Java tends to do more aggressive inlining: by default, at least according to what is public about .Net and Java, it looks like .Net considers functions of up to 32 CIL instructions (which have very close semantics to Java bytecode) as candidates for inlining, while Java has a limit of 55 bytecodes. (I found the second value in a presentation about Java's JIT, and it was the default in the Java 8 timeframe; I am not sure whether either of these values can be changed.) This of course means that on a big enough project there are more opportunities for inlining

- Java lambdas are quicker by default than .Net lambdas: this holds against .Net, not against Mono (as far as the public presentations go). In Java, all single-method interfaces are compatible with lambda implementations, and if there is only one implementation in a given context and it is a small method, it is a candidate for inlining.

- Java has more optimizations which run when the code is hot enough. The latest revision of the .Net JIT does include some more interesting optimizations, like the option to use SIMD, but Java, for now, can do it automatically if the code is SIMD-able. This optimization, of course, requires more analysis time, but when it succeeds it can do wonders for performance. Similarly, small objects which are allocated in a small enough loop and do "not escape" are not allocated on the heap. I think escape analysis pays off more in a large project with many small intermediate objects

- Java offers a lot of GC customization by default: you can choose heaps of gigabytes with no GC call. This can work wonders for some classes of applications, and if you know how much is allocated per day, you can restart the application outside its critical hours, so that the GC never gets involved.

I could go through many cases where I know that Java has some optimization which .Net lacks (because of CHA, class hierarchy analysis, for instance), but the point is made.

So it looks to me that, as far as more complex, longer-running code is concerned, I can consistently get at least a 10% speedup in CPU-bound tasks. Why, then, do developers still consider .Net quicker than Java?

I have some explanations which could make sense:
- They don't compare the same things, or they compare on different abstraction levels:
If you compare minimalist SQL ORMs like Dapper (or Massive) with the full-blown Entity Framework, you will likely see a huge performance difference. Similarly, people write ArrayList<Integer> (which Java stores as a list of objects) and compare it with List<int> in .Net (which internally keeps raw integers in a contiguous array). I in fact wrote a minimalist library named FlatCollections, which reifies some classes in Java. I don't recommend it if you don't care that much about performance, but if you do, you may give it a try
- Java starts slower, so it feels slow. This happens because Java initially runs everything in an interpreter and only then compiles the hot code. This is also an important thing to take into account. If you compare full-blown applications, like a JavaFX one with a WPF one, the differences feel huge. But startup lag says nothing about steady-state performance; otherwise we would write every program today for MS-DOS, not for Windows/Linux/macOS, which boot in seconds only with an SSD. I made Fx2C, an OSS project which reduces JavaFX startup lag, if you are into optimizing startup time.
- I have the feeling that developers comparing platforms mistakenly compare different abstraction levels across them. This is a different point from the first one. Instead of comparing the leanest, closest-to-the-metal "abstraction" on each side, some code would use Java streams with IntStream (which does not allocate any intermediate objects) against Linq with Tuple (the Tuple<> types are defined as classes, generating a lot of heap pressure and GC). This can also be reversed, with List<int> (in .Net) vs ArrayList<Integer> (in Java).

Give feedback and I will be glad to answer all criticism and corrections.
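To make the boxed versus flat distinction from the list above concrete, here is a small sketch. The class and method names are made up for illustration; both methods compute the same sum, but the first stores every element as a separate Integer object on the heap, while the second keeps raw ints in one contiguous array, which is the layout List<int> gets in .Net. No timings are claimed here; measure on your own machine.

```java
import java.util.ArrayList;
import java.util.List;

class BoxedVsFlat {
    /** Sums 0..count-1 through a List<Integer>: each element is a boxed object. */
    static long sumBoxed(int count) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            values.add(i); // autoboxing: int -> Integer
        }
        long sum = 0;
        for (Integer value : values) {
            sum += value; // unboxing on every read
        }
        return sum;
    }

    /** Sums 0..count-1 through an int[]: one allocation, contiguous memory. */
    static long sumFlat(int count) {
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = i;
        }
        long sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("boxed: " + sumBoxed(1_000_000));
        System.out.println("flat:  " + sumFlat(1_000_000));
    }
}
```

A fair cross-platform comparison would put sumFlat next to List<int>, and sumBoxed next to a .Net list of boxed objects.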

Nota Bene. Some points about myself:
- I am not paid and I was not paid by Microsoft, Oracle and so on. In fact, as a full disclosure, I took part in an open Microsoft hackathon and won a small prize (a Bluetooth speaker) and, if I recall correctly, I was once passing by a Microsoft conference and they gave me a "stress ball". I have no animosity against Microsoft per se, except (maybe) that I like free software and open source. I think that as of today Microsoft works in a very friendly way with the OSS community, so nothing to complain about here
- I also have no interest in Oracle or any Java vendor (including Google) and, as far as that goes, I never received even a plastic ballpoint pen or anything of that sort
- I have opinions and biases but I try to be honest and direct about them
- I know that no comparison can be made without leaving out many other aspects of a technology. One of the most important, as I see it, is licensing. If you have a successful company and you want to scale your software, Java tools at least have higher individual license costs but virtually zero horizontal costs, whereas Microsoft seems to have a smaller cost per developer but higher costs if you scale up your software. This seems to be changing (with .Net Core), but as far as I understand, that is not finished software yet
- technologically, I think that C# is the better designed language, and the same goes for the CIL bytecode
- I have around 10 years of experience in the software industry, covering C++, .Net/C# and a little Java (at my current job).

Code for first example:

public class Program {
    boolean isPrime(int value, int[]divisor, int szDivisor){
        for(int i =0;i<szDivisor; i++){
            int div = divisor[i];
            if(div*div>value) {
                return true;
            }
            if(value%div == 0)
                return false;
        }
        return true;
    }

    int nthPrime(int nth){
        if(nth==1){
            return 2;
        }
        int[] foundPrimes = new int[nth];
        int primeCount = 1;
        foundPrimes[0] = 2;
        int prime = 3;
        while (true){
            if(isPrime(prime, foundPrimes, primeCount)){
                foundPrimes[primeCount] = prime;
                primeCount++;
            }

            if(primeCount==nth)
                break;

            prime+=2;

        }
        return foundPrimes[primeCount-1];
    }


    public static void main(String[] args) {
        Program p = new Program();
        // Warm-up call before the timed loop.
        p.nthPrime(500);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            System.out.println("The prime number is: " + p.nthPrime(150000));
        }
        long end = System.currentTimeMillis() - start;
        System.out.println("Time: " + end + " ms");
    }
}

Code for 2nd example:

public class TimeFormatter {
    private char[] digits = new char[7];
    private int _cursor;
    public String formattedTime(int currentPackgeTimeStamp) {
        int secondsPassed = currentPackgeTimeStamp / 1000;

        int minutesPassed = secondsPassed / 60;
        int seconds = secondsPassed % 60;
        int decimals = (currentPackgeTimeStamp % 1000) / 100;
        reset();
        push(minutesPassed / 10);
        push(minutesPassed % 10);
        pushChar(':');
        push(seconds / 10);
        push(seconds % 10);
        pushChar('.');
        push(decimals);

        return new String(digits);
    }

    private void reset() {
        _cursor = 0;
    }

    private void push(int i) {
        pushChar((char) ('0' + i));
    }

    private void pushChar(char c) {
        digits[_cursor] = c;
        _cursor++;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        int iterations = 100_000_000;
        TimeFormatter timeFormatter = new TimeFormatter();

        // Accumulate one character per result so the JIT cannot drop the call.
        int sum = 0;
        for (int i = 0; i < iterations; i++) {
            String t = timeFormatter.formattedTime(125400);
            sum += t.charAt(0);
        }

        long end = System.currentTimeMillis();

        long duration = end - start;
        System.out.println("Duration: " + duration + ", sum: " + sum);
    }
}
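For context, this is a sketch of the kind of String.format call that a hand-coded formatter like the one above would replace. The exact format pattern used in the original work code is not given in the post, so "%02d:%02d.%d" is an assumption, as are the class and method names. Both methods below produce the same text for timestamps under 100 minutes.

```java
import java.util.Locale;

class FormatterComparison {
    /** Library version; the pattern is an assumed equivalent, not the original one. */
    static String viaStringFormat(int timestampMillis) {
        int totalSeconds = timestampMillis / 1000;
        int tenths = (timestampMillis % 1000) / 100;
        return String.format(Locale.ROOT, "%02d:%02d.%d",
                totalSeconds / 60, totalSeconds % 60, tenths);
    }

    /** Hand-coded version: fills a fixed 7-character buffer digit by digit. */
    static String viaCharBuffer(int timestampMillis) {
        int totalSeconds = timestampMillis / 1000;
        int minutes = totalSeconds / 60;
        int seconds = totalSeconds % 60;
        int tenths = (timestampMillis % 1000) / 100;
        char[] out = {
            (char) ('0' + minutes / 10), (char) ('0' + minutes % 10), ':',
            (char) ('0' + seconds / 10), (char) ('0' + seconds % 10), '.',
            (char) ('0' + tenths)
        };
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(viaStringFormat(125400));
        System.out.println(viaCharBuffer(125400));
    }
}
```

The hand-coded version gives up generality (it breaks past 99 minutes) in exchange for skipping pattern parsing and varargs boxing on every call.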

Sunday, May 22, 2016

Flat Collections - May update

Flat Collections was upgraded again in small ways, mostly to simplify the design of the code generation.

Java has no equivalent of the "struct" keyword, and sometimes it is really useful to generate your own collections backed by arrays of primitives, where each item typically spans several primitive values. These flat collections can be really useful in most List<Point> kinds of scenarios, or for a List<Token> where a token is parse information that packs a few integers (tokenKind, startPos, endPos).

Also, if there is any Java developer who can help me review and package this library, write me a note (or contact me via email at "ciprian dot mustiata at gmail.com").
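As a rough illustration of the idea, here is a hand-written flat token list. This is not the FlatCollections API and not generated code; the class name and methods are hypothetical. Each token occupies three consecutive slots of a single int[], so adding a token allocates no per-token object.

```java
import java.util.Arrays;

class FlatTokenList {
    private static final int FIELDS = 3; // tokenKind, startPos, endPos

    private int[] data = new int[FIELDS * 16];
    private int size; // number of tokens, not number of ints

    /** Appends one token by writing its three fields next to each other. */
    void add(int tokenKind, int startPos, int endPos) {
        int offset = size * FIELDS;
        if (offset + FIELDS > data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[offset] = tokenKind;
        data[offset + 1] = startPos;
        data[offset + 2] = endPos;
        size++;
    }

    int size() { return size; }

    int tokenKind(int index) { return data[index * FIELDS]; }

    int startPos(int index) { return data[index * FIELDS + 1]; }

    int endPos(int index) { return data[index * FIELDS + 2]; }

    public static void main(String[] args) {
        FlatTokenList tokens = new FlatTokenList();
        tokens.add(1, 0, 5);
        tokens.add(2, 6, 9);
        System.out.println(tokens.size() + " tokens, second ends at " + tokens.endPos(1));
    }
}
```

Writing this by hand for every struct-like type gets tedious, which is the part a code generator can take over.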

Follow the code here:
https://github.com/ciplogic/FlatCollection

Monday, May 9, 2016

Quickest Java based CSV on Earth...

If you look around the internet, CSV parsing is a solved problem and it is really quick. You can parse a 120 MB CSV file in around 1 second (using 1 core). Take the file from this repository: https://github.com/uniVocity/worldcities-import

They have their own benchmark; on my machine the output is (after the JVM is warmed up):
Loop 5 - executing uniVocity CSV parser... took 1606 ms to read 3173959 rows.

But can you beat it with the help of FlatCollections? The answer is obviously yes, and not by a small amount, though keep in mind that the coding is a bit non-trivial.

How long would it take to sum the fourth column using a "miniCsv" library?

// A one-element array, so that the lambda can update the running total.
int[] sum = new int[1];

CsvScanner csvScanner = new CsvScanner();

try {
    // (char) 10 is the line feed character used as row separator.
    csvScanner.scanFile("worldcitiespop.txt", (char) 10, (state, rowBytes) -> {
        int regionInt = state.getInt(3, rowBytes);
        sum[0] += regionInt;
    });
} catch (IOException e) {
    e.printStackTrace();
}

This code sums the 4th column of this huge file, after the JVM is warmed up, in:
Time: 371 ms

So, really, if you have a small CSV with many integers to compute over (if you need other types supported, I will spend a little time handling more cases), I will be glad to speed it up; just reference me as the "original" coder.

The file has to be UTF-8, ASCII, Latin or a similar byte encoding, but not UTF-16 or UTF-32.

So, if you want to take a look at a specialized CSV parser and you see any possible improvements, please feel free to read, contribute and do whatever you want with it!
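The encoding restriction hints at how this kind of parser can be fast: it can work on the raw bytes of a row and never build a String. The sketch below shows that idea only; it is not the MiniCsv code, the class and method names are hypothetical, and it assumes comma-separated, unquoted fields in an ASCII-compatible encoding.

```java
import java.nio.charset.StandardCharsets;

class ByteRowInts {
    /**
     * Parses the decimal integer in the given zero-based column of a row.
     * Empty or non-numeric fields yield 0; quoted fields are not handled.
     */
    static int getInt(byte[] row, int length, int column) {
        int pos = 0;
        // Skip past 'column' commas to reach the start of the wanted field.
        for (int skipped = 0; skipped < column && pos < length; pos++) {
            if (row[pos] == ',') {
                skipped++;
            }
        }
        boolean negative = pos < length && row[pos] == '-';
        if (negative) {
            pos++;
        }
        int value = 0;
        // In ASCII-compatible encodings the digits are the bytes '0'..'9'.
        while (pos < length && row[pos] >= '0' && row[pos] <= '9') {
            value = value * 10 + (row[pos] - '0');
            pos++;
        }
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        byte[] row = "ad,aixas,Aixas,06,,42.48,1.46".getBytes(StandardCharsets.US_ASCII);
        System.out.println(getInt(row, row.length, 3));
    }
}
```

In UTF-16 or UTF-32 a comma or a digit is no longer a single byte, so byte-wise scanning like this would not work without decoding first.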

https://github.com/ciplogic/FlatCollection/tree/master/examples/MiniCsv

Bonus: there is no commercial licensing restriction (you can even sell the code, though it would be nice to be credited).

The idea of how to code it would not have been possible without my previous work experience and without great people who do this stuff for a living (I do it for passion), like Martin Thompson, Mike Barker and similarly open people. Also, I would not have heard of them without InfoQ.

Sunday, May 1, 2016

How Quick Can You Rotate a 4K (3840x2160) Image?

First of all, this mini-competition started at work from a classic flamewar claim: "C++ (native) languages are quicker than Java".

I could say "obviously they aren't", but we wanted to test it. So we picked a test where C++ is known to shine: pointer arithmetic. Rotating the pixels of an image is very friendly to C-style coding.

So, can you write your own implementation that rotates 4K images quicker than a Java implementation? First, though, some observations I made while testing a few implementations.

A reasonably quick Java implementation is this one, where the pixels of a 4K image are stored flat inside the src array and dest is a preallocated array of the same size (3840x2160):

public void rotate90(int[] src, int[] dest, int width, int height) {
    IntStream.range(0, height).forEach(y -> {
        int posSrc = y * width;
        int destPos = height - 1 - y;

        for (int x = 0; x < width; x++) {
            int srcPixel = getPixel(src, posSrc);

            setPixel(dest, destPos, srcPixel);
            posSrc++;
            destPos += height;
        }
    });
}

This implementation runs in around 110 milliseconds. It is also really useful because, with a single line of code changed, it runs on all cores:
IntStream.range(0, height).parallel().forEach(y -> {

This makes the code run in 33.7-37 ms.

One colleague from work (Mykolas) wrote this implementation:

public void rotate90Mykolas(int[] src, int[] dest, int width, int height) {
    for (int i = 0; i < src.length; i++) {
        dest[(i % width + 1) * height - (i / width + 1)] = src[i];
    }
}

Is it any slower or faster? Looking at the instructions, it should run slower since, instead of plain looping, there is more complex math (divisions and multiplications) per pixel. But in fact it runs faster than the single-core version: 100 ms.

At the time of writing this blog entry, this code has not been parallelized, but if I get a new entry, the post will be updated.

Can it be written quicker still?

It depends on the hardware, but in short the answer is yes:

This code is starved by memory accesses, so rotating the image in square blocks of 32x32 pixels is much quicker, as the data stays mostly in the CPU cache:

public static final int SIZE_CHUNK = 32;

static int calculateChunks(int size, int chunkSize) {
    return (size / chunkSize) + ((size % chunkSize == 0) ? 0 : 1);
}

private static void fillChunksSizes(int width, int chunkSize, int stepsX,
                                    int[] chunksPos, int[] chunksPosLength) {
    for (int it = 0; it < stepsX; it++) {
        chunksPos[it] = it * chunkSize;
        if (it != stepsX - 1) {
            chunksPosLength[it] = chunkSize;
        } else {
            // The last chunk may be shorter than chunkSize.
            int remainder = width % chunkSize;
            chunksPosLength[it] = remainder == 0 ? chunkSize : remainder;
        }
    }
}

public void rotate90Chunks(int[] src, int[] dest, int width, int height) {
    int chunkSize = SIZE_CHUNK;
    int stepsX = calculateChunks(width, chunkSize);
    int[] chunksPosX = new int[stepsX];
    int[] chunksPosXLength = new int[stepsX];
    fillChunksSizes(width, chunkSize, stepsX, chunksPosX, chunksPosXLength);

    int stepsY = calculateChunks(height, chunkSize);
    int[] chunksPosY = new int[stepsY];
    int[] chunksPosYLength = new int[stepsY];
    fillChunksSizes(height, chunkSize, stepsY, chunksPosY, chunksPosYLength);

    IntStream.range(0, chunksPosX.length).parallel().forEach(chunkXId -> {
        int startX = chunksPosX[chunkXId];
        int lengthX = chunksPosXLength[chunkXId];
        IntStream.range(0, chunksPosY.length).forEach(chunkYId -> {
            int startY = chunksPosY[chunkYId];
            int lengthY = chunksPosYLength[chunkYId];
            rotateChunkByIndex(src, dest, width, height, startX, lengthX, startY, lengthY);
        });
    });
}

This code runs on a Haswell CPU in 7.85 milliseconds on average (so it is around 4 times quicker than iterating over the loops "naively").

The quickest I could come up with rotates blocks which are exactly the chunk size of 32 with a specialized implementation. Compilers love constants, and love them more if they are powers of 2.

This sped the code up a little, but the code is bigger than the previous implementation, with some copy/paste of it; it runs in 7.2 ms.

So, this is it: you can rotate 9.1 images per second with a simple loop on a single thread, and if you use all the cores of an i7 laptop and take into account how the compiler optimizes and how CPU caching works, you can achieve 138.9 images per second in Java. 4K images.
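The listing above calls rotateChunkByIndex, which the post does not show. The following is a guess at what such a helper could look like, reusing the index math of the first implementation (the source pixel at (x, y) goes to dest[x * height + height - 1 - y]); it is not the author's code. The sequential driver rotateInChunks and the reference rotateNaive exist only so the sketch can be checked on a small image.

```java
import java.util.Arrays;

class RotateChunkSketch {
    /** Rotates one rectangular block of src into its place in dest (90 degrees clockwise). */
    static void rotateChunkByIndex(int[] src, int[] dest, int width, int height,
                                   int startX, int lengthX, int startY, int lengthY) {
        for (int y = startY; y < startY + lengthY; y++) {
            int srcPos = y * width + startX;
            int destPos = startX * height + (height - 1 - y);
            for (int x = 0; x < lengthX; x++) {
                dest[destPos] = src[srcPos];
                srcPos++;          // next pixel in the source row
                destPos += height; // next row in the rotated image
            }
        }
    }

    /** Sequential driver: walks the image block by block, clipping at the edges. */
    static int[] rotateInChunks(int[] src, int width, int height, int chunk) {
        int[] dest = new int[src.length];
        for (int startY = 0; startY < height; startY += chunk) {
            for (int startX = 0; startX < width; startX += chunk) {
                rotateChunkByIndex(src, dest, width, height,
                        startX, Math.min(chunk, width - startX),
                        startY, Math.min(chunk, height - startY));
            }
        }
        return dest;
    }

    /** Reference result, using the one-line formula from Mykolas' version. */
    static int[] rotateNaive(int[] src, int width, int height) {
        int[] dest = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dest[(i % width + 1) * height - (i / width + 1)] = src[i];
        }
        return dest;
    }

    public static void main(String[] args) {
        int width = 70, height = 45;
        int[] src = new int[width * height];
        for (int i = 0; i < src.length; i++) {
            src[i] = i;
        }
        boolean same = Arrays.equals(rotateNaive(src, width, height),
                                     rotateInChunks(src, width, height, 32));
        System.out.println("chunked result matches naive: " + same);
    }
}
```

Within one block, both the reads and the writes touch a small window of memory, which is what lets the data stay in the cache.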

This is 4 GB/s image processing.
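The arithmetic behind these figures can be checked with a few lines. This sketch assumes 4 bytes per pixel (one int per pixel, as in the arrays above), decimal gigabytes, and counts only the bytes of the source image, not the bytes written to the destination.

```java
class RotationThroughput {
    /** Images per second for a given time per image in milliseconds. */
    static double imagesPerSecond(double millisPerImage) {
        return 1000.0 / millisPerImage;
    }

    /** Source bytes processed per second, in GB (10^9 bytes), at 4 bytes per pixel. */
    static double gigabytesPerSecond(int width, int height, double millisPerImage) {
        long bytesPerImage = (long) width * height * 4;
        return bytesPerImage * imagesPerSecond(millisPerImage) / 1e9;
    }

    public static void main(String[] args) {
        System.out.printf("single loop: %.1f images/s%n", imagesPerSecond(110));
        System.out.printf("chunked:     %.1f images/s%n", imagesPerSecond(7.2));
        System.out.printf("chunked:     %.2f GB/s%n", gigabytesPerSecond(3840, 2160, 7.2));
    }
}
```

A 3840x2160 image at 4 bytes per pixel is about 33.2 MB, so 7.2 ms per image works out to a bit over 4.6 GB/s, in line with the "4 GB/s" figure.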

But there is one more thing. This code works very nicely on CPUs which hide the cost of divisions, with many supported SIMD instructions, on a high-end machine. How does it work on a low-end machine (similar to a phone CPU, iPhone included)?

I took the code and ran it on an Intel(R) Celeron(R) N2930 @ 1.83GHz CPU (which is an out-of-order, 4-core, Pentium-class CPU).

The numbers changed totally:
Single threaded rotate image: 119.86 ms.
Multithreaded first test: 44.44 ms.
Mykolas implementation: 265 ms.
Chunks: 38.4 ms.
Chunks64: 27.1 ms.

Some observations: moving the single-threaded code from an i7-4710HQ at 2.5 GHz to a Bay Trail CPU, the speed decreased by less than 10%. Even comparing a 4-core Bay Trail with a 4-core+HT i7M, if your software is memory starved, your code will run at roughly the same speed.

Mykolas' implementation got over 2.5 times slower, because complex math is expensive on Atom-based CPUs. Prefer multiplications over divisions on lower-spec CPUs.

The chunks implementation is also very interesting: when your code is math intensive but fits into the cache, the Atom CPU is roughly 4x slower than an i7M (and I think even more compared with newer CPUs like Skylake).

So, can you make a quicker 4K image rotation than 7.2 ms (on a quad-core i7M CPU; so, more than 4 GB/s of pixel processing)? On request I will provide the full source code of the fastest implementation (which is very similar to the Chunks implementation, just longer). Can you process more than 1.1 GB/s of pixels on an Atom-based quad-core?

Happy coding!

Tuesday, April 26, 2016

Fx2C Updates - handling loading Fxml 3D objects

The Fxml to Java compiler speeds up the display of controls on low-spec machines. One very nice contributor fixed the support for adding CSS styles, which I had never tested, and along the way I noticed that some other edge cases were not supported.

The main use case is this one: you want to use Fxml to import JavaFX 3D objects, which requires the inner text of an XML tag to be handled separately. For example, this is a valid Fxml file:

<?xml version="1.0" encoding="utf-8"?>
<?import javafx.scene.paint.Color?>
<?import javafx.scene.paint.PhongMaterial?>
<?import javafx.scene.shape.MeshView?>
<?import javafx.scene.shape.TriangleMesh?>
<MeshView id="Pyramid">
  <material>
    <PhongMaterial>
      <diffuseColor>
        <Color red="0.3" green="0.6" blue="0.9" opacity="1.0"/>
      </diffuseColor>
    </PhongMaterial>
  </material>
  <mesh>
    <TriangleMesh>
      <points>0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 -1.0 -1.0 1.0 0.0 0.0 -1.0 0.0</points>
      <texCoords>0.0 0.0</texCoords>
      <faces>0 0 4 0 1 0 1 0 4 0 2 0 2 0 4 0 3 0 3 0 4 0 0 0 0 0 1 0 2 0 0 0 2 0 3 0</faces>
      <faceSmoothingGroups>1 2 4 8 16 16</faceSmoothingGroups>
    </TriangleMesh>
  </mesh>
</MeshView>

This file is definitely valid Fxml, but the Fx2C compiler was not able to handle it, because its nodes contain inner text.

If you want more samples and importers for multiple 3D formats (like STL or Collada), follow this link:
http://www.interactivemesh.org/models/jfx3dbrowser.html

Fx2C now handles inner text, so for the previous Fxml file the compiler will export the following code, which is close to the fastest way to define a MeshView:

public final class FxPyramid {
   public MeshView _view;
   public FxPyramid() {
      MeshView ctrl_1 = new MeshView();
      ctrl_1.setId("Pyramid");
      PhongMaterial ctrl_2 = new PhongMaterial();
      Color ctrl_3 = new Color(0.3, 0.6, 0.9, 1.0);
      ctrl_2.setDiffuseColor(ctrl_3);
      ctrl_1.setMaterial(ctrl_2);
      TriangleMesh ctrl_4 = new TriangleMesh();
      ctrl_4.getPoints().setAll(0.0f, 1.0f, 1.0f, 1.0f, 1.0f, 0.0f, 0.0f, 1.0f, -1.0f, -1.0f, 1.0f, 0.0f, 0.0f, -1.0f, 0.0f);
      ctrl_4.getTexCoords().setAll(0.0f, 0.0f);
      ctrl_4.getFaces().setAll(0, 0, 4, 0, 1, 0, 1, 0, 4, 0, 2, 0, 2, 0, 4, 0, 3, 0, 3, 0, 4, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 2, 0, 3, 0);
      ctrl_4.getFaceSmoothingGroups().setAll(1, 2, 4, 8, 16, 16);
      ctrl_1.setMesh(ctrl_4);
      _view = ctrl_1;
   }
}
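
The step from inner text to the setAll arguments above can be sketched as follows. This is only a minimal illustration in plain Java, not Fx2C's actual code: the class InnerTextParser and its method names are hypothetical, and it assumes the values are separated by whitespace, as in the sample file.

```java
final class InnerTextParser {
    /** Splits whitespace-separated inner text such as "0.0 1.0 1.0" into floats. */
    static float[] parseFloats(String innerText) {
        String trimmed = innerText.trim();
        if (trimmed.isEmpty()) return new float[0];
        String[] tokens = trimmed.split("\\s+");
        float[] values = new float[tokens.length];
        for (int i = 0; i < tokens.length; i++) values[i] = Float.parseFloat(tokens[i]);
        return values;
    }

    /** Same idea for integer lists such as <faces> or <faceSmoothingGroups>. */
    static int[] parseInts(String innerText) {
        String trimmed = innerText.trim();
        if (trimmed.isEmpty()) return new int[0];
        String[] tokens = trimmed.split("\\s+");
        int[] values = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) values[i] = Integer.parseInt(tokens[i]);
        return values;
    }

    /** Renders the values as a Java argument list, e.g. "0.0f, 1.0f". */
    static String toFloatArguments(float[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append(values[i]).append('f');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        float[] texCoords = parseFloats("0.0 0.0");
        // Emits one line of generated source in the style of the export above.
        System.out.println("ctrl_4.getTexCoords().setAll(" + toFloatArguments(texCoords) + ");");
    }
}
```

Doing this conversion at compile time means the generated class passes literal values to setAll, so no text parsing is left for runtime.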

Tuesday, March 15, 2016

Reifying DSL for FlatCompiler

An important part of FlatCollections is that, at least memory-wise, code needs to be rewritten only a little to get a big speedup and fewer GCs. But as always there were tradeoffs. One of them is that the generator itself was hardcoded to produce a ListOf<T> (with a reified name like ListOfPoint3D) and a cursor over this list.

This is all great, but what if the ListOf<T> should contain an extra method? Or what if there is a need to generate an extra method for every getter/setter? For this reason there is a simple (I hope) template generator with reified semantics. For now it works only for classes, but it is really important.

To define a flat type, you would write something like:

flat Point3D {
  X, Y, Z: double
}

The generated code will be aware of these fields, which are filled in later.

The code generator is driven by a template of the following form:

each fieldNames : fieldName, index {
    sub set@fieldName (TValue value) {
        _list.set(_offset+ index,  value);
    }

    sub get@fieldName (): TValue {
        return _list.get(_offset+index)
    }
}

Sure, the code looks a bit strange, but it does the job most of the way. Items such as TValue are resolved semantically:

class FlatCursor<T> {
    where {
        TValue = T.valueType
        countFields=T.countFields
        fieldNames = T.fieldNames
    }

(...) //class content

But the resolution happens because of a bit of semantic magic:

specialize ListOf { Point3D }
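
To make the template more concrete, here is a rough Java sketch of what expanding the each block for a given list of field names could produce. It is not the FlatCompiler implementation: AccessorTemplate and expand are hypothetical names, and the shape of the emitted Java methods is an assumption about what the generated accessors might look like.

```java
import java.util.Arrays;
import java.util.List;

final class AccessorTemplate {
    /**
     * Emits one setter and one getter per field. The loop index plays the
     * role of "index" in the template, valueType the role of TValue.
     */
    static String expand(List<String> fieldNames, String valueType) {
        StringBuilder out = new StringBuilder();
        for (int index = 0; index < fieldNames.size(); index++) {
            String fieldName = fieldNames.get(index);
            out.append("public void set").append(fieldName)
               .append('(').append(valueType).append(" value) { ")
               .append("_list.set(_offset + ").append(index).append(", value); }\n");
            out.append("public ").append(valueType).append(" get").append(fieldName)
               .append("() { ")
               .append("return _list.get(_offset + ").append(index).append("); }\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Field names and type taken from the "flat Point3D" definition above.
        System.out.print(expand(Arrays.asList("X", "Y", "Z"), "double"));
    }
}
```

Because the field index is baked into the emitted text, each accessor ends up as a constant-offset read or write on the backing list.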

I would love to improve it more in the future, but mileage may vary. The most important part is that soon the reification can work fairly smartly, and the more logic I add to this mini-compiler, the more constructs may be supported and the more bugs found.

Read the latest code in the GitHub project:
https://github.com/ciplogic/FlatCollection

Friday, March 4, 2016

Is the acquisition of Xamarin useful for the typical C# developer?

TL;DR: In my view, in the short term: yes; in the long term: no way!

The Xamarin/Mono stack has missed many features of Microsoft's .Net stack and will always suffer if there is no economic force behind it. Xamarin Studio, for example, is in a painful state: bugs are fixed slowly, and the recommended IDE is still Visual Studio, although you can also use Xamarin Studio for various purposes. It is stuck with Gtk# 2.x (though very nicely styled) and with an underlying framework that is unknown to most developers (Xwt).

Xamarin bought RoboVM, which means that whether you were a C# or a Java developer, if you wanted to target iOS you probably had to rely on Xamarin (now Microsoft).

My perspective on the medium term for the Mono platform: Mono will be a less ambiguous target and more bugs will be addressed, just by having one implementation across .Net, CoreCLR and Mono. Another good thing: I would assume that in the future CoreCLR on Linux will be merged with Mono, either by migrating the GC of CoreCLR (which is more advanced than whatever Mono had) or by migrating the debugging infrastructure from Mono to CoreCLR. This means that if you target Asp.Net Core 1.0+, you will definitely benefit from the platform correctness and a better experience deploying to Linux.

Another good part on the tooling side is simply that Microsoft .Net, as a merged platform, will work directly on iOS, maybe with lower license costs.

But for me this covers just the next 1 to 2 years. After that, I would assume that some things will turn more negative for non-Microsoft platforms:
- support may be delayed and slowed down, in particular because .Net support will need to be extended to most of Mono's targets, CPU architectures and so on
- no competition, not even partnering competition (as in the Java/OpenJDK ecosystem), will mean that IDE options (SharpDevelop is basically discontinued, maybe Xamarin Studio will be discontinued too) will basically come from two vendors: one with full integration with various frameworks (Microsoft) and one very well integrated for code editing (JetBrains). Both of them may cost money, so I would assume this will not be very startup friendly
- having close to a monopoly, with a single vendor implementing your runtime, is a kind of kill switch against targeting .Net in your next project, unless you are Microsoft or you already have a big investment in .Net technology

Thursday, February 18, 2016

Question: "Does Java run faster than C and C++ today?"

As I was writing this allocation free parser, I ported the code (90%, in the sense that I did not use smart pointers) to C++, hoping that the lack of bounds checking or other hidden wins would show.

The single problem is that C++ is very tricky to optimize. I tried my best: I did not use any bounds checking (so I skipped the STL altogether), I passed everything that was not an integer but a data buffer as a const reference, as far as I understood it, and so on. So I did all the low-level optimizations I knew, and the code had the same level of abstraction as the Java one. For very curious people, and if requested, I will be glad to share it as a zipped file (the code leaks memory, but the loop itself executes with zero memory allocation, exactly like Java).

But the biggest bummer for C++ is that it ran slower than Java.

Most of the time the Java code would achieve a bit more than 800 iterations per second, rarely 900, and rarely something like 770 (there are fluctuations because of the CPU's Turbo, which is very aggressive on a laptop: it has a stated 2.5 GHz but operates at 3.5 GHz when using 1 core). With C++ I could iterate over all of QuickFix's test suite in the range of 700 to 800 iterations. This happened with MinGW GCC 4.9 (32 bit) with -Ofast -flto (so far the fastest configuration). The part where C++ wins hands down compared with Java is memory usage: the C++ implementation used just a bit over 5 MB, while the Java implementation used 60 MB. So there are differences, but still, Java was running visibly faster. I also tried GCC on Ubuntu, but Ubuntu uses GCC 4.8 (64 bit), and at least this code does not seem to optimize well there: I get just 440 iterations.

But you know what? The Java code was really straightforward, with no configuration or runtime optimization settings. Everything was just running faster. There is not even a debug/release configuration. Java runs as quickly (roughly equivalent to GCC -O3) up to the point where it hits a breakpoint. If you hit a breakpoint, it will go back to interpreter mode.

Even if it seems kind of stupid, I think I can draw some conclusions from this. If it is possible in many situations for Java to run this smoothly, an office suite like, let's say, LibreOffice would be better off being gradually rewritten in Java, instead of having Java removed because it starts a bit slower. I could imagine a hypothetical future where first the dialogs and later the canvas were JavaFX, and it would work on almost all platforms where JavaFX runs, including but not limited to iPhone (it would require RoboVM though, which today is proprietary) and Android (Gluon), and it would have support for common databases (because of JDBC, which is very widely supported) to fill data into the "Excel" (tm) component of the suite.

Lastly, let's not forget the tooling and build times. Java compilation takes really a fraction of the time; most of the build time is copying Jars.

But as it is, if you think you have high volume and you require high throughput from your program, try Java; you may really break records.

Tuesday, February 16, 2016

Scanning FIX at 3 Gbps

Have you heard about the FIX protocol? It is a financial exchange protocol. It is used extensively as a de facto format in many areas, and the format itself is basically many dictionary-like key-value pairs.

So, can you make a quick parser to process FIX files? I wrote a mini FIX parser in Java that uses FlatCollections for tokenizing, and the final numbers are really great. But let's clear the ground: most of the ideas are in fact not mine; they are based on talks about "Mechanical Sympathy" (I recommend the presentations of Martin Thompson), meaning that if you understand the hardware (or at least the compilers and their internal costs) you can achieve really high numbers.

So I looked at the QuickFix library, a standard, open source (and complete) implementation of the FIX protocol. It has some problems in how its code runs, so I took only its FIX sample files: around 450 files, together 475 KB of ASCII. I set up my internal benchmark as follows: considering that I have the files in memory, how quickly can I parse them and give every tag to the user, with enough information to recreate the data? As the code for one file should be really quick (if there is no allocation in file row splitting, which I had already achieved), I made the following "benchmark": how many times per second can I iterate over these files (already held in memory), split them into rows and tokenize them? The short answer: between 700 and 895 iterations (using one core of an Intel Core i7-4710HQ CPU @ 2.50GHz). The variation, I think, is related to the CPU's Turbo. I am not aware of any hidden allocations in the code (so it is allocation free). With a few allocations (as there were before using FlatCollections) you land in the range of 500-700 iterations (or around 2.5 Gbps processing speed).

So, if you have (on average) 800 iterations per second, you can parse around 380 MB/s of FIX messages (800 x 475 KB, or around 3 Gbps) using just one core of one laptop using Java (Java 8u61/Windows). If you want another statistic: most messages are a few tens of bytes, so it is safe to assume that this parsing code scans 20 million messages/second.

I don't endorse switching from QuickFix to this minimal FIX implementation, but who knows: if you need a good starting point (and who knows, support ;) ) for writing a very fast FIX parser, this is a good place to start.
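
As an illustration of the measuring approach described above, here is a minimal sketch of an allocation-free tag=value scan plus a "passes per second" loop. It is not the MiniFix code: FixScanSketch, Tokens and all method names are hypothetical, the sample message is made up, and it assumes fields are delimited by the standard SOH byte (0x01) and that 475 KB means 475,000 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class FixScanSketch {
    static final byte SOH = 0x01; // standard FIX field delimiter

    /** Reusable token storage: three ints per field (tag, value offset, value length). */
    static final class Tokens {
        private int[] data = new int[3 * 64];
        int count;

        void clear() { count = 0; } // keeps the backing array, so nothing is allocated

        void add(int tag, int offset, int length) {
            int base = count * 3;
            if (base + 3 > data.length) data = Arrays.copyOf(data, data.length * 2);
            data[base] = tag;
            data[base + 1] = offset;
            data[base + 2] = length;
            count++;
        }

        int tag(int i) { return data[i * 3]; }
        int valueOffset(int i) { return data[i * 3 + 1]; }
        int valueLength(int i) { return data[i * 3 + 2]; }
    }

    /** Splits msg[start, end) into tag=value fields without creating Strings. */
    static void tokenize(byte[] msg, int start, int end, Tokens out) {
        out.clear();
        int pos = start;
        while (pos < end) {
            int tag = 0;
            while (pos < end && msg[pos] != '=') {
                tag = tag * 10 + (msg[pos] - '0'); // tags are decimal ASCII digits
                pos++;
            }
            if (pos >= end) break; // trailing bytes without '=': ignored
            int valueStart = ++pos; // skip '='
            while (pos < end && msg[pos] != SOH) pos++;
            out.add(tag, valueStart, pos - valueStart);
            pos++; // skip the delimiter
        }
    }

    static long sink; // consumed result, so the JIT cannot drop the loop as dead code

    /** Counts how many full passes over all messages fit into the given time. */
    static int iterationsWithin(byte[][] messages, long durationNanos) {
        Tokens tokens = new Tokens();
        long deadline = System.nanoTime() + durationNanos;
        int iterations = 0;
        while (System.nanoTime() < deadline) {
            for (byte[] m : messages) {
                tokenize(m, 0, m.length, tokens);
                sink += tokens.count;
            }
            iterations++;
        }
        return iterations;
    }

    /** Converts passes per second over a corpus into decimal megabytes per second. */
    static double megabytesPerSecond(double iterationsPerSecond, long corpusBytes) {
        return iterationsPerSecond * corpusBytes / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Made-up sample message; '|' stands in for SOH to keep the literal readable.
        byte[] msg = "8=FIX.4.2|9=5|35=0|10=161|".replace('|', (char) SOH)
                .getBytes(StandardCharsets.US_ASCII);
        int perSecond = iterationsWithin(new byte[][] { msg }, 1_000_000_000L);
        System.out.println(perSecond + " iterations/s over " + msg.length + " bytes");
        // The arithmetic from the text: 800 passes per second over 475 KB.
        double mbps = megabytesPerSecond(800, 475_000);
        System.out.println(mbps + " MB/s = " + (mbps * 8 / 1000) + " Gbit/s");
    }
}
```

The tokens only record offsets into the original byte array, so a caller that needs a value as text decides when (and whether) to pay for creating a String.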

So, if you want to look inside the implementation:
https://github.com/ciplogic/FlatCollection/tree/master/examples/MiniFix

Saturday, February 13, 2016

Java's Flat Collections - what's the deal? (Part II)

I thought about cases where people would want to use flat collections. The most obvious ones are for example a "point array" or a "Tuple array", but thinking more I found some interesting cases which are also fairly common: "rectangle", "triangle" or similar constructs.

Typically, when people define a circle for instance, they would build it as:
class Circle {
    Point2f center = new Point2f();
    float radius;
}

Maybe without noticing, if you store one hundred circles on a 32-bit machine, you in fact store much more data than center.x, center.y and radius at 4 bytes each = 12 bytes per circle, which for 100 circles is 1.2 KB (more or less). It is more like:
- 100 entries for the reference table: 400 bytes
- 100 headers of Circle objects: 800 bytes
- 100 references to Circle.center (Point2f): 400 bytes
- 100 headers of Circle.center (Point2f): 800 bytes
- 100 x 3 floats: 1200 bytes

So instead of your payload of 1.2 KB, you are at 3.6 KB; flattening gives a 3x reduction in memory usage.

If you have 100 Line instances which each hold 2 instances of Point2f, instead of 1600 B you will have: (refTable) 400 + (object headers) 2400 + (references to the internal points) 800 + (payload) 1600 = 5200 B, so flattening is a 3.25x memory reduction.
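
The two totals above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic: ObjectOverheadMath is a hypothetical helper, and the 8-byte object header and 4-byte reference are the sizes assumed in the text for a 32-bit machine (real JVMs differ in header size, alignment and padding).

```java
final class ObjectOverheadMath {
    static final int REFERENCE_BYTES = 4; // 32-bit reference, as assumed in the text
    static final int HEADER_BYTES = 8;    // object header size assumed in the text
    static final int FLOAT_BYTES = 4;

    /**
     * Bytes used by `count` objects stored in a reference array, where each
     * object owns `nestedObjects` child objects and `floats` floats in total.
     */
    static int objectGraphBytes(int count, int nestedObjects, int floats) {
        int refTable = count * REFERENCE_BYTES;
        int headers = count * (1 + nestedObjects) * HEADER_BYTES;
        int innerRefs = count * nestedObjects * REFERENCE_BYTES;
        int payload = count * floats * FLOAT_BYTES;
        return refTable + headers + innerRefs + payload;
    }

    /** Bytes used when only the float payload is stored in one flat array. */
    static int flatBytes(int count, int floats) {
        return count * floats * FLOAT_BYTES;
    }

    public static void main(String[] args) {
        // Circle: 1 nested Point2f, 3 floats. Line: 2 nested Point2f, 4 floats.
        System.out.println("Circle: " + objectGraphBytes(100, 1, 3) + " vs " + flatBytes(100, 3));
        System.out.println("Line:   " + objectGraphBytes(100, 2, 4) + " vs " + flatBytes(100, 4));
    }
}
```

The ratio grows with the number of nested objects, because every nested object adds both a header and a reference while the payload per float stays the same.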

A simple benchmark shows that flattening saves not only memory but also time. If you use Line (with its 2 internal points) and populate flat collections instead of plain Java objects, you will get the following numbers:

Setup values 1:12983 ms.
Read time 1:5086 ms. sum: 2085865984

If you use Java objects, you will see a big slowdown on both reading and writing:
Setup values 2:62346 ms.
Read time 2:18781 ms. sum: 2085865984

So, by flattening most types you will get more than a 4x speedup on write (populating the collections: 62346 / 12983 is about 4.8x) and more than a 3x speedup on read (18781 / 5086 is about 3.7x).

The last improvement? Reflection works, but sometimes it is ugly to create a type, reflect over it and only then use it in this code generator of flat types. So right now the input config is entirely JSON based, and you can create your own "layouts" (meaning "flat objects") on the fly:

{
  "typeName": "Point3D",
  "fields": ["X", "Y", "Z"],
  "fieldType": "float"
}

This config creates a flat class Point3D with 3 floats in it, named X, Y, Z (meaning the cursor will have "getX/setX" and so on).

The input file of the code generator is named flatcfg.json.
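
To give an idea of what such a layout amounts to, here is a hand-written sketch of a flat Point3D list with a cursor. It is an assumption about the general shape, not the generator's real output: the class body, the Cursor API (moveTo) and the growth policy are all hypothetical.

```java
import java.util.Arrays;

final class ListOfPoint3D {
    private static final int FIELDS = 3; // X, Y, Z from the "fields" entry
    private float[] data = new float[FIELDS * 16];
    private int size;

    int size() { return size; }

    /** Appends one point and returns its index. */
    int add(float x, float y, float z) {
        int base = size * FIELDS;
        if (base + FIELDS > data.length) data = Arrays.copyOf(data, data.length * 2);
        data[base] = x;
        data[base + 1] = y;
        data[base + 2] = z;
        return size++;
    }

    Cursor cursor() { return new Cursor(); }

    /** A movable view over one point; no per-point object is ever allocated. */
    final class Cursor {
        private int offset;

        Cursor moveTo(int index) {
            if (index < 0 || index >= size) throw new IndexOutOfBoundsException("index " + index);
            offset = index * FIELDS;
            return this;
        }

        float getX() { return data[offset]; }
        float getY() { return data[offset + 1]; }
        float getZ() { return data[offset + 2]; }
        void setX(float value) { data[offset] = value; }
        void setY(float value) { data[offset + 1] = value; }
        void setZ(float value) { data[offset + 2] = value; }
    }

    public static void main(String[] args) {
        ListOfPoint3D points = new ListOfPoint3D();
        for (int i = 0; i < 100; i++) points.add(i, 2 * i, 3 * i);
        Cursor cursor = points.cursor(); // one cursor reused for the whole loop
        float sumZ = 0;
        for (int i = 0; i < points.size(); i++) sumZ += cursor.moveTo(i).getZ();
        System.out.println(sumZ);
    }
}
```

A single cursor is reused for every element, which is where the saving comes from: the list holds one float[] and the loop allocates nothing.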

Wednesday, January 27, 2016

Java's Flat Collections - what's the deal? (Part I)

I've been moving to a Java environment at my job, and as my previous interest in performance is still there, I wondered why Java sometimes runs slowly (even in the typical case, after you wait for your code to warm up), and I looked at some solutions that exist around Java's ecosystem.

There are 3 solutions for now which are competing to give high performance Java code, by allowing either faster code or better OS integration:

- PackedObjects, a way to remove object headers from object collections. Sadly, for now it works only with IBM JVMs. It should be used primarily by JNI-like code, to speed it up by removing individual copying. It requires medium changes in the compiler and the garbage collectors, but no language changes (or minimal ones)

- ObjectLayout, a way to give hints to the JVM to allocate arrays contiguously in a structured manner, which a JVM may implement. It requires GC changes and very few annotations, but no language changes

- Array 2.0 (or Project Panama) is the project which basically plans to bring .Net's struct type to Java. This is the most extensive of all because it has to change bytecodes, compiler internals and the GC

So, I'm here to present a new solution which I found handy. It is at a very early stage and requires no language changes (still, to take advantage of it, you need a few code changes of your own). It should work with any Java at least newer than 5.0 (maybe Java 1.2, but I'm not 100% sure), and if it does not fully work somewhere, it will be very easy to patch.

Enter FlatCollection, a code generator which flattens your collections and can make it very easy to work with high(er) performance code for many common cases.

How do you use it:
- you find any types whose fields all have the same type (for now I think the code supports only primitive types, as the fully working prototype works with a Point with x, y integer fields, but very likely by the time you read this it will work as a generator for any field type)
- you add all the types, with full namespace, inside the input.flat file
- you run the project to create two flat classes out of each: an ArrayListOfYourType and a CursorOfYourType
- you copy all these files into the package you add to your project: "flatcollections"

Look inside RunBench.java for a "realistic" benchmark which runs the same code using an array of Point and using this mapped ArrayList.

In this kind of real-life code, the memory consumption of the flat collection is around half that of a full array of points, and both populating it and reading it are at least 2x-4x faster.

How does it work: it merges all fields into one contiguous array of "primitive" types, which basically removes one indirection and many allocation costs.

In the future I will extend the samples to show parsing of CSV files and similar operations. If you reuse the collection by calling .clear(), no reallocation is needed, unless the new "per-row" data needs more memory than the previous runs allocated.
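
The idea of merging fields into one primitive array, and of reusing it through .clear(), can be sketched like this. It is a simplified stand-in rather than the generated code or RunBench.java: FlatPointsSketch, its nested classes and the capacity() helper are hypothetical, and the accessors skip bounds checks against size for brevity.

```java
import java.util.Arrays;

final class FlatPointsSketch {
    /** Conventional layout: one heap object per point. */
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    /** Flat layout: x and y interleaved in one int[] (x0, y0, x1, y1, ...). */
    static final class ArrayListOfPoint {
        private int[] data = new int[2 * 16];
        private int size;

        void add(int x, int y) {
            if (2 * size + 2 > data.length) data = Arrays.copyOf(data, data.length * 2);
            data[2 * size] = x;
            data[2 * size + 1] = y;
            size++;
        }

        int getX(int i) { return data[2 * i]; }
        int getY(int i) { return data[2 * i + 1]; }
        int size() { return size; }
        int capacity() { return data.length / 2; }
        void clear() { size = 0; } // the backing array is kept for reuse
    }

    static long sumObjects(Point[] points) {
        long sum = 0;
        for (Point p : points) sum += p.x + p.y; // one pointer chase per element
        return sum;
    }

    static long sumFlat(ArrayListOfPoint points) {
        long sum = 0;
        for (int i = 0; i < points.size(); i++) sum += points.getX(i) + points.getY(i);
        return sum;
    }

    public static void main(String[] args) {
        int n = 1000;
        Point[] objects = new Point[n];
        ArrayListOfPoint flat = new ArrayListOfPoint();
        for (int i = 0; i < n; i++) {
            objects[i] = new Point(i, -2 * i);
            flat.add(i, -2 * i);
        }
        System.out.println("same sum: " + (sumObjects(objects) == sumFlat(flat)));

        int capacityBefore = flat.capacity();
        flat.clear();
        for (int i = 0; i < n; i++) flat.add(i, i); // refill: fits in the kept array
        System.out.println("no regrowth: " + (flat.capacity() == capacityBefore));
    }
}
```

Both layouts hold the same values; the flat one simply reads them from consecutive array slots instead of following a reference per point.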

Why is it important to flatten the data layout? Basically, you can reduce GC hits, or you can naturally map code that was ugly otherwise: let's say, having a Tuple class. Also, the full GC cost (which involves visiting all the small objects in your heap) is very low for these collections. So I would assume that, at least for batch processing or maybe games written in Java, it could be a really friendly tool of the trade.

What should be done:
- it should be tested to work with collections of any type and to support specializations
- it should work with POJOs whose data is not exposed as fields, including Tuple classes
- not mandatory, but it should support iterators or other friendly structures
