Sunday, September 20, 2015

Optimizations on bigger data sets

Around one year ago I started jHeroes2, intended to be a port of FHeroes2 from C++ to Java. As I had the code around, I also tried to port part of it to C#. The ugly part was that with a standard toolkit (I mean JavaFX, or WPF in C#) the drawing is fairly slow. It is faster than, say, Windows Forms at displaying many bitmaps, but it is still fairly slow (or I simply don't know how to optimize the code for either toolkit).

One interesting part, though, is that I learned some of the tricks the original Heroes 2 team used when they built their "one big archive with all the content" file. They wrote a full index, and after it the file content is stored using a simple "painting" algorithm. That painting algorithm is really nebulous, and I'm impressed that the FHeroes2 guys managed to decode all the pictures from it.

Based on this, and also on my experience at work, I thought it would be handy to take all the pictures from the main Heroes 2 AGG file, decode them and (re)compress them as full bitmap data. The result is a zip containing every picture inside Heroes 2 that could be decoded, repacked. I did not use indexed colors (even though they would reduce the bitmap size) and I did not save the images as native bitmap files, because I wanted to check a few things I will elaborate on in this entry.

So what the compressor does (a sketch follows after the list):
- iterate over all pictures in the heroes2.agg file (a 43 MB file)
- extract each one in memory as a bitmap
- save every bitmap either in a text format, where every pixel is a hex value, or as a binary integer array
- compress every text/byte array into its own zip entry
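
Roughly, the compressor loop looks like the following sketch. The Sprite interface here is a hypothetical stand-in for whatever the AGG reader exposes, and the class and file names are mine; the real jHeroes2 code is organized differently.

import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public final class AggRepacker {

    // Minimal placeholder for one decoded picture; the real AGG reader is more involved.
    public interface Sprite {
        String name();
        int width();
        int height();
        int[] argbPixels(); // already "painted" into a full ARGB bitmap
    }

    // Writes every decoded sprite as its own deflated zip entry (the binary variant).
    public static void repack(List<Sprite> sprites, String zipPath) throws Exception {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(zipPath))) {
            for (Sprite sprite : sprites) {
                int[] pixels = sprite.argbPixels();
                ByteBuffer buf = ByteBuffer.allocate(8 + pixels.length * 4);
                buf.putInt(sprite.width()).putInt(sprite.height()); // small header first
                for (int px : pixels) {
                    buf.putInt(px); // the text variant would write hex tokens instead
                }
                zip.putNextEntry(new ZipEntry(sprite.name() + ".bin"));
                zip.write(buf.array());
                zip.closeEntry();
            }
        }
    }
}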

What the benchmark test does (also sketched below):
- take the zip entries one by one
- decode width/height first, then create an integer array and read/decode the hex strings or byte arrays into it
- convert this int array into an Image
- output the time
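
For the binary variant, the benchmark loop boils down to something like the sketch below. It is simplified (one readInt() per pixel is not tuned code), and it assumes JavaFX's WritableImage as the target Image type; the hex-text variant fills the same int[] by parsing tokens instead.

import java.io.DataInputStream;
import java.util.Collections;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javafx.scene.image.PixelFormat;
import javafx.scene.image.WritableImage;

public final class DecodeBenchmark {
    public static void run(String zipPath) throws Exception {
        long start = System.nanoTime();
        try (ZipFile zip = new ZipFile(zipPath)) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                try (DataInputStream in = new DataInputStream(zip.getInputStream(entry))) {
                    int width = in.readInt();   // header: width and height come first
                    int height = in.readInt();
                    int[] pixels = new int[width * height];
                    for (int i = 0; i < pixels.length; i++) {
                        pixels[i] = in.readInt();   // one raw ARGB pixel
                    }
                    WritableImage image = new WritableImage(width, height);
                    image.getPixelWriter().setPixels(0, 0, width, height,
                            PixelFormat.getIntArgbInstance(), pixels, 0, width);
                }
            }
        }
        System.out.println("Time: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}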

First of all, the zip file for the binary data looks like the following:


This is, in short, about 15,000 files (pictures) which, extracted, amount to roughly 250 MB of data, and which decoded to disk look like this:




Given this much data, I was interested to see how quickly all the pictures could be decoded.

First of all, having these two zips, I wanted a baseline. For this I started with .Net code that extracts every small file in the zip into a byte array in memory. The timing was the following:
Time: 2604 ms
Time text: 3276 ms
In short, uncompressing with the latest .Net on a 64-bit machine on a fairly quick laptop takes around 2.6 seconds for the binary-compacted data and around 3.3 seconds for the text data, just to uncompress it.

I ran the equivalent extraction code in Java, and it finished in around half the time, so for the hex data the pure decompression is closer to 1.5 seconds. The full Java timings are the following:
Time (extract text): 1678 ms
Time with string split: 7012 ms
Time no alloc: 4474 ms

Time (binary data): 1685 ms
Time (extract data): 943 ms

A graph (shorter bars are better):

So what I eventually found is that in Java you can extract the 15,000 files into memory, turn them into int arrays and then convert them into pictures, all in under 1.7 seconds on my machine; that is quicker than the time .Net needs merely to extract them.

This is great, but I also wanted to look at some more real-life use cases. If, for example, you compress hex text files, the extraction code, even written in a fairly GC-friendly way (splitting the text into lines, splitting the lines into tokens, then using Java's string-to-hex parsing), runs very slowly, around 7 seconds. Another interesting variant: instead of splitting strings per row, most of the code, even for plain text, can be written with zero allocations on the per-pixel path (or close to zero: there are still allocations for the image itself, the integer array and so on, just not while processing individual pixels), and with this you get into the 4.5-second range. A sketch of that approach follows.
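
To make the "close to zero allocations" part concrete, here is a sketch of the per-pixel path. Instead of split() producing millions of short-lived strings, the text is walked character by character and the hex digits are accumulated straight into the pixel array. It assumes whitespace-separated hex tokens, one per pixel; the code I actually benchmarked differs in the details.

// Parses whitespace-separated hex tokens straight into pixels[],
// allocating nothing per pixel (no substrings, no boxing).
static void parseHexPixels(CharSequence text, int[] pixels) {
    int pixelIndex = 0;
    int value = 0;
    boolean inToken = false;
    for (int i = 0, n = text.length(); i < n; i++) {
        int digit = Character.digit(text.charAt(i), 16);
        if (digit >= 0) {
            value = (value << 4) | digit;   // accumulate the current hex token
            inToken = true;
        } else if (inToken) {               // token ended at whitespace or newline
            pixels[pixelIndex++] = value;
            value = 0;
            inToken = false;
        }
    }
    if (inToken) {
        pixels[pixelIndex] = value;         // flush a trailing token
    }
}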

So, was .Net really that slow? Yes and no. In this code Java was faster for many reasons, the simplest being that the Java optimizer is more advanced than the .Net one, so in the tight extraction loops Java was shining. Also, in the zero-allocation code and in the binary-processing code I used lambdas, knowing that Java 8 was designed to optimize them well.

Could the time be pushed below 1.7 seconds? Yes, but not by much, except for one thing, and here is the main part: Heroes 2 uses a 256-color palette. Reducing the full bitmap data to palette indices would shrink the 250 MB to around 85 MB, meaning extraction would take roughly a third of the time, and the uncompressed data would also be much friendlier to the memory allocator. I would expect that extracting 85 MB of compressed data (very likely under the 10 MB mark on disk) would take maybe a bit less than one second. A sketch of the palette conversion is below.
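
As a rough illustration of the palette idea (the shared palette array is assumed to be known up front, and the sketch assumes every pixel's color really is one of the 256 entries):

import java.util.HashMap;
import java.util.Map;

final class PaletteCoder {

    // Replaces 4-byte ARGB pixels with 1-byte indices into a shared 256-entry palette.
    static byte[] toIndices(int[] argbPixels, int[] palette) {
        Map<Integer, Byte> lookup = new HashMap<>();
        for (int i = 0; i < palette.length; i++) {
            lookup.put(palette[i], (byte) i);
        }
        byte[] indices = new byte[argbPixels.length];
        for (int i = 0; i < argbPixels.length; i++) {
            indices[i] = lookup.get(argbPixels[i]);   // assumes the color is in the palette
        }
        return indices;
    }

    // Decoding becomes one table lookup per pixel, over a quarter of the data.
    static int[] toArgb(byte[] indices, int[] palette) {
        int[] pixels = new int[indices.length];
        for (int i = 0; i < indices.length; i++) {
            pixels[i] = palette[indices[i] & 0xFF];   // unsigned byte -> palette entry
        }
        return pixels;
    }
}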

So, here is what I learned, for myself and for anyone curious how to improve their application's performance:
- if your application has to load bigger projects/documents, compact the data into binary. If you need to keep plain text, save the plain text, then write a binary copy of it into a temporary folder together with a hash key file, so you can tell whether the original was modified. If it was not modified, load the cached binary file (a sketch follows after the list).
- prefer Java over .Net for very large batch processing
- reduce memory allocation/reallocation. Even for a text-based format, this can bring you to just 2.5 times slower than the full binary format.
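
For the first point, here is a minimal sketch of the cached-binary-copy idea; the file naming and the choice of SHA-256 are only illustrative, not taken from any particular project.

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Arrays;

final class BinaryCache {

    // Returns the cached binary form only if the stored hash still matches the
    // original plain-text file; otherwise returns null and the caller re-parses
    // the text and rewrites both the .bin and the .hash files.
    static byte[] loadCached(Path textFile, Path cacheDir) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(Files.readAllBytes(textFile));
        Path hashFile = cacheDir.resolve(textFile.getFileName() + ".hash");
        Path binFile = cacheDir.resolve(textFile.getFileName() + ".bin");
        if (Files.exists(hashFile) && Files.exists(binFile)
                && Arrays.equals(digest, Files.readAllBytes(hashFile))) {
            return Files.readAllBytes(binFile);   // cache hit: original unchanged
        }
        return null;                              // cache miss
    }
}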

Thursday, September 10, 2015

Apple's September Keynote CPU claims review: Why You Should Not Buy iPad Pro

First of all, you may buy your Apple products for reasons this post does not touch, but the single reason I am writing it is simple: Apple definitely lies to customers in this keynote, and if you care about honesty, I would at least not buy the iPad Pro.

So, first of all, watch the keynote if you haven't already, and after that let's define our terms. This is the second time (after the launch of the iPhone 5S) that they state the CPU inside an Apple product is a "desktop-class CPU", with "console-level graphics" and things like that. If you want to be "sold", please take them at their word.

So why is Apple dishonest? Let's take it claim by claim:

- the claim: the CPU is 1.8x faster than their last tablet (iPad Air 2) and faster than 80% of the CPUs found today in portable devices (like laptops). This is hugely dishonest at best. Their lowest-cost model is 800 USD, and at that price it is definitely among the lower-performing, lower-spec devices. For 800 USD (or in the EU more like 800+ EUR) you can buy a fairly beefy laptop with far more than 32 GB of SSD storage. I bought a Lenovo Y50 (the previous year's model) with a quad-core CPU (versus a very likely 3-core iPad), 8 GB RAM, 1 TB storage (SSHD, though I would opt for a 250 GB SSD) and a UHD ("4K") screen for 950 USD.

So let's benchmark a single CPU core, and let's use a benchmark that is optimized neither for Apple nor for Intel: the Kraken JavaScript benchmark. The Apple iPad Air 2 (their fastest iPad so far) scores around 4000 ms. Let's say Apple really achieves a 1.8x speedup (a true 1.8x, not "up to"). That would mean this newest tablet's single-thread performance would be around 2200 ms to finish Kraken.

Running it today on this Lenovo laptop? 1134.2 ms in Google Chrome and 1118.3 ms in Firefox. In my book that means a typical powerful laptop in the same price class as the iPad Pro should be at least 2x faster in single-core performance, and in a multi-core scenario roughly 2x * 4/3 (4 cores in the laptop vs 3 in the iPad), which would leave the iPad Pro more than 2.5x slower.

Maybe 2.5x in CPU is a small inconvenience, but what about the rest?

GPU benchmarks?
3DMark Ice Storm scores around 210K points on the Lenovo Y50 thanks to the powerful NVidia 860M. The iPad Air 2 is around 22K. Even if the iPad Pro has twice that performance, it will still be around 5 times slower than a mobile GPU.

Memory?
2 GB LPDDR3 vs 8 GB (low-power DDR3) memory. Not sure about you, but having less memory in a productivity device is for me a big differentiating factor.

Screen? 

4K (3840x2160) is a higher resolution than the iPad Pro's (2732 x 2048), and the DPI should be comparable (the Lenovo is 15 inch versus close to 13 on the iPad Pro).

Other factors?
Maybe there is another reason you would want a laptop: being able to expand your storage or memory later. I did this, in fact; I switched from the HDD to an SSD and from 8 GB to 16 GB of RAM, and I could do it by myself, it was easy enough.

I am sure that if you care about your money, and you don't just want to show off your Apple elitism, you can do better on the high end with a quad-core laptop.

But who knows, maybe it is a desktop-class CPU, if you compare it with the weakest CPUs on the market. For example, granting Apple the 1.8x speedup again, it would match most integrated AMD APUs; but the comparison is again dishonest, because those (integrated) GPUs are at least in the same class as the 860M graphics, maybe half of it, and the entire AMD systems, or Intel's lower-cost laptops, cost around half the price.

In fact, a system comparable with the iPad Pro would be one like this one. Not in screen size, of course, but in the "Pro" computing specifications.

Conclusions
Of course, you probably made up your mind before reading this article, and a rant of a blog entry should not change your view. Still, if you care about a company being honest, and you don't mind getting roughly quad-core-Atom performance in a tablet form factor sold at the price of an entry-level quad-core i7 laptop, then go ahead...