Assembly can be a really quirky language.
Recently, I've been playing with the vector functions in x86 assembly just to learn them a bit better and in particular learn their performance characteristics. A big motivation for this is my recent discovery that compilers don't like emitting vector instructions in the code they generate. So naturally, I want to get in and see if there's a missed opportunity here: my raytracing code could certainly benefit from some vector optimization.
Instead of jumping straight in and writing vector and raytracing functions in assembly, however, I wanted to start with something simpler to make sure assembly is worth my time and brain power.
So I came up with: a basic image processing program.
Yes, I know processors are so fast that basic image processing isn't even remotely a performance challenge.
Yes, I know GPUs are far superior at this sort of task.
However, basic image processing is ideal for my purposes: The algorithms are both easily implemented in assembly and they lend themselves extremely well to vectorization. It's the perfect place to test out x86's vector features and find out what the compiler is missing out on.
So how does it perform? Quite strangely.
My first test algorithm was an image invert, which involves taking (1 - color) for every color component of every pixel. Below is my first implementation of the algorithm in assembly. There's not much optimization here; most of it is just getting things working. It operates on 128bpp images with 32-bit floating-point components between 0.0 and 1.0.
;RDI - image pointer
;RSI - width
;RDX - height
ASMInvert:
push rbp
mov rbp, rsp ; Preserve stack frame
mov rcx, rsi
imul rcx, rdx ; Set RCX to width * height
shl rcx, 1 ; Double RCX for address calculation
sub rcx, 2 ; Step back to the last pixel; the loop counts down to 0
movss xmm0, [floatOne]
shufps xmm0, xmm0, 0x00 ; Fill xmm0 with all 1.0's
align 16 ; Align the label for jumping purposes
.MainLoop: ; Image processing loop!
movaps xmm2, xmm0 ; Move the group of 1.0's to a new register
movaps xmm1, [rdi + rcx * 8] ; Load the 4 rgb components
subps xmm2, xmm1 ; Subtract the 4 rgb components from 1.0
movaps [rdi + rcx * 8], xmm2 ; Store the result back in memory
sub rcx, 2 ; Decrement by 2, because it's doubled for addressing
jge .MainLoop ; Loop until the first pixel (offset 0) is done
mov rax, 0 ; Return 0
mov rsp, rbp ; Restore stack frame
pop rbp
ret
On my machine, this implementation can invert 99 images, each at 1000x665 resolution, in about 230 milliseconds. Not bad, but surely it can be better. After all, the only optimization this really takes advantage of is inverting all 4 color channels at once by using a packed vector. What about superscalar execution and inverting multiple pixels at once?
My first instinct to take advantage of this was to interleave another invert routine into the main loop so that the processor could in theory invert two pixels simultaneously. After that, the main loop looked something like this:
mov rdx, rcx
sub rdx, 2 ; Store image size, offset by 2. This will keep a
; second offset for reading pixels
align 16 ; Align the label for jumping purposes
.MainLoop: ; Image processing loop!
movaps xmm4, xmm0 ; Move the group of 1.0's to a new register
movaps xmm2, xmm0 ; Move the group of 1.0's to a new register
movaps xmm3, [rdi + rdx * 8] ; Load the 4 rgb components
movaps xmm1, [rdi + rcx * 8] ; Load the 4 rgb components
subps xmm4, xmm3 ; Subtract the 4 rgb components from 1.0
subps xmm2, xmm1 ; Subtract the 4 rgb components from 1.0
movaps [rdi + rdx * 8], xmm4 ; Store the result back in memory
movaps [rdi + rcx * 8], xmm2 ; Store the result back in memory
sub rdx, 4 ; Decrement by 4: 2 pixels per pass, each doubled for addressing
sub rcx, 4 ; Decrement by 4: 2 pixels per pass, each doubled for addressing
jg .MainLoop
Great! How does it perform?
On my machine, this can invert those same 99 images, each at 1000x665 resolution, in about 230 milliseconds.
Wait what?
Somehow, I was supposedly inverting 2 pixels at once and it didn't make the faintest iota of difference! My first suspicion was that the processor was somehow limited and unable to execute two vector instructions in parallel. After a few inconclusive experiments, I finally stumbled upon the real culprit.
I ended up going back to the original routine and making one simple change: I replaced sub rcx, 2 with sub rcx, 4. That way I'd only be inverting every other pixel in the image. If I only invert half the pixels in the image, it should run twice as fast, right?
Nope. The routine that inverted half the pixels ran at the same 230 milliseconds. What about inverting every 4th pixel in the image? That's 1/4th the number of inversions, surely it must do something.
Not in the slightest: the routine that inverted every 4th pixel ran at the same 230 milliseconds, despite performing only 1/4th the workload of the original routine!
So what about every 8th pixel? Since the pixel workload is seemingly irrelevant to performance, this should run in 230 milliseconds as well. That would be a good sign that OS interrupts or something similar are mucking up the measurements.
Oddly enough, this time it actually has an impact! Inverting every 8th pixel runs at about 130 milliseconds, finally getting that two-fold improvement that you'd expect when you cut the workload in half. So what's going on?
As usual, it was my fault. In my haste to explore the performance of vector functions, I completely ignored another, much more significant performance factor: memory.
Relative to modern processors, RAM is extraordinarily slow, taking about 300 processor cycles to supply data in the event of a cache miss (the exact figure varies with the hardware, but 300 is generally a reasonable estimate). If you're writing a performance-critical routine involving memory, your data had better be in the cache, or you're going to have a much bigger performance issue to deal with.
Unsurprisingly, a 128bpp 1000x665 image doesn't fit in the processor cache: at 16 bytes per pixel, that's over 10 MB of pixel data. Perhaps one of those newer processors with the gigantic multi-megabyte L3 caches can handle it, but not mine.
That also lines up perfectly with the strange performance results when skipping pixels: A cache line is 64 bytes, which stores 4 pixels in 128bpp images. No wonder it performed the same when processing every 2nd or 4th pixel: It needed to load the exact same number of cache lines to get all the pixels required. Only when it handled every 8th pixel did it need half the cache lines, thus doubling performance.
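That arithmetic is easy to sanity-check with a few lines of C++ that count the distinct cache lines touched at each stride (the helper function is mine, not part of the original program; the pixel count matches the 1000x665 test images):

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// For a 128bpp image, each pixel is 16 bytes, so a 64-byte cache line
// holds exactly 4 pixels. Counting the distinct cache lines touched
// shows why skipping pixels didn't help until the stride reached 8.
std::size_t cacheLinesTouched(std::size_t pixelCount, std::size_t stride) {
    const std::size_t pixelSize = 16; // 4 x 32-bit float components
    const std::size_t lineSize  = 64;
    std::set<std::size_t> lines;
    for (std::size_t i = 0; i < pixelCount; i += stride)
        lines.insert(i * pixelSize / lineSize);
    return lines.size();
}
```

For 665,000 pixels, strides of 1, 2, and 4 all touch the same 166,250 cache lines; only a stride of 8 cuts that number in half.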
As a result of this poor cache usage, just about every memory access is going out to DRAM and taking a sizable 300 cycle penalty. I haven't looked at the cycle timings for x86's vector instructions yet, but I'm sure it's nowhere near 300 cycles! Clearly, memory is the true performance bottleneck in this routine.
Which puts all this in a bit of an odd situation: despite its apparent simplicity and naivete, my original implementation is pretty near optimal. Although the processor can continue executing instructions while waiting for a cache line to be loaded, it's going to be really tough to pad out 300 cycles worth of useful processing without touching the actual pixel from memory. You can try reducing the number of cache misses by moving to a lower bpp, such as the standard 32bpp, but that makes it harder to take advantage of vector instructions, partially defeating the purpose of the exercise.
Interestingly, this also means my basic implementation is faster than anything I could coax my compiler into generating. A C++ implementation compiled with g++ without optimization was 4 times slower than my assembly implementation. Adding the "-O3" flag reduced that to about 30% to 50% slower, depending on how OS interrupts felt. After toying around with more flags, including the ominous sounding "-funsafe-math-optimizations" flag, I managed to get the compiled version down to about 15% slower than the assembly version. And in all those optimizations, the compiler STILL refused to use any vector instructions.
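For the curious, the C++ version was nothing exotic; it was something along these lines (a sketch, not my exact code): one subtraction per color component, exactly the kind of loop you'd hope a vectorizer would pounce on.

```cpp
#include <cassert>
#include <cstddef>

// Plain C++ image invert for a 128bpp image: 4 float components per
// pixel, each component replaced with (1 - component).
void invert(float* image, std::size_t width, std::size_t height) {
    const std::size_t count = width * height * 4; // 4 components per pixel
    for (std::size_t i = 0; i < count; ++i)
        image[i] = 1.0f - image[i];
}
```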
At the end of the day, I'm still not entirely sure if vector instructions are worth the hassle. Bearing in mind that this is a memory bound application, having a 30% to 50% gain over typical compiled code from a non-memory related optimization certainly shows great promise. However, if the processor can't handle multiple vector instructions in parallel like I suspect might be the case, that could put a serious dent in optimization opportunities.
Either way, this needs more work to be conclusive. I'm going to look into other, less memory heavy image processing algorithms and see what I find.
thebennybox
Saturday, March 22, 2014
Thursday, March 13, 2014
The Perils of Euler Angles
If you've ever used a 3D program before, you've probably encountered Euler angles. You know, they're that thing where you have a rotation for the X, Y, and Z axis. Blender illustrates this nicely with its rotate tool.
Euler angles make it nice and easy to orient objects in 3D space. If you've done any 3D modeling before, it's hard to imagine orienting things without them.
I'm okay with that. That's not the part that scares me.
What scares me is when I see programmers using Euler angles. Why? If you've written any serious rotation code, you'll know that Euler angles are an absolute programming nightmare.
Allow me to demonstrate. Let's say you want to interpolate between these two rotations.
That's easy! Just interpolate the Z angle and you've got a nice, smooth rotation!
Not so fast! What if you aren't rotating along a global axis?
As soon as you stop playing nice and rotate around a non-global axis, proper interpolation becomes nearly impossible. The fundamental problem here is that Euler angles have no notion of an axis of rotation. They store an orientation and that's it.
But wait, it gets worse. Let's say you do some heroic programming and derive a smooth rotation scheme with Euler angles. Well done! Unfortunately, there's a darker force lurking in the shadows.
You may have noticed in the images that the rotation tool got lopsided with certain rotations. That's because once an object rotates around one axis, the other axes rotate with it. That's all fine except for one little thing: What happens when a rotation leaves two of the axes aligned?
Well...
At this point, rotations around the X and Z axes are equivalent. You want to rotate around that missing axis? Too bad! You've encountered the lovely phenomenon of gimbal lock.
At best, this can cause your object to suddenly flip around 180 degrees. At worst, it can cause your rotation code to fail entirely. There's some heroics you can do to get around this, such as keeping a hidden fourth rotation axis, but at this point the trend should be clear. Euler angles are fundamentally the wrong approach to storing rotations.
So how should you store rotations then? That ultimately depends on what you're trying to do, but as a general replacement might I suggest (brace yourself) quaternions? Don't listen to the 4D hypersphere naysayers, quaternions are easy!
By their very nature, quaternions solve a lot of the issues with Euler angles. Need to know the axis of rotation? Well guess what? A quaternion's X/Y/Z components point right along the axis of rotation. No need for crazy interpolation schemes either, because basic linear interpolation followed by normalization performs a nice interpolation by itself.
Perhaps best of all, it's ridiculously simple to create a quaternion from an axis and angle, making the entire non-global rotation axis dilemma moot.
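Both points fit in a few lines of code. Here's a minimal sketch (not production-ready: it assumes a unit-length axis and skips the usual sign check for interpolating the "short way around"):

```cpp
#include <cassert>
#include <cmath>

struct Quaternion {
    float x, y, z, w;
};

// Build a quaternion from a unit-length axis and an angle in radians:
// xyz = axis * sin(angle/2), w = cos(angle/2).
Quaternion fromAxisAngle(float ax, float ay, float az, float angle) {
    float s = std::sin(angle * 0.5f);
    return { ax * s, ay * s, az * s, std::cos(angle * 0.5f) };
}

// Normalized linear interpolation ("nlerp"): lerp each component,
// then normalize the result back onto the unit hypersphere.
Quaternion nlerp(Quaternion a, Quaternion b, float t) {
    Quaternion r = { a.x + (b.x - a.x) * t,
                     a.y + (b.y - a.y) * t,
                     a.z + (b.z - a.z) * t,
                     a.w + (b.w - a.w) * t };
    float len = std::sqrt(r.x * r.x + r.y * r.y + r.z * r.z + r.w * r.w);
    r.x /= len; r.y /= len; r.z /= len; r.w /= len;
    return r;
}
```

Interpolating halfway between the identity and a 90-degree rotation about Z lands exactly on the 45-degree rotation, axis and all.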
Of course, quaternions might not be the right solution depending on your needs, but almost anything is better than Euler angles. Let your users orient objects with Euler angles all they want; after all, it's a simple and intuitive way to select an orientation. Internally, however, make sure your code is using a different rotation structure unless you want to suffer the perils of Euler angles.
Saturday, March 8, 2014
The Relevance of Assembly Language
Do you remember back in ye olde DOS days when performance critical code was all done in assembly? Back then optimizing compilers had the burden of managing segments, fighting instruction speed vs. instruction fetch time, and all sorts of unpleasant compilation challenges, not the least of which were the horrendously underpowered processors of the time. On the original PC, you couldn't even write a C function to clear the console without noticeable delay because compilers simply couldn't do any better.
Of course, times changed. With 64 bit processors and enormous processor caches, old issues like segments and instruction fetch are practically moot. Combine that with 30 some years of faster processors and improved compiler design, and you start to wonder who on earth needs assembly any more?
Oddly enough, I ran into an issue involving an assembly level solution just a few days ago. I was working on a piece of raytracing code, and things seemed to be working fine. It was generating a basic scene with reflections, lighting, and shadows at roughly 120ms per frame: Not exactly real time, but hardly something to scoff at either.
The whole raytracing code was designed with modularity in mind, and part of that involved using templated vector classes. It seemed great all around; you could define a vector of any type and any size without needing to write new code, and the optimizing compiler made the vectors perform as if you'd hand optimized each one.
At least that's what I thought.
On a whim, I decided to try making a hard coded version of the Vector3f, by far the most popular vector configuration. I didn't do anything special; I just copied the vector template and filled in the generic template types and sizes with the specifics for the Vector3f. Much to my surprise, the scene was suddenly generated at roughly 100ms a frame instead. Imagine that! I improved the performance of the entire program by 16% just by copying code and not even doing anything clever about it.
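The two classes were shaped roughly like this (a simplified sketch, not my actual code; the real classes had far more operations). Logically they're identical; the only difference is whether the type and size are template parameters or literals:

```cpp
#include <cassert>
#include <cstddef>

// Templated vector: works for any component type and dimension,
// with a loop the optimizer has to unroll on its own.
template <typename T, std::size_t N>
struct Vector {
    T v[N];
    Vector operator+(const Vector& o) const {
        Vector r;
        for (std::size_t i = 0; i < N; ++i)
            r.v[i] = v[i] + o.v[i];
        return r;
    }
};

// The hard-coded copy: same logic, but with the type and size
// spelled out, leaving the compiler nothing to deduce or unroll.
struct Vector3f {
    float x, y, z;
    Vector3f operator+(const Vector3f& o) const {
        return { x + o.x, y + o.y, z + o.z };
    }
};
```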
That's when I got suspicious. I went to the ray/triangle intersection function, both a performance-critical and vector-heavy part of the program, and used the _asm { int 3 } trick to generate a breakpoint in the optimized release build. I think I literally facepalmed once I saw the code the compiler was generating.
On the plus side, it was doing a lot of things that you'd expect in a piece of well optimized assembly code:
- The code was laid out in small groups of independent instructions to maximize usage of the processor's superscalar capabilities.
- Branching instructions were reorganized to be used as sparingly as possible to prevent potentially flushing the pipeline.
- It used every register it feasibly could to avoid accessing memory and the potentially costly trip to DRAM.
- It was even smart enough to use SSE vector registers for the vector code.
Unfortunately, there were a lot of missed opportunities:
- For some reason, it decided that EVERYTHING should be stored in the vector registers, regardless of whether or not it had anything to do with vectors. Why not use the integer registers? Last time I checked, the ALU pipeline is separate from the SIMD pipeline, meaning the CPU could execute vector and integer code entirely in parallel.
- The compiler seemed to do some strange hack with the instruction pointer in order to get valid addresses for branching. This might actually be a "standard" way of getting a branch address in the vector registers, but I wouldn't know; I usually store my branch addresses in the integer registers where, y'know, it makes a little bit more sense.
- Although the compiler was using vector registers, it took absolutely no advantage of vector instructions. It was storing each 32-bit component of the vector in a separate 128-bit vector register and adding them independently. Apparently in its excitement to take advantage of superscalar power, it completely ignored the fact that it could store an entire vector in a single register. Doing so would both save precious register space and allow the CPU to perform entire independent vector operations at once, taking much better advantage of the superscalar power the compiler was so happy to be using.
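The missed opportunity is easy to see with SSE intrinsics. The scalar version below is roughly what the generated code amounted to (a sketch, not the compiler's actual output), while the packed version handles all four components with a single add:

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics

// What the compiler was effectively doing: one component per 128-bit
// register, one scalar add (addss) per component.
void addScalar(const float* a, const float* b, float* out) {
    for (int i = 0; i < 4; ++i) {
        __m128 x = _mm_load_ss(&a[i]);
        __m128 y = _mm_load_ss(&b[i]);
        _mm_store_ss(&out[i], _mm_add_ss(x, y));
    }
}

// What it could have done: the whole vector in one register and a
// single packed add (addps).
void addPacked(const float* a, const float* b, float* out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

Both produce identical results; the packed version just uses a quarter of the adds and a quarter of the register space.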
Optimizing compilers might be better than ever, but it's kind of telling when they can't figure out that a vector is a good candidate for vector instructions.
I haven't decided yet if I'm going to take the code down to assembly and try optimizing it by hand. Regardless, this was an important lesson for me about the role of assembly in modern software: Just because you're using the latest and greatest compiler doesn't mean it's generating the best code for your program. If you have a piece of performance critical code, then sure, follow the usual conventions of profiling and improving algorithmic design, but above all make sure you know what the compiler is doing with your code!