Saturday, March 22, 2014

More Assembly Stuff

Assembly can be a really quirky language.

Recently, I've been playing with the vector instructions in x86 assembly just to learn them a bit better and in particular learn their performance characteristics. A big motivation for this is my recent discovery that compilers are reluctant to emit vector instructions in compiled code. So naturally, I want to get in and see if there's a missed opportunity here: my raytracing code could certainly benefit from some vector optimization.

Instead of jumping straight in and writing vector and raytracing functions in assembly however, I wanted to start with something simpler to make sure assembly is worth my time and brain power.

So I came up with: a basic image processing program.

Yes, I know processors are so fast that basic image processing isn't even remotely a performance challenge.
Yes, I know GPUs are far superior at this sort of task.

However, basic image processing is ideal for my purposes: The algorithms are easy to implement in assembly, and they lend themselves extremely well to vectorization. It's the perfect place to test out x86's vector features and find out what the compiler is missing out on.

So how does it perform? Quite strangely.

My first test algorithm was an image invert, which involves taking (1 - color) for every color component of every pixel. Below is my first implementation of the algorithm in assembly. There's not much optimization here; most of it is just getting things working. It operates on 128bpp images with 32-bit floating-point components between 0.0 and 1.0.

;RDI - image pointer
;RSI - width
;RDX - height
ASMInvert:
  push rbp
  mov rbp, rsp ; Set up the stack frame

  mov rcx, rsi
  imul rcx, rdx ; Set RCX to width * height
  shl rcx, 1 ; Double RCX for address calculation

  movss xmm0, [floatOne]
  shufps xmm0, xmm0, 0x00 ; Fill xmm0 with all 1.0's

  align 16 ; Align the label for jumping purposes
.MainLoop: ; Image processing loop!
  sub rcx, 2 ; Step back one pixel up front (the offset is doubled for addressing),
             ;   so the first load hits the last pixel rather than one past the end
  movaps xmm2, xmm0 ; Move the group of 1.0's to a new register
  movaps xmm1, [rdi + rcx * 8] ; Load the 4 rgb components
  subps xmm2, xmm1 ; Subtract the 4 rgb components from 1.0
  movaps [rdi + rcx * 8], xmm2 ; Store the result back in memory
  jg .MainLoop ; The SSE instructions don't touch the flags, so this still tests the sub above

  mov rax, 0 ; Return 0

  mov rsp, rbp ; Restore stack frame
  pop rbp
  ret

On my machine, this implementation can invert 99 images, each at 1000x665 resolution, in about 230 milliseconds. Not bad, but surely it can be better. After all, the only optimization this really takes advantage of is inverting all 4 color channels at once by using a packed vector. What about superscalar execution and inverting multiple pixels at once?
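For a sense of what the compiler was up against, here's roughly what the plain C++ version of this invert looks like. This is just a sketch of the idea, not the exact code from my benchmark; the function name and the assumption of a flat RGBA float layout are mine:

```cpp
#include <cstddef>

// Scalar invert for a 128bpp image: 4 floats (r, g, b, a) per pixel,
// each component in [0.0, 1.0]. Touches one component at a time, so
// any vectorization has to come from the compiler.
void InvertScalar(float* pixels, std::size_t width, std::size_t height)
{
    std::size_t componentCount = width * height * 4;
    for (std::size_t i = 0; i < componentCount; ++i)
        pixels[i] = 1.0f - pixels[i];
}
```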

My first instinct to take advantage of this was to interleave another invert routine into the main loop so that the processor could in theory invert two pixels simultaneously. After that, the main loop looked something like this:

  mov rdx, rcx
  add rdx, 2 ; Second offset, one pixel ahead of RCX; the loop below steps
             ;   both offsets down before using them

  align 16 ; Align the label for jumping purposes
.MainLoop: ; Image processing loop!
  sub rdx, 4 ; Step both offsets back two pixels up front (they're doubled
  sub rcx, 4 ;   for addressing), so the loads stay inside the image
  movaps xmm4, xmm0 ; Move the group of 1.0's to a new register
  movaps xmm2, xmm0 ; Move the group of 1.0's to a new register
  movaps xmm3, [rdi + rdx * 8] ; Load the 4 rgb components of one pixel
  movaps xmm1, [rdi + rcx * 8] ; Load the 4 rgb components of the next
  subps xmm4, xmm3 ; Subtract the 4 rgb components from 1.0
  subps xmm2, xmm1 ; Subtract the 4 rgb components from 1.0
  movaps [rdi + rdx * 8], xmm4 ; Store the result back in memory
  movaps [rdi + rcx * 8], xmm2 ; Store the result back in memory
  jg .MainLoop ; Still tests the sub of RCX at the top; the SSE instructions leave the flags alone

Great! How does it perform?
On my machine, this can invert those same 99 images, each at 1000x665 resolution, in about 230 milliseconds.

Wait what?

Somehow, I was supposedly inverting 2 pixels at once and it didn't make the faintest iota of difference! My first suspicion was that the processor was somehow limited and unable to execute two vector instructions in parallel. After a few inconclusive experiments, I finally stumbled upon the real culprit.

I ended up going back to the original routine and making one simple change: I replaced sub rcx, 2 with sub rcx, 4. That way I'd only be inverting every other pixel in the image. If I only invert half the pixels in the image, it should run twice as fast, right?

Nope. The routine that inverted half the pixels ran at the same 230 milliseconds. What about inverting every 4th pixel in the image? That's 1/4th the number of inversions, surely it must do something.

Not in the slightest: the routine that inverted every 4th pixel ran at the same 230 milliseconds, despite performing only 1/4th the workload of the original routine!

So what about every 8th pixel? Since the pixel workload is seemingly irrelevant to performance, this should run in 230 milliseconds as well. If it did, that would be a good sign that OS interrupts or something similar were mucking up the measurements.

Oddly enough, this time it actually has an impact! Inverting every 8th pixel runs at about 130 milliseconds, finally getting that two-fold improvement that you'd expect when you cut the workload in half. So what's going on?
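For anyone who wants to reproduce the skip experiments without writing assembly, the same idea in C++ looks like this. Again a sketch with my own naming, not the original benchmark code; stride 1 is the full invert, stride 4 hits every 4th pixel:

```cpp
#include <cstddef>

// Invert every `stride`-th pixel of a 128bpp image (4 floats per pixel).
// With stride 2 or 4 the arithmetic drops, but every 64-byte cache line
// still gets pulled in, since a line holds exactly four 16-byte pixels.
void InvertStrided(float* pixels, std::size_t pixelCount, std::size_t stride)
{
    for (std::size_t p = 0; p < pixelCount; p += stride)
        for (std::size_t c = 0; c < 4; ++c)
            pixels[p * 4 + c] = 1.0f - pixels[p * 4 + c];
}
```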

As usual, it was my fault. In my haste to explore the performance of vector functions, I completely ignored another and much more significant performance factor: Memory.

Relative to modern processors, RAM is extraordinarily slow, taking about 300 processor cycles to supply data in the event of a cache miss (that number varies with a number of hardware factors, but 300 is generally a reasonable estimate). If you're writing a performance-critical routine involving memory, your data had better be in the cache, or you're going to have a much bigger performance issue to deal with.

Unsurprisingly, a 128bpp 1000x665 image doesn't fit in the processor cache: at 16 bytes per pixel, it weighs in at over 10 megabytes. Perhaps one of those newer processors with the gigantic multi-megabyte L3 caches can handle it, but not mine.

That also lines up perfectly with the strange performance results when skipping pixels: A cache line is 64 bytes, which holds exactly 4 pixels of a 128bpp image. No wonder it performed the same when processing every 2nd or 4th pixel: It needed to load the exact same number of cache lines to get all the pixels required. Only when it handled every 8th pixel did it need half the cache lines, thus doubling performance.
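That cache line math is easy to sanity-check. This little C++ helper (my own sketch, using the same 64-byte line and 16-byte pixel assumptions as above) counts the distinct cache lines a strided walk over the image touches:

```cpp
#include <cstddef>

// Count the distinct 64-byte cache lines touched when visiting every
// `stride`-th 16-byte pixel of an image with `pixelCount` pixels.
// The walk is monotonic, so comparing against the previous line suffices.
std::size_t CacheLinesTouched(std::size_t pixelCount, std::size_t stride)
{
    const std::size_t bytesPerPixel = 16; // 128bpp
    const std::size_t lineSize = 64;      // 4 pixels per line
    std::size_t lines = 0;
    std::size_t lastLine = static_cast<std::size_t>(-1); // sentinel: no line yet
    for (std::size_t p = 0; p < pixelCount; p += stride) {
        std::size_t line = p * bytesPerPixel / lineSize;
        if (line != lastLine) {
            ++lines;
            lastLine = line;
        }
    }
    return lines;
}
```

For a 1000x665 image it reports 166,250 lines for strides 1, 2, and 4 alike, and 83,125 for stride 8 — exactly the halving that finally showed up in the timings.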

As a result of this poor cache usage, just about every memory access goes out to DRAM and takes a sizable 300-cycle penalty. I haven't looked at the cycle timings for x86's vector instructions yet, but I'm sure they're nowhere near 300 cycles! Clearly, memory is the true performance bottleneck in this routine.

Which puts all this in a bit of an odd situation: Despite its apparent simplicity and naivete, my original implementation is pretty near optimal. Although the processor can continue executing instructions while waiting for a cache line to be loaded, it's going to be really tough to pad out 300 cycles' worth of useful processing without touching the actual pixel from memory. You can try reducing the number of cache misses by moving to a lower bpp, such as the standard 32bpp, but that makes it harder to take advantage of vector instructions, partially defeating the purpose of the exercise.

Interestingly, this also means my basic implementation is faster than anything I could coax my compiler into generating. A C++ implementation compiled with g++ without optimization was 4 times slower than my assembly implementation. Adding the "-O3" flag reduced that to about 30% to 50% slower, depending on how the OS interrupts felt. After toying around with more flags, including the ominous-sounding "-funsafe-math-optimizations" flag, I managed to get the compiled version down to about 15% slower than the assembly version. And through all those optimizations, the compiler STILL refused to use any vector instructions.

At the end of the day, I'm still not entirely sure if vector instructions are worth the hassle. Bearing in mind that this is a memory-bound application, having a 30% to 50% gain over typical compiled code from a non-memory-related optimization certainly shows great promise. However, if the processor can't handle multiple vector instructions in parallel like I suspect might be the case, that could put a serious dent in optimization opportunities.

Either way, this needs more work to be conclusive. I'm going to look into other, less memory heavy image processing algorithms and see what I find.
