Saturday, March 8, 2014

The Relevance of Assembly Language

Do you remember back in ye olde DOS days when performance critical code was all done in assembly? Back then optimizing compilers had the burden of managing segments, fighting instruction speed vs. instruction fetch time, and all sorts of unpleasant compilation challenges, not the least of which were the horrendously underpowered processors of the time. On the original PC, you couldn't even write a C function to clear the console without noticeable delay because compilers simply couldn't do any better.

Of course, times changed. With 64 bit processors and enormous processor caches, old issues like segments and instruction fetch are practically moot. Combine that with 30-some years of faster processors and improved compiler design, and you start to wonder who on earth needs assembly anymore?

Oddly enough, I ran into an issue involving an assembly level solution just a few days ago. I was working on a piece of raytracing code, and things seemed to be working fine. It was generating a basic scene with reflections, lighting, and shadows at roughly 120ms per frame: Not exactly real time, but hardly something to scoff at either.
The whole raytracing code was designed with modularity in mind, and part of that involved using templated vector classes. It seemed great all around; you could define a vector of any type and any size without needing to write new code, and the optimizing compiler made the vectors perform as if you'd hand optimized each one.
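
For reference, here's a minimal sketch of what such a templated vector might look like. The names and operators are illustrative, not the actual code from my raytracer:

```cpp
#include <cstddef>

// One generic, fixed-size vector covers Vector2f, Vector3f, Vector4d, and so on.
template <typename T, std::size_t N>
class Vector
{
public:
    T v[N];

    Vector operator+(const Vector& other) const
    {
        Vector result;
        for (std::size_t i = 0; i < N; i++)
            result.v[i] = v[i] + other.v[i];
        return result;
    }

    T dot(const Vector& other) const
    {
        T result = T(0);
        for (std::size_t i = 0; i < N; i++)
            result += v[i] * other.v[i];
        return result;
    }
};

typedef Vector<float, 3> Vector3f;
```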

At least that's what I thought. 

On a whim, I decided to try making a hard coded version of the Vector3f, by far the most popular vector configuration. I didn't do anything special; I just copied the vector template and filled in the generic template types and sizes with the specifics for the Vector3f. Much to my surprise, the scene was suddenly generated at roughly 100ms a frame instead. Imagine that! I improved the performance of the entire program by 16% just by copying code, without doing anything clever at all.
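
The hand-specialized version was nothing more than the same code with the template machinery stripped out, roughly along these lines (again, an illustrative sketch rather than the real thing):

```cpp
// Same operations as the template, but with the type and size baked in.
class Vector3f
{
public:
    float x, y, z;

    Vector3f operator+(const Vector3f& other) const
    {
        Vector3f result;
        result.x = x + other.x;
        result.y = y + other.y;
        result.z = z + other.z;
        return result;
    }

    float dot(const Vector3f& other) const
    {
        return x * other.x + y * other.y + z * other.z;
    }
};
```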

That's when I got suspicious. I went to the ray/triangle intersection function, both a performance critical and vector heavy part of the program, and used the _asm { int 3 } trick to generate a breakpoint in the optimized release build. I think I literally facepalmed once I saw the code the compiler was generating.
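
If you haven't seen that trick before: int 3 is the x86 software breakpoint instruction, so planting one in a function makes the debugger stop right there even in a fully optimized release build, letting you read the surrounding disassembly. A minimal sketch, using MSVC x86 inline assembly and a hypothetical stand-in function (on x64, where inline _asm isn't supported, __debugbreak() does the same job):

```cpp
#include <cstdio>

// Hypothetical stand-in for a performance critical routine.
static float dot3(const float a[3], const float b[3])
{
    _asm { int 3 } // debugger breaks here, even with full optimizations on
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

int main()
{
    const float a[3] = { 1.0f, 2.0f, 3.0f };
    const float b[3] = { 4.0f, 5.0f, 6.0f };
    std::printf("%f\n", dot3(a, b));
    return 0;
}
```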

On the plus side, it was doing a lot of things that you'd expect in a piece of well optimized assembly code:

  • The code was laid out in small groups of independent instructions to maximize usage of the superscalar capabilities of the processor. 
  • Branching instructions were reorganized so they were used as sparingly as possible, avoiding potential pipeline flushes.
  • It used every register it feasibly could to avoid accessing memory and the potentially costly trip to DRAM.
  • It was even smart enough to use SSE vector registers for the vector code.

Unfortunately, there were a lot of missed opportunities:

  • For some reason, it decided that EVERYTHING should be stored in the vector registers, regardless of whether or not it had anything to do with vectors. Why not use the integer registers? Last time I checked, the ALU pipeline is separate from the SIMD pipeline, meaning the CPU could execute vector and integer code entirely in parallel.
  • The compiler seemed to do some strange hack with the instruction pointer in order to get valid addresses for branching. This might actually be a "standard" way of getting a branch address in the vector registers, but I wouldn't know; I usually store my branch addresses in the integer registers where, y'know, it makes a little bit more sense.
  • Although the compiler was using vector registers, it took absolutely no advantage of vector instructions. It was storing each 32 bit component of the vector in a separate 128 bit vector register and adding them independently. Apparently, in its excitement to take advantage of superscalar power, it completely ignored the fact that it could store an entire vector in a single register. Doing so would both save precious register space and allow the CPU to perform entire independent vector operations at once, taking much better advantage of the superscalar power the compiler was so happy to be using (see the sketch after this list).
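
To make the difference concrete, here's roughly the contrast between the two approaches, written by hand with SSE intrinsics. This is my own sketch of the idea, not the compiler's actual output, and it assumes the vectors are padded to four floats and 16-byte aligned:

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Roughly what the compiler was doing: one component per 128 bit register,
// three separate scalar additions.
void addPerComponent(const float* a, const float* b, float* out)
{
    __m128 x = _mm_add_ss(_mm_load_ss(&a[0]), _mm_load_ss(&b[0]));
    __m128 y = _mm_add_ss(_mm_load_ss(&a[1]), _mm_load_ss(&b[1]));
    __m128 z = _mm_add_ss(_mm_load_ss(&a[2]), _mm_load_ss(&b[2]));
    _mm_store_ss(&out[0], x);
    _mm_store_ss(&out[1], y);
    _mm_store_ss(&out[2], z);
}

// What it could have done: the whole vector in one register, one addition.
void addWholeVector(const float* a, const float* b, float* out)
{
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));
}
```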

Optimizing compilers might be better than ever, but it's kind of telling when they can't figure out that a vector is a good candidate for vector instructions.

I haven't decided yet if I'm going to take the code down to assembly and try optimizing it by hand. Regardless, this was an important lesson for me about the role of assembly in modern software: Just because you're using the latest and greatest compiler doesn't mean it's generating the best code for your program. If you have a piece of performance critical code, then sure, follow the usual conventions of profiling and improving algorithmic design, but above all make sure you know what the compiler is doing with your code!

4 comments:

  1. I love these! Almost unbelievable you are self-taught! Well done!

  2. That was very interesting. I don't know very much about all the fancy vector features of x86, since the only work I've ever done in assembly was to do with operating system development. Keep up the good work!

  3. That is interesting. I am messing around with writing a compiler for fun, but sadly there are not very many good or recent resources. I would like to know for sure what the compiler does with my code, especially with syntax trees. Very interesting entry.

  4. Yo Benny, Why have you stopped producing awesome stuff? Also you seem to be a complete ghost
