

May 2, 2020
Revisiting the Business Card Raytracer

What is the most beautiful piece of code you have ever seen?

Not only is it subjective, "beauty" is also a word that can mean a lot of things. If we were discussing raw speed, I would bring up the Fast Inverse Square Root[1]. If we were judging by aesthetics, my answer would be Andrew Kensler's 1337-byte Business Card Raytracer.

Back in 2013 I spent a long time staring at Andrew's code in order to make sense of it[2]. I discussed at length how he managed to do so much with so little. Because of its real-estate constraint, his program took a while to run, and I remember wondering how fast it could have performed with some optimizations.

Since I have been staying at home a lot lately, I decided to spend some of my spare time on it. I started with SIMD, followed with compiler flags, then multi-core, and ended up involving the GPU with CUDA. It was a prolific weekend which saw the runtime drop from 101,000ms to 150ms!

Establishing a baseline

#include <stdlib.h>   // card > aek.ppm
#include <stdio.h>
#include <math.h>
typedef int i;typedef float f;struct v{
f x,y,z;v operator+(v r){return v(x+r.x
,y+r.y,z+r.z);}v operator*(f r){return
v(x*r,y*r,z*r);}f operator%(v r){return
x*r.x+y*r.y+z*r.z;}v(){}v operator^(v r
){return v(y*r.z-z*r.y,z*r.x-x*r.z,x*r.
y-y*r.x);}v(f a,f b,f c){x=a;y=b;z=c;}v
operator!(){return*this*(1/sqrt(*this%*
this));}};i G[]={247570,280596,280600,
249748,18578,18577,231184,16,16};f R(){
return(f)rand()/RAND_MAX;}i T(v o,v d,f
&t,v&n){t=1e9;i m=0;f p=-o.z/d.z;if(.01
<p)t=p,n=v(0,0,1),m=1;for(i k=19;k--;)
for(i j=9;j--;)if(G[j]&1<<k){v p=o+v(-k
,0,-j-4);f b=p%d,c=p%p-1,q=b*b-c;if(q>0
){f s=-b-sqrt(q);if(s<t&&s>.01)t=s,n=!(
p+d*t),m=2;}}return m;}v S(v o,v d){f t
;v n;i m=T(o,d,t,n);if(!m)return v(.7,
.6,1)*pow(1-d.z,4);v h=o+d*t,l=!(v(9+R(
),9+R(),16)+h*-1),r=d+n*(n%d*-2);f b=l%
n;if(b<0||T(h,l,t,n))b=0;f p=pow(l%r*(b
>0),99);if(m&1){h=h*.2;return((i)(ceil(
h.x)+ceil(h.y))&1?v(3,1,1):v(3,3,3))*(b
*.2+.1);}return v(p,p,p)+S(h,r)*.5;}i
main(){printf("P6 512 512 255 ");v g=!v
(-6,-16,0),a=!(v(0,0,1)^g)*.002,b=!(g^a
)*.002,c=(a+b)*-256+g;for(i y=512;y--;)
for(i x=512;x--;){v p(13,13,13);for(i r
=64;r--;){v t=a*(R()-.5)*99+b*(R()-.5)*
99;p=S(v(17,16,8)+t,!(t*-1+(a*(R()+x)+b
*(y+R())+c)*16))*3.5+p;}printf("%c%c%c"
,(i)p.x,(i)p.y,(i)p.z);}}

If you are not familiar with the business card raytracer and you don't want to read the breakdown, it is the blob listed above. It may look hairy, but once cleaned up, card_base.cpp is pretty simple, with three functions: main, Trace, and Sample.
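
For orientation, here is a rough, hypothetical outline of the cleaned-up structure (the names mirror card_base.cpp but the signatures are paraphrased, not copied):

struct vec { float x, y, z; };   // +, *, dot (%), cross (^), normalize (!)

// Returns 0 (sky), 1 (floor) or 2 (sphere); fills in the hit distance t and normal n.
int Trace(vec origin, vec direction, float &t, vec &n);

// Shoots a ray into the scene, bouncing off the spheres, and returns an RGB color.
vec Sample(vec origin, vec direction);

int main() {
  // For each of the 512x512 pixels, average 64 jittered samples
  // (camera jitter for depth of field + anti-aliasing) and write them
  // out as a binary PPM on stdout.
}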

There is a comment at the top of the listing explaining how to run the program but not how to compile it. The baseline was established with a borderline pathological compilation command.

$> clang -o card -lm card.cpp
$> time ./card > card.ppm

real    1m41.829s
user    1m41.703s
sys     0m0.031s 

I ran it on The Beautiful Machine[3] which features a Ryzen 5 2600 with 6 cores capable of 12 threads at 3.5GHz. It took 101 seconds to render the PPM image.

This was already an improvement compared to 2013, when a MacBook Air with a Core i5 running at 1.7GHz resulted in a 131s runtime.

A quick benchmark showed that the total cost of I/O (the printf() calls generating the PPM) accounted for about one millisecond in total and was therefore negligible.

The first optimization on my list was to use SIMD instructions to speed up vector manipulation. Opening the binary with Binary Ninja revealed that clang had already managed to leverage the SSE registers.

In fact, attempting to disable SSE proved downright impossible since the -mno-sse flag crashed Clang.

$> clang -o card -lm -mno-sse card.cpp
[...]
card.cpp:32:5: error: SSE register return with SSE disabled
fatal error: error in backend: Cannot pop empty stack!
clang: error: clang frontend command failed with exit code 70
int TraceRay(v origin, v destination, float &t, v &normal) {
    ^ 

Optimization level 1

With SIMD already taken care of, the next step was to play with compiler flags. Instead of going directly for the most aggressive form of optimization (-O3), I took a detour to peek at the other levels. The gcc documentation[4] explains -O1 well.

With -O1, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.

This improved performance by almost 50% and brought the runtime down to 52s.

$> clang -o card -lm -O1 card.cpp 
$> time ./card > /dev/null

real    0m52.874s
user    0m52.750s
sys     0m0.016s 

Comparing the instructions generated at -O0 and -O1 reveals that the vector class methods are still compiled as functions, but some of them, such as v::operator*, no longer require a stack frame.

v::operator*(float):
push    rbp
mov     rbp, rsp
sub     rsp, 48
mov     qword ptr [rbp - 24], rdi
movss   dword ptr [rbp - 28], xmm0
mov     rax, qword ptr [rbp - 24]
movss   xmm0, dword ptr [rbp - 28] # xmm0
movss   xmm1, dword ptr [rax]   # xmm1
movss   xmm2, dword ptr [rax + 4] # xmm2
mulss   xmm1, xmm0
mulss   xmm2, xmm0
movss   xmm3, dword ptr [rax + 8] # xmm3
mulss   xmm3, xmm0
lea     rdi, [rbp - 16]
movaps  xmm0, xmm1
movaps  xmm1, xmm2
movaps  xmm2, xmm3
call    v::v(float, float, float)
mov     ecx, dword ptr [rbp - 8]
mov     dword ptr [rbp - 40], ecx
mov     rax, qword ptr [rbp - 16]
mov     qword ptr [rbp - 48], rax
movsd   xmm0, qword ptr [rbp - 48] # xmm0
movss   xmm1, dword ptr [rbp - 40] # xmm1
add     rsp, 48
pop     rbp
ret 
v::operator*(float):
sub     rsp, 24
movaps  xmm2, xmm0
movss   xmm0, dword ptr [rdi]   # xmm0
mulss   xmm0, xmm2
movss   xmm1, dword ptr [rdi + 4] # xmm1
mulss   xmm1, xmm2
mulss   xmm2, dword ptr [rdi + 8]
lea     rdi, [rsp + 8]
call    v::v(float, float, float)
movsd   xmm0, qword ptr [rsp + 8] # xmm0
movss   xmm1, dword ptr [rsp + 16] # xmm1
add     rsp, 24
ret 
First listing: no optimization. Second listing: -O1 optimization.

Binary Ninja was a good tool to look at individual binaries. However, when the purpose is comparison, I found Compiler Explorer to perform better. It is a free tool; its output for this code can be seen here.

Optimization level 3

I skipped level 2 and went directly to level 3. At this level, all optimizations are applied, regardless of compilation time and binary size. The documentation warns that the increased volume of instructions may be detrimental to performance. Our instance does not fall under this condition, and the runtime dropped again, to 11s.

$> clang -o card -lm -O3 card.cpp
$> time ./card > /dev/null

real    0m11.550s
user    0m11.531s
sys     0m0.000s 

At this level, the vector class is completely melted away by the compiler; all v:: operators are gone from the symbol table. A particularly rich example is the normal assignment line found in the TraceRay function.

int TraceRay(v origin, v destination, float &t, v &normal) {
  ...
  normal = v(0, 0, 1);
  ...
}

This line is compiled to 11 instructions with -O1 but only two with -O3.

movss   dword [rbx], xmm1
lea     rdi, [rsp+0x8 {var_a0}]
movss   xmm2, dword [rel data_401298]
xorps   xmm0, xmm0  {0x0}
xorps   xmm1, xmm1  {0x0}
call    v::v
mov     eax, dword [rsp+0x10 {var_98}]
mov     rcx, qword [rsp+0x28 {var_80}]
mov     dword [rcx+0x8], eax
mov     rax, qword [rsp+0x8 {var_a0}]
mov     qword [rcx], rax
mov     qword [r14], 0x0
mov     dword [r14+0x8], 0x3f800000
First listing: -O1. Second listing: -O3.

On top of all the inlining (the v::v constructor is gone, and the target is written into directly), clang moved some compile-time vector constants out of the data segment and directly into the instruction stream as immediates, in order to avoid reading operands from memory. This is the case for the value 1.0f (0x3f800000).

Clang also detected two adjacent 4-byte values set to zero and decided to write them with a single 8-byte store (qword vs dword).

Using all cores

So far the raytracer was single-threaded. For the next step, I threw all of the Ryzen 5's cores (and their 12 hardware threads) at the problem. Minor refactoring resulted in card_cores.cpp, which spawns 512 pthreads, each rendering a segment of 512 contiguous pixels.
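
To give an idea of the scheme, here is a minimal sketch of the idea under my own naming, not the actual card_cores.cpp.

#include <pthread.h>

struct Job { int y; unsigned char* rgb; };          // one 512-pixel row per worker

void* RenderRow(void* arg) {
  Job* job = (Job*)arg;
  for (int x = 0; x < 512; x++) {
    // Shoot 64 jittered rays for pixel (x, job->y) and store the color in job->rgb.
  }
  return nullptr;
}

int main() {
  static unsigned char image[512 * 512 * 3];
  pthread_t threads[512];
  Job jobs[512];
  for (int y = 0; y < 512; y++) {
    jobs[y] = { y, &image[y * 512 * 3] };
    pthread_create(&threads[y], nullptr, RenderRow, &jobs[y]);
  }
  for (int y = 0; y < 512; y++) pthread_join(threads[y], nullptr);
  // Emit the PPM header and the image buffer on stdout afterwards.
}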

$> clang -O3 -o card_cores -lm -lpthread card_cores.cpp
$> time ./card_cores > /dev/null

real    0m26.758s
user    0m34.531s
sys     4m35.453s

Oops. Throwing 12 threads at the task made it take twice as long to complete. I was surprised by the result since I had been careful to avoid false sharing, an issue well described by Scott Meyers[5] that arises when cache coherency traffic dominates the runtime. I had obviously failed to cuddle the cachelines. If you go back to card_cores.cpp, can you spot the mistake?

If you guessed libc's rand(), you were right. This random generator maintains an internal state between calls, which means every core calling rand() dirtied the cacheline containing that state in every other core's cache. The solution was to re-implement the random number generator with a thread-local state.

thread_local int seed = 1;
float R() {
  seed = ( 214013 * seed + 2531011);
  return ((seed >> 16) & 0xFFFF) / 66635.0f;
}	

multi_thard.cpp showed a tremendous improvement, achieving a 10x speedup compared to the single-threaded version.

$> clang -O3 -o multi_thard -lm -lpthread multi_thard.cpp
$> time ./multi_thard > /dev/null 

real    0m1.291s
user    0m14.734s
sys     0m0.109s	

At this point, the Business Card Raytracer runtime was down from 101 seconds to 1.3 seconds. I figured it was close to the best I could do while relying only on the CPU. It was time to move on to something beefier.

CUDA

After playing with compiler optimizations and multi-threading, the next step was to involve the GPU. It was good timing since I had just finished building myself a new PC[6]. For the first time in 15 years, my GPU (a Pascal-based GeForce GTX 1050 Ti) was only one generation behind the latest architecture, Turing.

For GPGPU programming there are two choices: OpenCL and CUDA. My years studying at the University of id Software and the adage "open standards are Good and Portable" kicked in, so I contemplated using OpenCL. After a little bit of investigation, it looked a lot like GLSL, where you need to create buffers, bind variables, and createProgram by handing over the source code of the kernel. I had painful memories of dealing with compiler implementation details with this approach. There was some hope in using SPIR-V and clCreateProgramWithBinary, but reports of it not being supported on certain platforms buried that idea quickly. In the end, I went with CUDA.

It turned out to be a delightful experience where a programmer only needs to tag code/data locations via special qualifiers such as __global__ and __device__, with the nvcc compiler taking care of almost everything else.
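
As a toy illustration of what that looks like (not the actual port; the kernel below just brightens a buffer):

#include <cuda_runtime.h>

__device__ float Brighten(float c) {              // runs on the GPU, callable from kernels
  return c * 1.1f;
}

__global__ void Shade(float* pixels, int n) {     // kernel entry point, launched from the host
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) pixels[i] = Brighten(pixels[i]);
}

int main() {
  const int n = 512 * 512;
  float* gpuPixels;
  cudaMalloc(&gpuPixels, n * sizeof(float));      // allocate on the GPU
  Shade<<<(n + 255) / 256, 256>>>(gpuPixels, n);  // 256 threads per block
  cudaDeviceSynchronize();
  cudaFree(gpuPixels);
}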

Learning CUDA and SIMT

There are a lot of books about CUDA. They all seem to have a Borg cube on their cover.

- "I bet nobody else will think of using a cube on the cover!" "Cuda by Example" was a good starter which took me from "HelloWorld" to something that compiled and ran without crashing. With minimal refactoring and relying mostly on qualifier keyword, I came up with a naive card_cuda.cu.

I was expecting a total performance carnage but things went better than expected.

C:\Users\fab>nvcc -o card_cuda -O3 card_cuda.cu
warning : Stack size for entry function '_Z8GetColorPh' cannot be statically determined  
C:\Users\fab>card_cuda > NUL
Time: 984ms

The warning is probably not a good thing (I'll get back to this later) but the program ran successfully. With zero GPU-specific optimization, the GPU outperformed the 12-thread CPU version by 30%.

Taking it to the next level

To improve the runtime further, I found "Professional CUDA C Programming" to be a gold mine. The book explains in great detail how CUDA cores and stream multiprocessors work and how to profile and optimize a kernel. The rest of this article is rooted in this book; I highly recommend it.

Cuddling the Stream Multiprocessor

Over the past two decades, NVidia's GPU architecture has evolved drastically[7]. Fortunately, the programming model has remained unchanged. A GPU is fed thousands of threads. These are grouped in packs of 32 (called warps) which all share the same instruction pointer (hence the name SIMT) and execute the same instruction in lockstep. Performance-wise, the most important thing is to keep the stream multiprocessors (SMs) well fed with instructions, data, and warps.

NVidia provides a profiling tool named nvprof which can measure metrics such as occupancy and thread divergence within a warp.

C:\Users\fab>nvprof --metrics achieved_occupancy,branch_efficiency card_cuda > NUL
Branch Efficiency             99.90%
Average Achieved Occupancy    0.123891  

"Branch Efficiency" measures how much threads in a warp diverge. The figure for our raytracer is good. I was expecting it to perform badly since rays bounce everywhere but it seems spacial locality helped thread to be coherent.

"Achieved Occupancy" on the other side was abysmal (best is 1.0). It seems the GPU SMs were not able to keep enough warp in-flight which resulted in starvation. This is correlated with the warning nvcc issued. Because the raytracer uses recursion, it uses a lot of stacks. So much actually that the SM cannot keep more than a few alive.

The problem is the recursive nature of the code; that is also why the compiler was unable to determine the stack size earlier. Revising the raytracer to use a loop instead of recursion (card_cudb.cu) reduced the amount of stack consumed. This translated into twice the occupancy and halved the runtime!
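
The transformation works because each bounce in Sample only adds a local shading term and then scales the next bounce's contribution by 0.5. That structure can be flattened into a loop with a running attenuation factor. A toy illustration of the pattern (not the actual card_cudb.cu):

#include <cstdio>

// Recursive shape of the shader: local term, then recurse with the result halved.
float SampleRecursive(int bounce) {
  float local = 1.0f / (bounce + 1);          // stand-in for the local shading term
  if (bounce == 7) return local;              // fixed bounce budget
  return local + 0.5f * SampleRecursive(bounce + 1);
}

// Same computation as a loop: a running attenuation replaces the call stack.
float SampleIterative() {
  float color = 0.0f, attenuation = 1.0f;
  for (int bounce = 0; bounce <= 7; bounce++) {
    color += attenuation * (1.0f / (bounce + 1));
    attenuation *= 0.5f;                      // each bounce contributes half as much
  }
  return color;
}

int main() { printf("%f %f\n", SampleRecursive(0), SampleIterative()); }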

C:\Users\fab>nvcc -o card_cudb -O3 card_cudb.cu
C:\Users\fab>nvprof --metrics achieved_occupancy card_cudb > NUL
Average Achieved Occupancy    0.282478
C:\Users\fab>card_cudb > NUL
Time: 546ms

Avoiding double precision floating point (FP64)

Another recommendation from the book was to avoid double precision if possible. nvprof showed that no FP64 instructions were used in the kernel.

C:\Users\fab>nvprof --metrics flop_count_dp,flop_count_sp card_cudb > NUL
Average Floating Point Operations(Double Precision)           0     
Average Floating Point Operations(Single Precision)  2.3327e+10

Using half-precision float (FP16)

Support for half-floats was added by CUDA 7.5 in 2015. Initially this only helped with storage, but cards based on the Volta[8] and Turing[9] architectures provide a 2x performance boost when crunching FP16s. Sadly, previous generations such as my Pascal GTX 1050 Ti crunch FP16... 64 times slower than FP32[10]. Nothing to see here.

Using intrinsics

Next on the list was to use CUDA intrinsics. These functions cover every math operation from math.h and more. There are also variants that trade accuracy for speed.

#define POW         __powf
#define CEIL        ceilf
#define RSQRT(x)    rsqrtf(x)
#define SQRT(x)     sqrtf(x)
#define DIVIDE(x,y) __fdividef((x),(y))
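
To show how these slot into the vector code (a hypothetical rewrite of the helpers, not the actual card_cudc.cu), the normalization operator, originally *this * (1/sqrt(*this % *this)), maps nicely onto rsqrtf:

struct vec { float x, y, z; };

__device__ float Dot(vec a, vec b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

__device__ vec Normalize(vec a) {
  float s = rsqrtf(Dot(a, a));              // reciprocal square root in one intrinsic
  return { a.x*s, a.y*s, a.z*s };
}

__device__ float Highlight(float cosine) {
  return __powf(cosine, 99.0f);             // fast, lower-accuracy pow for the specular term
}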

The resulting intrinsic-based card_cudc.cu proved twice as fast as the previous iteration.

C:\Users\fab>card_cudc > NUL 
Time: 218ms

Using the fast-math flag

Besides -O3, the nvcc compiler driver also accepts flags to control the speed/accuracy tradeoff (ftz, prec-div, prec-sqrt, and fmad). Enabling them all with -use_fast_math shaved another 15ms.

C:\Users\fab>nvcc -O3 -o card_cudc -use_fast_math card_cudc.cu
C:\Users\fab>card_cudc > NUL 
Time: 203ms  

Skip second-stage (PTX) compilation

When nvcc runs, it does not generate GPU machine code but rather a portable bytecode called PTX. When the program runs, the PTX is translated into machine code specific to the GPU present. nvcc offers the ability to build fat binaries that embed machine code and skip the PTX step. Arnon Shimoni has a page summarizing the architectures and what parameters to use for each[11].

Fermi    Kepler    Maxwell    Pascal    Volta    Turing    Ampere
sm_20    sm_30     sm_50      sm_60     sm_70    sm_75     sm_80
         sm_35     sm_52      sm_61     sm_72
         sm_37     sm_53      sm_62

Unfortunately, generating a fat binary and skipping PTX generation did not result in better performance.

C:\Users\fab>nvcc -O3 -o card_cudc -gencode=arch=compute_60,code=sm_60 card_cudc.cu
C:\Users\fab>cuobjdump card_cudc.exe

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = 
host = windows
compile_size = 64bit

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer = 
host = windows
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_60
code version = [6,5]
producer = 
host = windows
compile_size = 64bit
compressed

C:\Users\fab>card_cudc > NUL 
Time: 203ms  

Malloc wall?

nvprof does not have a way to generate a unified timeline trace. Even with "--print-gpu-trace", the best you can get is two sections, with GPU activity on one side and API calls on the other.

C:\Users\fab>nvprof card_cudc > cudc.ppm
==3448== NVPROF is profiling process 3448, command: card_cudc
Time: 407ms
==3448== Profiling application: card_cudc
==3448== Profiling result: 
            Type  Time(%)      Time       Avg  Name
 GPU activities:   99.93%  88.634ms  88.634ms  GetColor(unsigned char*)
                    0.07%  60.703us  60.703us  [CUDA memcpy DtoH]
      API calls:   53.40%  147.01ms  147.01ms  cudaMalloc
                   32.22%  88.706ms  88.706ms  cudaDeviceSynchronize
                   14.03%  38.637ms  38.637ms  cuDevicePrimaryCtxRelease

It is not intuitive to read, but it looked like allocating a buffer on the GPU (cudaMalloc) was so slow that it accounted for 53% of the runtime. The NVidia Visual Profiler came in handy to confirm the diagnosis.

C:\Users\fab>nvvp -vm "c:\Program Files (x86)\jdk-8\bin\java"

I wasn't really sure why the allocation was so slow. After some googling, it turned out that cudaMalloc itself is not slow; rather, a CUDA context is created lazily on the first CUDA function call. Initializing the context up front with cudaSetDevice(0) shaved an extra 50ms.
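
A minimal sketch of the fix (assuming the timing starts inside main):

#include <cuda_runtime.h>

int main() {
  cudaSetDevice(0);   // pay the CUDA context creation cost up front,
                      // before the timer starts and the first cudaMalloc happens
  // ... timing, cudaMalloc, kernel launch, and cudaMemcpy follow ...
  return 0;
}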

C:\Users\fab>card_cudd > NUL
Time: 150ms

That is the last optimization I was able to come up with. I feel like I have barely scratched the surface of GPGPU optimization but it was a lot of fun. I will make sure to revisit the topic soon.

CUDA pain points

The world of CUDA was a lot of fun to explore but not devoid of pain points. Even on Windows, the development framework doesn't "just work". nvcc exploded on its first run, and I had to run a cryptic script provided by Visual Studio.

"c:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x86_amd64

The nvprof profiler also failed with a weird error, "missing dll, cupta64-102.dll not found". The PATH had to be modified manually.

PATH = %PATH%;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\extras\CUPTI\lib64

By far the most capricious tool was the Visual profiler, nvvp, which refused to work with anything but OpenJDK 8.

nvvp -vm "c:\Program Files (x86)\jdk-8\bin\java"

Besides the tools, the worst problems are those related to CUDA execution. Occasionally you are faced with enigmatic error messages.

Too Many Resources Requested for Launch

What are we supposed to do with that? According to cuda-programming.blogspot.com[12], this means that the number of registers available on the multiprocessor is being exceeded. You either need to reduce the number of threads per block or reduce the stack frame size.
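
In practice, that means something along these lines (the values below are illustrative, not taken from the actual code):

// Two common remedies:
//   1) Launch with fewer threads per block, leaving more registers per thread, e.g.
//        GetColor<<<numBlocks, 128>>>(output);   // 128 instead of 256 threads per block
//   2) Cap register usage per thread at compile time:
//        nvcc -O3 --maxrregcount=32 -o card_cuda card_cuda.cu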

Last but not least, with limited memory per thread, it is super easy to trigger a stack overflow. The recursive nature of the initial implementation crashed the kernel.

GPUassert: an illegal memory access was encountered card.cu 155
(Stack overflow)
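
If recursion really has to stay, CUDA does allow growing the per-thread stack from the host; a sketch of that band-aid (the iterative rewrite described earlier remains the better fix):

#include <cuda_runtime.h>

int main() {
  // Enlarge the per-thread stack before launching a recursive kernel.
  // The default is small (on the order of 1KB per thread).
  cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);
  return 0;
}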

References

[1]  Fast inverse square root
[2]  Decyphering the Business Card Raytracer
[3]  The Beautiful Machine
[4]  3.11 Options That Control Optimization
[5]  Scott Meyers: CPU Caches and Why You Care
[6]  The Beautiful Machine
[7]  A history of NVidia Stream Multiprocessor
[8]  NVIDIA Tesla V100
[9]  NVIDIA TITAN RTX
[10] NVIDIA GeForce GTX 1050 Ti
[11] Matching SM architectures for various NVIDIA cards
[12] Handling CUDA error messages

