What is the most beautiful piece of code you have ever seen?
Not only is it subjective, "beauty" is also a word that can mean a lot of things. If we were discussing raw speed, I would bring up the Fast Inverse Square Root[1]. If we were to judge by aesthetics, my answer would be Andrew Kensler's 1337-byte Business Card Raytracer.
Back in 2013 I spent a long time staring at Andrew's code in order to make sense of it[2]. I discussed at length how he managed to do so much with so little. Because of its real-estate constraint, his program took a while to run, and I remember wondering how fast it could have performed with some optimizations.
Since I have been staying at home a lot lately, I decided to spend some of my spare time on it. I started with SIMD, followed with compiler flags, then multi-core, and ended up involving the GPU with CUDA. It was a prolific weekend which saw the runtime drop from 101,000ms to 150ms!
#include <stdlib.h> // card > aek.ppm
#include <stdio.h>
#include <math.h>
typedef int i;typedef float f;struct v{
f x,y,z;v operator+(v r){return v(x+r.x
,y+r.y,z+r.z);}v operator*(f r){return
v(x*r,y*r,z*r);}f operator%(v r){return
x*r.x+y*r.y+z*r.z;}v(){}v operator^(v r
){return v(y*r.z-z*r.y,z*r.x-x*r.z,x*r.
y-y*r.x);}v(f a,f b,f c){x=a;y=b;z=c;}v
operator!(){return*this*(1/sqrt(*this%*
this));}};i G[]={247570,280596,280600,
249748,18578,18577,231184,16,16};f R(){
return(f)rand()/RAND_MAX;}i T(v o,v d,f
&t,v&n){t=1e9;i m=0;f p=-o.z/d.z;if(.01
<p)t=p,n=v(0,0,1),m=1;for(i k=19;k--;)
for(i j=9;j--;)if(G[j]&1<<k){v p=o+v(-k
,0,-j-4);f b=p%d,c=p%p-1,q=b*b-c;if(q>0
){f s=-b-sqrt(q);if(s<t&&s>.01)t=s,n=!(
p+d*t),m=2;}}return m;}v S(v o,v d){f t
;v n;i m=T(o,d,t,n);if(!m)return v(.7,
.6,1)*pow(1-d.z,4);v h=o+d*t,l=!(v(9+R(
),9+R(),16)+h*-1),r=d+n*(n%d*-2);f b=l%
n;if(b<0||T(h,l,t,n))b=0;f p=pow(l%r*(b
>0),99);if(m&1){h=h*.2;return((i)(ceil(
h.x)+ceil(h.y))&1?v(3,1,1):v(3,3,3))*(b
*.2+.1);}return v(p,p,p)+S(h,r)*.5;}i
main(){printf("P6 512 512 255 ");v g=!v
(-6,-16,0),a=!(v(0,0,1)^g)*.002,b=!(g^a
)*.002,c=(a+b)*-256+g;for(i y=512;y--;)
for(i x=512;x--;){v p(13,13,13);for(i r
=64;r--;){v t=a*(R()-.5)*99+b*(R()-.5)*
99;p=S(v(17,16,8)+t,!(t*-1+(a*(R()+x)+b
*(y+R())+c)*16))*3.5+p;}printf("%c%c%c"
,(i)p.x,(i)p.y,(i)p.z);}}
If you are not familiar with the business card raytracer and you don't want to read the breakdown, it is the blob above. It may look hairy but once cleaned up, card_base.cpp is pretty simple, with three functions named main, Trace, and Sample.
There is a comment at the top of the listing instructing how to run the program but not how to compile it. The baseline was established with a borderline pathological compilation command.
$> clang -o card -lm card.cpp
$> time ./card > card.ppm

real    1m41.829s
user    1m41.703s
sys     0m0.031s
I ran it on The Beautiful Machine[3], which features a Ryzen 5 2600 with 6 cores capable of 12 threads at 3.5GHz. It took 101 seconds to render the ppm image.
This was already an improvement compared to 2013, when a MacBook Air with a Core i5 running at 1.7GHz resulted in a 131s runtime.
A quick benchmark showed that the total cost of I/O (the printf() calls that generate the ppm) accounted for one millisecond and was therefore negligible.
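To give an idea of how this can be checked, here is a minimal sketch (not the author's benchmark): store the pixels in a memory buffer and time a single write of the whole image, isolating the I/O cost from the rendering cost.

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  // Placeholder pixels; in the real program these come from the raytracer.
  std::vector<unsigned char> image(512 * 512 * 3, 128);

  auto start = std::chrono::steady_clock::now();
  printf("P6 512 512 255 ");
  fwrite(image.data(), 1, image.size(), stdout);   // one bulk write of the ppm body
  auto end = std::chrono::steady_clock::now();

  long long us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  fprintf(stderr, "I/O: %lld us\n", us);           // report on stderr so the ppm stays clean
}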
The first optimization on my list was to use SIMD instructions to speed up vector manipulation. Opening the binary with Binary Ninja revealed that clang had already managed to leverage the SSE registers.
In fact, attempting to disable SSE proved downright impossible since the -mno-sse flag crashed Clang.
$> clang -o card -lm -mno-sse card.cpp
[...]
card.cpp:32:5: error: SSE register return with SSE disabled
int TraceRay(v origin, v destination, float &t, v &normal) {
    ^
fatal error: error in backend: Cannot pop empty stack!
clang: error: clang frontend command failed with exit code 70
With SIMD already taken care of, the next step was to play with compiler flags. Instead of going directly for the most aggressive form of optimization (-O3), I took a detour to peek at other levels. The gcc documentation[4] explains -O1 well.
With -O1, the compiler "tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time". This improved performance by almost 50% and brought the runtime down to 52s.
$> clang -o card -lm -O1 card.cpp
$> time ./card > /dev/null

real    0m52.874s
user    0m52.750s
sys     0m0.016s
Comparing the instructions generated at -O0 and -O1 reveals that the Vector class methods are still functions, but some of them no longer require a stack frame, such as the v::operator* method.
v::operator*(float):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 48
        mov     qword ptr [rbp - 24], rdi
        movss   dword ptr [rbp - 28], xmm0
        mov     rax, qword ptr [rbp - 24]
        movss   xmm0, dword ptr [rbp - 28]
        movss   xmm1, dword ptr [rax]
        movss   xmm2, dword ptr [rax + 4]
        mulss   xmm1, xmm0
        mulss   xmm2, xmm0
        movss   xmm3, dword ptr [rax + 8]
        mulss   xmm3, xmm0
        lea     rdi, [rbp - 16]
        movaps  xmm0, xmm1
        movaps  xmm1, xmm2
        movaps  xmm2, xmm3
        call    v::v(float, float, float)
        mov     ecx, dword ptr [rbp - 8]
        mov     dword ptr [rbp - 40], ecx
        mov     rax, qword ptr [rbp - 16]
        mov     qword ptr [rbp - 48], rax
        movsd   xmm0, qword ptr [rbp - 48]
        movss   xmm1, dword ptr [rbp - 40]
        add     rsp, 48
        pop     rbp
        ret
v::operator*(float):
        sub     rsp, 24
        movaps  xmm2, xmm0
        movss   xmm0, dword ptr [rdi]
        mulss   xmm0, xmm2
        movss   xmm1, dword ptr [rdi + 4]
        mulss   xmm1, xmm2
        mulss   xmm2, dword ptr [rdi + 8]
        lea     rdi, [rsp + 8]
        call    v::v(float, float, float)
        movsd   xmm0, qword ptr [rsp + 8]
        movss   xmm1, dword ptr [rsp + 16]
        add     rsp, 24
        ret

Left: No optimization. Right: -O1 optimization.
Binary Ninja was a good tool to look at individual binaries. However, when the purpose is comparison, I found Compiler Explorer to perform better. It is a free tool whose output for this code can be seen here.
I skipped level 2 and went directly to level 3. At this level all optimizations are applied, regardless of compilation time and binary size. The documentation warns that increasing the volume of instructions may be detrimental to performance. Our instance does not fall under this condition and the runtime dropped again, to 11s.
$> clang -o card -lm -O3 card.cpp
$> time ./card > /dev/null

real    0m11.550s
user    0m11.531s
sys     0m0.000s
At this level, the Vector class is completely melted away by the compiler: all v:: operators are gone from the symbol table. A particularly rich example is the normal assignment line found in the TraceRay function.
int TraceRay(v origin, v destination, float &t, v &normal) {
  .
  normal = v(0, 0, 1);
  .
}

This line is compiled to 11 instructions with -O1 but only two with -O3.
movss   dword [rbx], xmm1
lea     rdi, [rsp+0x8 {var_a0}]
movss   xmm2, dword [rel data_401298]
xorps   xmm0, xmm0  {0x0}
xorps   xmm1, xmm1  {0x0}
call    v::v
mov     eax, dword [rsp+0x10 {var_98}]
mov     rcx, qword [rsp+0x28 {var_80}]
mov     dword [rcx+0x8], eax
mov     rax, qword [rsp+0x8 {var_a0}]
mov     qword [rcx], rax
mov     qword [r14], 0x0
mov     dword [r14+0x8], 0x3f800000

Left: -O1. Right: -O3.
On top of all the inlining (the v::v constructor is gone, and the target is written into directly), clang moved some compile-time vector constants out of the data section and into the instruction stream as immediates, avoiding a memory read for the operand. This is the case for the value 1.0f (0x3f800000).
Clang also detected two adjacent 4-byte values set to zero and decided to write them with a single 8-byte store (qword vs dword).
So far the raytracer was single-threaded. For the next step, I threw all 12 hardware threads of the Ryzen 5 at the problem. Minor refactoring resulted in card_cores.cpp, which spawns 512 pthreads, each rendering a segment of 512 contiguous pixels.
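Here is a minimal sketch of that threading scheme (hypothetical names and placeholder pixel work, not the actual card_cores.cpp): one pthread per segment of 512 contiguous pixels, all writing into a shared framebuffer.

#include <pthread.h>
#include <cstdio>

const int WIDTH = 512, HEIGHT = 512, SEGMENT = 512;
static unsigned char framebuffer[WIDTH * HEIGHT * 3];

// Stand-in for the real per-pixel work (camera setup + 64 Sample() rays).
static void RenderPixel(int index, unsigned char *rgb) {
  rgb[0] = rgb[1] = rgb[2] = index & 0xFF;
}

static void *RenderSegment(void *arg) {
  long id = (long)arg;                        // segment index, 0..511
  for (int i = 0; i < SEGMENT; i++) {
    int pixel = (int)id * SEGMENT + i;
    RenderPixel(pixel, &framebuffer[pixel * 3]);
  }
  return nullptr;
}

int main() {
  pthread_t threads[WIDTH * HEIGHT / SEGMENT];              // 512 threads
  for (long t = 0; t < WIDTH * HEIGHT / SEGMENT; t++)
    pthread_create(&threads[t], nullptr, RenderSegment, (void *)t);
  for (int t = 0; t < WIDTH * HEIGHT / SEGMENT; t++)
    pthread_join(threads[t], nullptr);

  printf("P6 512 512 255 ");
  fwrite(framebuffer, 1, sizeof(framebuffer), stdout);
}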
$> clang -O3 -o card_cores -lm -lpthread card_cores.cpp
$> time ./card_cores > /dev/null

real    0m26.758s
user    0m34.531s
sys     4m35.453s
Oops. Throwing 12 threads at the task made it take twice as long to complete. I was surprised by the result since I had been careful to avoid false sharing, an issue well described by Scott Meyers[5] that arises when cache coherency traffic dominates the runtime. I had obviously failed to cuddle the cachelines. If you go back to card_cores.cpp, can you spot the mistake?
If you guessed libc's rand(), you were right. This version of the random generator maintains an internal state between calls, which means every core calling rand() dirtied the cacheline containing that state for every other core. The solution was to re-implement the random number generator with thread-local state.
thread_local int seed = 1;

float R() {
  seed = (214013 * seed + 2531011);
  return ((seed >> 16) & 0xFFFF) / 65535.0f; // normalize the 16-bit value to [0,1]
}
multi_thard.cpp showed a tremendous improvement, achieving close to a 10x speedup compared to the single-threaded version.
$> clang -O3 -o multi_thard -lm -lpthread multi_thard.cpp
$> time ./multi_thard > /dev/null

real    0m1.291s
user    0m14.734s
sys     0m0.109s
At this point, the Business Card Raytracer runtime was down from 101 seconds to 1.3 seconds. I assessed it was close to the best I could do while relying only on the CPU. It was time to move on to something beefier.
After playing with compiler optimizations and multi-threading, the next step was to involve the GPU. It was good timing since I had just finished building myself a new PC[6]. For the first time in 15 years, my GPU (a Pascal GeForce GTX 1050 Ti) was only one generation behind the latest, Turing.
For GPGPU programming there are two main choices: OpenCL and CUDA. My years studying at the University of id Software and the adage "open standards are Good and Portable" triggered, so I first contemplated OpenCL. After a little bit of investigation, it looked a lot like GLSL, where you need to create buffers, bind variables, and call createProgram with the source code of the kernel. I had painful memories of dealing with compiler implementation details with this method. There was some hope in using SPIR-V and clCreateProgramWithBinary, but reports of it not being supported on certain platforms buried it quickly. In the end, I went with CUDA.
It turned out to be a delightful experience where a programmer only needs to tag code/data locations via special qualifiers such as __global__ and __device__, with the nvcc compiler taking care of almost everything.
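To give an idea of what this looks like, here is a minimal hypothetical kernel (not the actual card_cuda.cu): __device__ marks code that runs on the GPU and is callable only from GPU code, __global__ marks the kernel entry point launched from the host, and nvcc takes care of the rest.

#include <cstdio>
#include <cuda_runtime.h>

__device__ float Brighten(float value) {               // runs on the GPU only
  return value * 1.5f;
}

__global__ void Shade(float *pixels, int count) {      // launched from the host
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < count) pixels[i] = Brighten(pixels[i]);
}

int main() {
  const int count = 512 * 512;
  float *pixels;
  cudaMalloc(&pixels, count * sizeof(float));           // allocate on the GPU
  Shade<<<(count + 255) / 256, 256>>>(pixels, count);   // 1024 blocks of 256 threads
  cudaDeviceSynchronize();
  cudaFree(pixels);
  return 0;
}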
There are a lot of books about CUDA. They all seem to have a Borg cube on their cover.
C:\Users\fab>nvcc -o card_cuda -O3 card_cuda.cu
warning : Stack size for entry function '_Z8GetColorPh' cannot be statically determined

C:\Users\fab>card_cuda > NUL
Time: 984ms
The warning is probably not a good thing (I'll get back to this later) but the program ran successfully. With zero optimization effort, the GPU outperformed 12 CPU threads by 30%.
To improve the runtime further, I found "Professional CUDA C Programming" to be a gold mine. The book explains in great detail how CUDA cores and streaming multiprocessors work and how to profile and optimize a kernel. The rest of this article is rooted in this book; I highly recommend it.
Over the past two decades, NVidia GPU architecture has evolved drastically[7]. Fortunately, the programming model has remained unchanged. A GPU is fed thousands of threads. Those are grouped in packs of 32 (called warps) where all of them share the same Instruction Pointer (hence the name SIMT) and execute the same instruction in lockstep. Performance-wise, the most important thing is to keep the streaming multiprocessors (SMs) well fed with instructions, data, and warps.
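A toy example (unrelated to the raytracer) makes the lockstep behavior concrete: when threads of the same warp take different branches, the warp executes both paths one after the other with part of its lanes masked off, which is what the branch efficiency metric below captures.

#include <cuda_runtime.h>

__global__ void Divergent(float *out) {
  int i = threadIdx.x;
  if (i % 2 == 0)                  // even and odd lanes of the same warp diverge:
    out[i] = sinf((float)i);       // half the warp runs this path...
  else
    out[i] = cosf((float)i);       // ...then the other half runs this one
}

int main() {
  float *out;
  cudaMalloc(&out, 32 * sizeof(float));
  Divergent<<<1, 32>>>(out);       // a single warp of 32 threads
  cudaDeviceSynchronize();
  cudaFree(out);
  return 0;
}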
NVidia provides a profiling tool named nvprof which allows measuring metrics such as occupancy and thread divergence within a warp.
C:\Users\fab>nvprof --metrics achieved_occupancy,branch_efficiency card_cuda > NUL

Branch Efficiency             99.90%
Average Achieved Occupancy    0.123891
"Branch Efficiency" measures how much threads in a warp diverge. The figure for our raytracer is good. I was expecting it to perform badly since rays bounce everywhere but it seems spacial locality helped thread to be coherent.
"Achieved Occupancy" on the other side was abysmal (best is 1.0). It seems the GPU SMs were not able to keep enough warp in-flight which resulted in starvation. This is correlated with the warning nvcc issued. Because the raytracer uses recursion, it uses a lot of stacks. So much actually that the SM cannot keep more than a few alive.
The problem is the recursive nature of the code. That is why the compiler was unable to guess the stack size earlier.
Revising the raytracer to use a loop instead of recursion (card_cudb.cu) reduced the amount of stack consumed. This translated in an occupancy twice as good and halved runtime!
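The general shape of the transformation can be shown on a self-contained toy (the real loop in card_cudb.cu is of course more involved): a recursion that accumulates "color + 0.5 * recurse()" becomes a loop carrying an attenuation factor, so the per-thread stack stays flat.

#include <cstdio>
#include <cuda_runtime.h>

__device__ float BounceRecursive(float energy, int depth) {
  if (depth == 0) return energy;
  return energy + 0.5f * BounceRecursive(energy, depth - 1);  // grows the stack
}

__device__ float BounceIterative(float energy, int depth) {
  float total = 0.0f, attenuation = 1.0f;
  for (int i = 0; i <= depth; i++) {          // same math, constant stack usage
    total += attenuation * energy;
    attenuation *= 0.5f;
  }
  return total;
}

__global__ void Compare(float *out) {
  out[0] = BounceRecursive(1.0f, 8);
  out[1] = BounceIterative(1.0f, 8);
}

int main() {
  float *d, h[2];
  cudaMalloc(&d, 2 * sizeof(float));
  Compare<<<1, 1>>>(d);
  cudaMemcpy(h, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
  printf("recursive=%f iterative=%f\n", h[0], h[1]);           // both print 1.996094
  cudaFree(d);
  return 0;
}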
C:\Users\fab>nvcc -o card_cudb -O3 card_cudb.cu

C:\Users\fab>nvprof --metrics achieved_occupancy card_cudb > NUL
Average Achieved Occupancy    0.282478

C:\Users\fab>card_cudb > NUL
Time: 546ms
Another recommendation from the book was to avoid double precision if possible. nvprof showed that no FP64 instructions were used in the kernel.
C:\Users\fab>nvprof --metrics flop_count_dp,flop_count_sp card_cudb > NUL

Average Floating Point Operations (Double Precision)    0
Average Floating Point Operations (Single Precision)    2.3327e+10
Support for half-floats was added by CUDA 7.5 in 2015. Initially this only helped with storage, but cards based on the Volta[8] and Turing[9] architectures provide a 2x performance boost when crunching FP16s. Sadly, previous generations such as my Pascal GTX 1050 Ti crunch FP16... 64 times slower than FP32[10]. Nothing to see here.
Next on the list was to use CUDA intrinsics. These functions cover every math operation from math.h and more. There are also variants that trade accuracy for speed.
#define POW         __powf
#define CEIL        ceilf
#define RSQRT(x)    rsqrtf(x)
#define SQRT(x)     sqrtf(x)
#define DIVIDE(x,y) __fdividef((x),(y))
The resulting intrinsic-based card_cudc.cu proved twice as fast as the previous iteration.
C:\Users\fab>card_cudc > NUL
Time: 218ms
Besides -O3, the nvcc compiler driver also accepts flags to control the speed/accuracy tradeoff (ftz, prec-div, prec-sqrt, and fmad). They are all enabled with -use_fast_math, which shaved another 15ms.
C:\Users\fab>nvcc -O3 -o card_cudc -use_fast_math card_cudc.cu

C:\Users\fab>card_cudc > NUL
Time: 203ms
When nvcc runs, it does not generate GPU machine code but rather a portable bytecode called PTX. When the program runs, the PTX is compiled into code specific to the GPU available. nvcc also offers the ability to build fat binaries that embed pre-compiled code and skip the PTX step. Arnon Shimoni has a page summarizing the architectures and what parameters to use with them[11].
Fermi    Kepler   Maxwell  Pascal   Volta    Turing   Ampere
sm_20    sm_30    sm_50    sm_60    sm_70    sm_75    sm_80
         sm_35    sm_52    sm_61    sm_72
         sm_37    sm_53    sm_62
Unfortunately, generating a fat binary and skipping the PTX step did not result in better performance.
C:\Users\fab>nvcc -O3 -o card_cudc -gencode=arch=compute_60,code=sm_60 card_cudc.cu

C:\Users\fab>cuobjdump card_cudc.exe

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer =
host = windows
compile_size = 64bit

Fatbin elf code:
================
arch = sm_60
code version = [1,7]
producer =
host = windows
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_60
code version = [6,5]
producer =
host = windows
compile_size = 64bit
compressed

C:\Users\fab>card_cudc > NUL
Time: 203ms
nvprof does not have a way to generate a unified timeline trace. Even with "--print-gpu-trace", the best you can get is two sections, with GPU activity on one side and API calls on the other.
C:\Users\fab>nvprof card_cudc > cudc.ppm
==3448== NVPROF is profiling process 3448, command: card_cudc
Time: 407ms
==3448== Profiling application: card_cudc
==3448== Profiling result:
            Type  Time(%)      Time       Avg  Name
 GPU activities:   99.93%  88.634ms  88.634ms  GetColor(unsigned char*)
                    0.07%  60.703us  60.703us  [CUDA memcpy DtoH]
      API calls:   53.40%  147.01ms  147.01ms  cudaMalloc
                   32.22%  88.706ms  88.706ms  cudaDeviceSynchronize
                   14.03%  38.637ms  38.637ms  cuDevicePrimaryCtxRelease
It is not intuitive to read, but it looked like allocating a buffer on the GPU (cudaMalloc) was so slow that it accounted for 53% of the runtime. NVidia Visual Profiler came in handy to confirm the diagnosis.
C:\Users\fab>nvvp -vm "c:\Program Files (x86)\jdk-8\bin\java"
I wasn't really sure why the allocation was so slow. After some googling, it turned out that it is not cudaMalloc itself that is slow; rather, a CUDA context is created on the first CUDA function call. Properly initializing the context up front with cudaSetDevice(0) shaved an extra 50ms.
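For illustration, here is a sketch of that kind of warm-up (the actual card_cudd.cu may differ): force the context to exist before timing anything, so that cudaMalloc only pays for the allocation itself.

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  using Clock = std::chrono::steady_clock;

  auto t0 = Clock::now();
  cudaSetDevice(0);                       // select the GPU...
  cudaFree(0);                            // ...and force context creation now
  auto t1 = Clock::now();

  unsigned char *image;
  cudaMalloc(&image, 512 * 512 * 3);      // measured without the context cost
  auto t2 = Clock::now();

  printf("context: %lld ms, cudaMalloc: %lld ms\n",
         (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
         (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());

  cudaFree(image);
  return 0;
}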
C:\Users\fab>card_cudd > NUL
Time: 150ms
That is the last optimization I was able to come up with. I feel like I have barely scratched the surface of GPGPU optimization but it was a lot of fun. I will make sure to revisit the topic soon.
The world of CUDA was a lot of fun to explore but not devoid of pain points. Even on Windows, the development framework doesn't "just work". nvcc exploded on first run and I had to use some cryptic script provided by Visual Studio.
"c:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x86_amd64
The profiler nvprof also failed with a weird error, "missing dll, cupti64-102.dll not found". PATH had to be modified manually.
PATH = %PATH%;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\extras\CUPTI\lib64
By far the most capricious tool was the Visual profiler, nvvp, which refused to work with anything but OpenJDK 8.
nvvp -vm "c:\Program Files (x86)\jdk-8\bin\java"
Besides the tools, the worst problems are those related to CUDA execution. Occasionally you can be faced with enigmatic error messages.
Too Many Resources Requested for Launch
What are we supposed to do with that? According to cuda-programming.blogspot.com[12], this means that the number of registers available on the multiprocessor has been exceeded. You either need to reduce the number of threads per block or reduce the stack frame size.
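For illustration, here are two ways of doing that on a hypothetical kernel: launch with fewer threads per block, and/or let the compiler cap register usage with __launch_bounds__.

#include <cuda_runtime.h>

__global__ void __launch_bounds__(128)      // promise at most 128 threads per block
HeavyKernel(float *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = (float)i;                        // stand-in for register-hungry work
}

int main() {
  const int count = 512 * 512;
  float *out;
  cudaMalloc(&out, count * sizeof(float));
  HeavyKernel<<<count / 128, 128>>>(out);   // 128 instead of, say, 512 threads per block
  cudaDeviceSynchronize();
  cudaFree(out);
  return 0;
}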
Last but not least, with limited memory per thread, it is super easy to trigger a stack overflow. The recursive nature of the initial implementation crashed the kernel.
GPUassert: an illegal memory access was encountered card.cu 155 (Stack overflow)