September 16th, 2011

Quake 2 Source Code Review 3/4

Quake2 software render is the biggest, most complicated and hence most exciting module to explore.

Quake 2 Source Code Review 1/4 (Intro)
Quake 2 Source Code Review 2/4 (Polymorphism)
Quake 2 Source Code Review 3/4 (Software Renderer)
Quake 2 Source Code Review 4/4 (OpenGL Renderer)

Software Renderer

From disk to pixel there are no hidden mecanisms here: Everything has been carefully coded and optimized by hand. It is the last of its kind, marking the end of an era before the industry moved to hardware accelerated only.

The fundamental thing that differentiate the software and OpenGL renderer is the usage of a 256 colors palette system instead of the usual 24 bits RGB true color system found everywhere nowaday:

If we compare the hardware accelerated and the software renderer, the two most obvious differences are:

But except for those the engine managed to do amazing things via very clever arrangements of the palette that I detail later in this page:

The Quake2 palette is loaded first from the PAK archive file pics/colormap.pcx:

Note : Black is at 0, White at 15, Green at 208 Red at the bottom left is 240 (transparent is 255).

The first thing done is to use those 256 colors and re-arrange them as seen in pics/colormap.pcx:

This 256x320 re-arrangement is used as lookup table and is incredibly clever, unlocking so many many sweat things:

Trivia : Quake2's software renderer was initially supposed to be RGB based instead of palette based thank to MMX technology as John Carmack announced it after the release of Quake1 in this video (at 10m17s mark):

MMX is a SIMD technology that would have allowed to work on the three channels of RGB at the cost of one channel hence enabling blending with an acceptable CPU consumption. My guess is that it was scrapped for two reasons:

Global architecture

Now that the big limitation is set (palette), we can go on the global architecture of the renderer. The philosophy is to rely heavily on what the Pentium is good at (floating point calculation) in order to minimize the impact of its weakness: Bus speed that affects writing pixels to memory. Most of the rendering path focuses on achieving ZERO overdraw. In essence the software rendering path is identical to Quake 1's software renderer relying heavily on the BSP to traverse the map and the PVS (Potentially visible set) to build the set of polygons to render. Three different things are rendered each frames:

Note : If you are unfamiliar with those "old" algorithms, I highly recommend to read Chapter 3.6 and 15.6 from Computer graphics: principles and practice By James D. Foley. You can also find a lot of information in chapters 59 to 70 in Graphics programming black book by Michael Abrash.

Here it is a pseudo-highlevel-code:

  1. Render the map
    • Walk the pre-processed BSP to determine in which cluster we are.
    • Query the PVS database for this particular cluster: Retrieve and decompress the PVS.
    • Using the PVS bit vector: Mark as visible every polygon belonging to a cluster designated as Potentially Visible.
    • Walk the BSP again, this time check if visible cluster are affected by any dynamic light. If so, mark them with the light's id number.
    • Now we have a list of what is potentially visible with lighting information:
      • Use the BSP again, walk it near to far:
        • Test against view frustrum and project all polygons on screen: Build a Global Edge Table.
        • Render each face's surface (surface=texture+lightmap) to a cache in RAM.
        • Insert surface pointer on the surface stack (later used for coherency).
      • Using the surface stack and the Global Edge Table, generate an Active Edge Table and render the screen line by line from top to bottom. Use the surface cache in RAM as texture. Write spans to offscreen buffer and also populate a Z-Buffer.
  2. Render the entities, use the Z-Buffer to determine which part of the entity is visible.
  3. Render translucent textures, use a cool palette trick to calculate the translucent palette indices per pixel.
  4. Render particles.
  5. Perform post-effect (damage blend full screen with a red color).

Screen rendition : Palette indices are written to an offscreen buffer. Depending if the engine runs fullscreen or windowed, DirectDraw or GDI are respectively used. When the offscreen framebuffer has been completed, it is blitted to the videocard backbuffer (GDI=>rw_dib.c DirectDraw=>rw_ddraw.c) using either WinGDI.h's BitBlt or ddraw.h's BltFast.

DirectDraw Vs GDI

The kind of things programmers hard to deal with around 1997 is just saddening: See John Carmack's funny comment in the source code:

	// can be negative for stupid dibs

If Quake2 was running fullscreen with DirectDraw it had to draw the offscreen buffer from top to bottom: the way it would appear on the screen. But if it was running in windowed mode with Window's GDI, it had to draw the offscreen buffer in a DIB buffer.....VERTICALLY FLIPPED because most Graphic vendors drivers were blitting the DIB image from RAM to VRAM in flipmode (you have to wonder if the 'I' in GDI really stands for "Independent").

So the offscreen framebuffer navigation had to be abstracted via a structure and values were initialized differently to abstract the difference. This is a crude way to do polymorphism in C ;).

    typedef struct
	    pixel_t        *buffer;         // The offscreen framebuffer
	    pixel_t        *colormap;       // The color gradient palette: 256 * VID_GRADES size
	    pixel_t        *alphamap;       // The translucency palette  : 256 * 256 translucency map
	    int            rowbytes;        // may be > width if displayed in a window
                                        // can be negative for stupid dibs
	    int            width;          
	    int            height;
    } viddef_t;
    viddef_t vid ;

Depending on the way the framebuffer had to be drawn, vid.buffer was initialized to either the first pixel of:

In order to navigate up or down, vid.rowbytes was either initialized to vid.width or -vid.width.

Now to navigate whether we had to draw normally or vertically flipped doesn't matter:

    // First pixel of first row
    byte* pixel = vid.buffer + vid.rowbytes * 0;
    // First pixel of last row
    pixel = vid.buffer + vid.rowbytes * (vid.height-1);
    // First pixel of row i (row numbering starts at 0)
    pixel = vid.buffer + vid.rowbytes * i;
    // Clear the framebuffer to black
    memset(vid.buffer,0,vid.height*vid.height) ;   // <-- This will work in fullScreen DirectDraw mode 
                                                   //     but crash in windowed mode :( !

This trick allowed the entire rendition pipeline to never worry about the underlying blitter and I think it is pretty awesome.

Map preprocessing

Before digging in the code further it is important to understand two important databases generated during map preprocessing:

BSP Slicing, PVS Generation

I invite you to read more about Binary Space Partitioning:

Just like in Quake1, Quake2 maps are heavily pre-processed : The volume is sliced recursively as in the following drawing:

In the end, the map had been sliced in convex 3D spaces (called clusters) and just like in Doom and Quake1 it can be used to sort all polygons around Near to Far or Far to Near.

The great addition is the PVS (Potentially Visible Set), a set of bit vector (one entry per cluster) that can be seen as a database that can be queried to retrieve what clusters are Potentially Visible from any cluster. This database is huge (5MB) but efficiently compressed to a few hundred of KB so it can fit nicely in RAM.

Note : The Run-length compression used to shrink the PVS can skip only 0x00s, this process is detailed later.


Just like in Quake1 the map's light impact were precalculated and stored in textures called lightmaps . Contrary to Quake1, Quake2 used radiosity and colored light during the precalculation. The lightmaps were then stored in the PAK archive and used at runtime:

A few words from one of the creator: Michael Abrash in "Black Book of Programming" (Chapitre: "Quake: a post-mortem and a glimpse into the future"):

The most interesting graphics change is in the preprocessing, where John has added support for radiosity lighting...

Quake 2 levels are taking up to an hour to process.
(Note, however, that that includes BSPing, PVS calculations, and radiosity lighting, which I'll discuss later.)

If you don't know what radiosity illumination is, read this extremely well illustrated article (mirror) : it is pure gold .

Here is the first level, textured with the radiosity texture only: Unfortunately the gorgeous colored RGB had to be resampled to grayscale for the software renderer (more about this later).

The low resolution of the lightmap is striking here but since they are bilinear filtered (yes even in the software renderer) the end result combined to a color texture is very good.

Trivia :The lightmaps could be any size from 2x2 up to 17x17 (contrary to the stated maxsize of 16 in flipcode's article (mirror)) and did not have to be square.

Code architecture

Most of the software renderer code is in the method R_RenderFrame. Here is a summary but for a more complete breakdown check out my raw notes.

       R_SetupFrame            // Set pdrawfunc function pointer, depending if we are rendering for GDI or DirectDraw
          Mod_PointInLeaf      // Determine what is the current view cluster (walking the BSP tree) and store it in r_viewcluster
       R_MarkLeaves            // Using the current view cluster (r_viewcluster), retrieve and decompress 
                               // the PVS (Potentially Visible Set)
       R_PushDlights           // For each dlight_t* passed via r_newrefdef.dlights, mark polygons affected by a light.
                               // Build the Global Edge Table and render it via the Active Edge Table 
       R_EdgeDrawing           // Render the map	
          R_BeginEdgeFrame     // Set function pointer pdrawfunc used later in this function 
          R_RenderWorld        // Build the Global Edget Table 
                               // Also populate the surface stack and count # surfaces to render (surf_max is the max)
          R_DrawBEntitiesOnList// I seriously have no idea what this thing is doing.
          R_ScanEdges          // Use the Global Edge Table to maintin the Active Edge Table: Draw the world as scanlines  
                               // Write the Z-Buffer (but no read)                              
              for (iv=r_refdef.vrect.y ; iv<bottom ; iv++)
                R_InsertNewEdges     // Update AET with GET event
                (*pdrawfunc)();      //Generate spans
                D_DrawSurfaces       //Draw stuff on screen
                // Flush span buffer
                R_InsertNewEdges     // Update AET with GET event
                (*pdrawfunc)();      //Generate spans
                D_DrawSurfaces       //Draw stuff on screen
       R_DrawEntitiesOnList    // Draw enemies, barrel etc...	
                               // Use Z-Buffer in read mode only.
       R_DrawParticles	       // Duh !
       R_DrawAlphaSurfaces     // Perform pixel palette blending	ia the pics/colormap.pcx lower part lookup table.
       R_SetLightLevel	       // Save off light value for server to look at (BIG HACK!)

       R_CalcPalette           // Modify the palette (when taking hit or pickup item) so all colors are modified     

R_SetupFrame: BSP Manipulation

Binary Space Partition walking is done everywhere in the code and it's a powerful constant speed mechanism to sort polygons near to far or far to near. The key is to understand the cplane_t structure:

    typedef struct cplane_s
        vec3_t   normal;
        float    dist;    
    } cplane_t;

In order to calculate the distance or a point from a splitting plan in a node, we only have to inject its coordinates in the plane equation:

    int d = DotProduct (p,plane->normal) - plane->dist;

With the sign of d we know if we are in front or behind the plane and sorting can be done. It is a process that is used from Doom to Quake3.

R_MarkLeaves: Runlength PVS decompression

The way I understood PVS decompression in my Quake 1 Engine code review was plain wrong: It is not the distance between 1 bits that is encoded but only the number of 0x00 bytes to write. Only 0x00 is runlength compressed in the PVS: Upon reading the compressed stream:

In the first case: Nothing was compressed. Runlength compression only really occurs in the second case:

    byte *Mod_DecompressVis (byte *in, model_t *model)
        static byte  decompressed[MAX_MAP_LEAFS/8];  // A leaf = 1bit so only 65536 / 8 = 8,192 bytes
                                                     // are needed to store a fully decompressed PVS.
        int   c;
        byte   *out;
        int    row;

        row = (model->vis->numclusters+7)>>3;	
        out = decompressed;

            if (*in)            // This is a non null byte, write it as is and continue to the next compressed byte.
                *out++ = *in++;
        c = in[1];              // We did not "continue" so this is a null byte: Read the next byte (in[1]) and write as
        in += 2;                // many byte as requested in the uncompressed PVS.
        while (c)
            *out++ = 0;

        } while (out - decompressed < row);
        return decompressed;


We can skip at most 255 bytes (255*8 leafs), a new zero must be issued with a number after to skip an other 255 bytes if necessary. Skipping 511 bytes (511*8 leafs) hence consumes 4 bytes: 0 - 255 - 0 - 255


    // Example of a system with 80 leafs, represented on 10 bytes: visible=1, invisible=0

    Binary      Uncompressed PVS
      0000 0000    0000 0000  0000 0000   0000 0000    0011 1100     1011 1111    0000 0000    0000 0000    0000 0000   0000 0000 
    Hexadecimal Uncompressed PVS
      0x00 0x00 0x00 0x00 0x39 0xBF 0x00 0x00 0x00 0x00
    // !! COMPRESSION !!  
    Binary      Compressed PVS
      0000 0000    0000 1000  0011 1100     1011 1111    0000 0000    0000 1000
    Hexadecimal Compressed PVS
      0x00 0x04 0x39 0xBF 0x00 0x04     

Once the PVS of the current cluster has been decompressed, every single face belonging to a cluster designated as visible by the PVS is marked as visible too:

Q : With a decompressed PVS, given a cluster ID i, how you test if it is visible ?
A : Bit AND between the i/8's byte of the PVS and 1 << (i % 8)

    char isVisible(byte* pvs, int i)
	    // i>>3 = i/8    
	    // i&7  = i%8
	    return pvs[i>>3] & (1<<(i&7))

Just like in Quake1, there is a nice trick used to mark a polygon as "visible": Instead of using a flag and have to reset all of them at the beginning of each frame, an int was used. Every beginning of frame a framecounter r_visframecount was increased by one in R_MarkLeaves. After decompressing the PVS, all zones are marked visible by setting its visframe field to the current value of r_visframecount.
Later in the code, node/cluster visibility is always tested as follow:

    if (node->visframe == r_visframecount)
       // Node is visible

R_PushDlights: Dynamic lighting

For each active dynamic light lightID the BSP walked recursively starting from the light's position. Calculate the distance light <-> Cluster and if the light intensity if greater than the distance from the cluster, mark all surfaces in the cluster as affected by this light ID.

Note : Surfaces are marked via two fields:

R_EdgeDrawing: Map rendition

R_EdgeDrawing is the kraken of the software renderer, the most complex and hardest to understand. The key to get it is to see the main data structure: A stack of surf_t (acting as proxy to m_surface_t) allocated on the CPU stack.


    //Surface stack system                                                                     
    surf_t  *surfaces  ;   // Surface stack array                                              
    surf_t  *surface_p ;   // Surface stack top                                                
    surf_t  *surf_max  ;   // Surface stack upper limit     
    // surfaces are generated in back to front order by the bsp, so if a surf                  
    // pointer is greater than another one, it should be drawn in front                        
    // surfaces[1] is the background, and is used as the active surface stack                  

    Note: The first element in surfaces is the dummy surface
    Note: The second element in surfaces is the background surface

This stack is populated when walking the BSP near to far. Each visible polygon is inserted in the stack as a surface proxy. Later when walking an Active Edge Table to generate a line of the screen, it allows to see very fast which polygon is in front of all the others by simply comparing the address in memory (the lower in the stack = the closer to the Point of View). This is how "coherence" is done for the scan-line conversion algorithm.

Note : Each entry in the stack also features a linked list (called Texture Chain), pointing to elements in the span buffer stack (detailed just after). Spans are stored in a buffer and drawn from the texture chain in order to group span per texture and maximize the CPU pre-caching buffer.

The stack is initialized at the very beginning of R_EdgeDrawing:

    void R_EdgeDrawing (void)
        // Total size: 64KB
        surf_t	lsurfs[NUMSTACKSURFACES +((CACHE_SIZE - 1) / sizeof(surf_t)) + 1];

        surfaces =  (surf_t *) (((long)&lsurfs[0] + CACHE_SIZE - 1) & ~(CACHE_SIZE - 1));
        surf_max = &surfaces[r_cnumsurfs];

        // surface 0 doesn't really exist; it's just a dummy because index 0
        // is used to indicate no edge attached to surface
        R_SurfacePatch ();

        R_BeginEdgeFrame ();    // surface_p = &surfaces[2];	
                                // background is surface 1,
                                // surface 0 is a dummy
        R_ScanEdges          // Write the Z-Buffer (but no read)                              
              for (iv=r_refdef.vrect.y ; iv<bottom ; iv++)
                R_InsertNewEdges     // Update AET with GET event
                (*pdrawfunc)();      //Generate spans
                D_DrawSurfaces       //Draw stuff on screen
                                     // Flush span buffer
                R_InsertNewEdges     // Update AET with GET event
                (*pdrawfunc)();      //Generate spans
                D_DrawSurfaces       //Draw stuff on screen

Here are the details:

R_EdgeDrawing video

In the following video the engine was running at 1024x768. It was also slowed down and a special cvar sw_drawflat 1 was enabled to render polygons without textures but with different colors:

There are soooo many things to notice in this video:

  1. Screen is generated top to bottom this is a typical scanline algorithm.
  2. Horizontal span of pixels are not written as generated. This is done to optimized the Pentium pre-fetching cache: Spans are grouped by textureId with a mechanism called "texture chains". Spans are stored in a buffer. When the buffer is full: spans are drawn with texture chains in ONE batch.
  3. The moment when the span buffer is full and a rendition batch is triggered is obvious since the polygon rendering stops around the 50th pixels line .
  4. Spans are generated top to bottom but they are drawn bottom to top: Since spans are inserted in a texture chain at the top of the listed list, spans are drawn in the reverse order they were produced.
  5. The last batch is covering 40% of the screen while the first one covered 10%. It is because they are far less polygon there, the spans are not cuts and they cover much more area.
  6. OMG ZERO OVERDRAW during the solid world rendition step .

Precalculated Radiosity: Lightmaps

Unfortunately the nice 24bits RGB lightmaps had to be resampled to 6bits grayscale (selecting the brightest channel between R, G and B) in order to fit the palette limitation:

What is stored in the PAK archive (24bits):

After loading from disk, and resampling to 6bits:

Now all together: The face's texture give the color in the range [0,255]. This value indexes a color in the palette from pics/colormap.pcx:

The lightmaps are filtered: we end up with a value [0,63].

Now using the upper part of pics/colormap.pcx the engine can select the right palette entry: Using the texture entry value as X coordinate and the luminance*63 as Y coordinate for the final result:

Et voila:

I personally think it is awesome to manage to fake 64 gradients of 256 colors....will only 256 colors :) !

Surface subsystem

According to the previous screenshots it is obvious that surface generation is the most CPU intensive part of Quake2 (this is confirmed by the profiler results seen later in this page). Two mechanisms make surface generation affordable in terms of speed and memory consumption:

Surface subsystem: Mipmapping

When a polygon is projected in screen space, its distance 1/Z is generated. The closest vertex is used to see which mipmap level should be used. Here is an example of a lightmap and how it is filtered according to the mipmap level:

Here is a mini project I worked on in order to test Quake2's bilinear filtering quality on random images:

Following are the result of a test done on a 13x15 texels lightmap:

Mimpap level 3: Block are 2x2 texels. Mimpap level 2: Block are 4x4 texels.

Mimpap level 1: Block are 8x8 texels. Mimpap level 0: Block are 16x16 texels.

The key to understand the filtering is that everything is based on the polygons world space dimension ( width and height are called extent):

In the following drawing the polygon dimension are (3,4), the lightmap is (4,5) texels and the surface degenerated will ALWAYS be (3,4) blocks. The mipmap levels decides the texels dimensions of a block and hence the overall size of the surface in texels.

All this is done in R_DrawSurface. The mipmap level is selected thanks to an array of function pointer (surfmiptable) selecting the right rasterizing function:

    static void	(*surfmiptable[4])(void) = {


   	pblockdrawer = surfmiptable[r_drawsurf.surfmip];					        
	for (u=0 ; u<r_numhblocks; u++)						        



In the following altered engine you can see three levels of mipmap: 0 in grey, 1 in yellow and 2 in red:

Filtering is done by block, when generating the surface's block[i][j], the filterer would use the lightmap values: lightmap[i][j],lightmap[i+1][j],lightmap[i][j+1] and lightmap[i+1][j+1]: essentially using the four texels located directly at the coordinate and the three ones below and to the right. Note that is is not doing weighted interpolation but rather linear interpolation vertically first and the interpolating horizontally again on the values generated. In short, it is working exactly as the wikipedia article about bilinear filtering is describing it.

All together now :

Original Lightmap 13x15 texels. Mimpap level 0 filtered (16x16 blocks)=192x224 texels.

Surface subsystem: Caching

Even though the engine relied massively on malloc and free for memory management it still used its own memory manager for surface caching. A block of memory was initialized as soon as the rendition resolution was known:

    size = SURFCACHE_SIZE_AT_320X240; 

    pix = vid.width*vid.height;		
    if (pix > 64000)				
        size += (pix-64000)*3;	

    size = (size + 8191) & ~8191;
    sc_base = (surfcache_t *)malloc(size);
	sc_rover = sc_base;

A rover sc_rover was placed at the beginning on the block to keep track of what had been consumed so far. When the rover reached the end of the memory is was wrapping around, essentially recycling old surfaces. The amount of memory reserved can be seen in this graph:

Here is the way a new chunk of memory is allocated from the block:

    memLoc = (int)&((surfcache_t *)0)->data[size];     // Skipping size+ enough space for the memory chunk header.
    memLoc = (memLoc + 3) & ~3;                         /FCS: Round up to a multiple of 4
    sc_rover = (surfcache_t *)( (byte *)new + size);

Notice: Fast cache alignment trick (May have to go in memory system)
Notice: A header is placed on top of the requested amount of memory. Very bizarre line, using a NULL (((surfcache_t *)0)) pointer (but it is ok since it is not deferenced).

Perspective Projection of the Poor ?

Several posts on the internet suggest that Quake2 use the "perspective projection of the poor" with a simple formula and no homogeneous coordinates or matrices (following code from R_ClipAndDrawPoly):

    XscreenSpace = X / Z * XScale
    YscreenSpace = Y / Z * YScale

Where XScale and YScale are determined by the Field Of View and the screen aspect ratio.

This perspective projection is actually equivalent to what happens in OpenGL during the GL_PROJECTION + W divide phase:

    Perspective Projection:
                                                 Eye coordinate vector
                                            |      X                              
                                            |      Y
                                            |      Z
                                            |      1
            Perspective Projection Matrix   |
       XScale       0        0        0     |     XClip
          0      YScale      0        0     |     YClip
          0         0        V1       V2    |     ZClip
          0         0        -1       0     |     WClip
     Clip Coordinates:    
     ================                             XClip =  X * XScale
                                                  YClip =  Y * YScale     
                                                  ZClip =  /
                                                  WClip = -Z
     NDC Coordinate with W Divide:
                      XNDC = XClip/WClip =  X * XScale / -Z
                      YNDC = YClip/WClip =  Y * YScale / -Z                

First naive proof: Compare superposed screenshots. If we look at the code Projection of the poor ? NOPE !!

R_DrawEntitiesOnList: Sprites and Alias Entities

At this point in the rendition process, the map has been rendered to screen:

The engine also generated a 16bits z-buffer (that was written to but never read) in the process:

Note : We notice that close value are "brighter" (which is contrary of OpenGL where closer is "darker"). This is because 1/Z instead of Z is stored in the Z-buffer.

The 16bits z-buffer is stored starting at pointer d_pzbuffer :

   short *d_pzbuffer;

As stated previously 1/Z is stored using the direct application of the formula described in Michael Absrash's article "Consider the Alternatives: Quake's Hidden-Surface Removal". It can be found in D_DrawZSpans:

    zi = d_ziorigin + dv*d_zistepv + du*d_zistepu;

If you are interested in the mathematical proof that you can indeed interpolate 1/Z, here is paper by Kok-Lim Low:

The z-buffer outputed during the map rendition phase is now used as input so entities can be clipped properly properly.
A few things about animated entities (players and enemies):

In terms of lighting:

Translucency in R_DrawAlphaSurfaces

Translucency had to be done with palette indexes. I must have wrote this 10 times on this page but it only express how incredibly AWESOME I think it is.

Translucent polygon are rendered as follow:

After this, if the surface is not 100% opaque, it has to be blended in the offscreen framebuffer:

The trick is achieved using the second part of the pics/colormap.pcx image as lookup to blend a source pixel from the surface cache with a destination pixel (offscreen framebuffer):

The source input generates the X coordinate and the destination the Y coordinate. Resulting pixel is written to the offscreen framebuffer:

Next is an animation illustrating "Before" and "After" a per pixel palette blending:

R_CalcPalette: Post effect operations and Gamma correction

Not only the engine was capable of "per pixel palette blending" and "palette based color gradient selection", it could also modify its entire palette in order to indicate damages or items pick up:

If any "thinker" in the game DLL on the server side wanted to have a color blended at the end of the rendition process it only had to set the value in the RGBA variable float player_state_t.blend[4] for any player in the game. This value was transmitted over the network, copied in refdef.blend[4]'s and passed to the renderer DLL (what a transit!). Upon detection it was blended with each 256s RGB entries in the palette index. After gamma corection, the palette was then re-uploaded to the graphic card.

R_CalcPalette in r_main.c :

    // newcolor = color * alpha + blend * (1 - alpha)
    alpha = r_newrefdef.blend[3];
    premult[0] = r_newrefdef.blend[0]*alpha*255;
    premult[1] = r_newrefdef.blend[1]*alpha*255;
    premult[2] = r_newrefdef.blend[2]*alpha*255;

    one_minus_alpha = (1.0 - alpha);

    in = (byte *)d_8to24table;
    out = palette[0];
    for (i=0 ; i<256 ; i++, in+=4, out+=4)
        for (j=0 ; j<3 ; j++)
            v = premult[j] + one_minus_alpha * in[j];


Trivia : After the palette is altered via the method previously mentionned it has to be gamma corrected (in R_GammaCorrectAndSetPalette):

Gamma correction is an expensive operation, notably involving a call to pow and a division....and has to be perform for each R,G and B channel of a color entry !:

      int newValue = 255 * pow ( (value+0.5)/255.5 , gammaFactor ) + 0.5;


In total it is 3 calls to pow, 3 divisions, 6 additions and 3 multiplication for each 256 entries in the palette index: That's A LOT.
But since the input set is limited to 8 bits per channel, the full correction can be precomputed and cached in a tiny array of 256 entries:

    void Draw_BuildGammaTable (void)
        int     i, inf;
        float   g;

        g = vid_gamma->value;

        if (g == 1.0)
            for (i=0 ; i<256 ; i++)
                sw_state.gammatable[i] = i;
        for (i=0 ; i<256 ; i++)
            inf = 255 * pow ( (i+0.5)/255.5 , g ) + 0.5;
            if (inf < 0)
                inf = 0;
            if (inf > 255)
                inf = 255;
            sw_state.gammatable[i] = inf;

Hence a lookup table is used to do the trick (sw_state.gammatable) and greatly speedup the gamma correction process.

    void R_GammaCorrectAndSetPalette( const unsigned char *palette )
        int i;

        for ( i = 0; i < 256; i++ )
            sw_state.currentpalette[i*4+0] = sw_state.gammatable[palette[i*4+0]];
            sw_state.currentpalette[i*4+1] = sw_state.gammatable[palette[i*4+1]];
            sw_state.currentpalette[i*4+2] = sw_state.gammatable[palette[i*4+2]];

        SWimp_SetPalette( sw_state.currentpalette );


Note : You may think that LCD flat screen don't have the same gamma issue as old CRT...except that they are usually designed to mimic CRT behavior :/ !

Code Statistics

A little bit of Code analysis by Cloc in order to close with the software renderer shows a total of 14,874 lines of code for this module. It is a little bit more than 10% of the total line count but this is not representative of the overall effort since several design were tested before settling with this one.

    $ cloc ref_soft/
          39 text files.
          38 unique files.
           4 files ignored. v 1.53  T=0.5 s (68.0 files/s, 44420.0 lines/s)
    Language                     files          blank        comment           code
    C                               17           2459           2341           8976
    Assembly                         9           1020            880           3849
    C/C++ Header                     7            343            293           2047
    Teamcenter def                   1              0              0              2
    SUM:                            34           3822           3514          14874

The assembly optimization in the nine r_*.asm accounts for 25% of the codebase, that a pretty impressive proportion and I think quite representative of the effort required by the software renderer:
Most of the rasterization routine were hand optimized for x86 processor by Michael Abrash. Most of the Pentium optimization discussed in his "Graphics Programming Black Book" can be seen in action in those files.

Trivia : Some method names are the same in the book and Quake2 code (i.e: ScanEdges).


I tried different profilers, all of them integrated to Visual Studio 2008:

Sticking to time sampling provided VERY different results. As an example: Vtune picked up the cost of the RAM to VRAM transfer (BitBlit) but the two other missed it.
Instrumentation failed on both Intel and AMD's profiler (I am not masochist enough to investigate why) but the Profiler from VS 2008 Team edition succeeded...although I would not recommend it: The game ran at 3 frames per seconds and a 20 seconds session took 1h to analyse !

Profiling with VS 2008 Team edition :

Result is self-explanatory:

Looking closer at the ref_soft.dll time consumption:

Profiling with Intel's VTune :

A few things noticable:

And even more complete profiling of Quake2 ref_sof with VTune.

Profiling with AMD Code Analysis:

Kernel Here and ref_sof here.

Texture filtering

I have received many questions asking for directions for improving the texture filtering (to bilinear or dithered similar to Unreal (mirror)). If you want to play with this the place to go is D_DrawSpans16 in ref_soft/r_scan.c:

The starting screenspace coordinate (X,Y) are in pspan->u and pspan->v, you also have the width of the span in spancount to you can calculate what is the target screen pixel about to be generated.

As for the texture coordinate: s and t are initialized at the texture original coordinates and increased by (respectively) sstep and tstep to navigate the texture sampling.

Some people such as "Szilard Biro" have had good results applying Unreal I dithering technique:

You cand find the source code of the dithered software renderer in my Quake2 fork on github. The dithering is activated by setting the cvar sw_texfilt to 1.

Original dithering from Unreal 1 software renderer:



Fabien Sanglard @2011