The story of the 3dfx Voodoo1

April 4, 2019

This article is the second entry in a series about the "3D cards of the late 90s running Quake". The first entry^[1] took a look at the end of 1996 with the Rendition Vérité 1000 and its dedicated port called vQuake. Rendition had managed to beat everybody to Quake-market. For a brief period of time they had the one and only board capable of running id Software's blockbuster with hardware-acceleration.

Everything changed in January 1997, when id Software released a new version of Quake called GLQuake. Because the port was done using miniGL (a subset of the OpenGL 1.1 standard), any hardware-accelerator manufacturer could provide miniGL drivers and enter the 3D race. From that point onward, the competition would be open to everybody. The goal was to generate as many frames per second as possible. The rewards were fame and customers' money. A brief search revealed that the two authorities of the time were unambiguous about who was at the top of the mountain.

All in all there's currently hardly any doubt about it, Voodoo rules the Quake world. Now since Quake rules the gaming world, 3Dfx Voodoo is pretty much unavoidable for any gamer.

- Tom's Hardware November 30, 1997

3DFX Voodoo 1
-------------
The benchmark against which everything else is measured.

- John Carmack .plan. Feb 12, 1998^[2]

After a first peek at the specs^[3] claiming a 50 Mpixels/second fillrate figure, I was eager to immerse myself in it and find out what 3dfx had done to produce such an indisputable powerhouse.

3dfx Interactive

Ross Smith, Scott Sellers, and Gary Tarolli originally met while working at SGI^[4]. After a short stint at Pellucid where they tried to sell IrisVision boards for PC (at 1994 $4,000/piece), they started their own company with backing from Gordie Campbell's TechFarm. Headquartered in San Jose, California, 3dfx Interactive was founded in 1994.

Initially intending to design powerful hardware solutions for arcade games, 3dfx pivoted in order to design PC add-on boards. There were three reasons for that.

The price of RAM was low enough.
Starting with FastPage RAM and then with EDO RAM, the latency of RAM had improved by 30%. It could now be clocked at up to 50MHz.
3D (or pseudo-3D) games were becoming more and more popular. Success of titles such as DOOM, Descent, and Wing Commander III showed that a market for 3D acceleration cards was about to emerge.

They figured they could design something powerful, dedicated to games, and retailing in the $300-$400 range. In 1996 the company announced the SST1 architecture (named after the founders, Sellers-Smith-Tarolli-1), which was promptly licensed by several OEMs such as Diamond, Canopus, Innovision, and ColorMAX. The marketing name given to their creation was "Voodoo1" for its magic-like performance properties.

As with the V1000, the OEMs' only leverage on the cards they produced was the RAM they selected (EDO vs DRAM), the color of the resin, and the physical layout of the chips. Pretty much everything else was standardized.

Diamond Monster 3D, courtesy of vgamuseum.info.

Canopus Pure3D, courtesy of vgamuseum.info.

BIOSTAR Venus 3D, courtesy of vgamuseum.info.

ORCHID Righteous 3D, courtesy of vgamuseum.info.

What is mesmerizing when taking a look at an SST1 board are the divergences from its competitors, the Rendition Vérité 1000 and the NVidia NV1.

First of all, 3dfx had made the audacious choice to not support 2D rendering. The Voodoo1 had two VGA ports, one acting as output and the other as input. The card was designed as an add-on which took as input the output of the 2D VGA card already installed in a PC. When the user was running the operating system (DOS or Windows), the Voodoo1 was just a pass-through which did nothing but relay the signal from its VGA input to its VGA output. When entering 3D mode, the Voodoo1 took over the VGA output port and discarded the signal on its VGA input. Some boards had a mechanical switch which would generate an audible "click" when switching between 2D and 3D mode. This choice also meant the card could only do fullscreen rendering; there was no "windowed" mode.

The second remarkable aspect of the SST1 is that it was made of not one CPU but two non-programmable ASICs (Application-Specific Integrated Circuits). If you follow the bus lines, you can see that each chip, labeled "TMU" and "FBI", had its own RAM. On a 4MiB card, the RAM was divided equally, with 2MiB for the TMU to store textures and 2MiB for the FBI to store the color buffers and z-buffer, where values were stored respectively as 16-bit RGBA and 16-bit integer/half-float. A 4MiB card supported a resolution up to 640x480 (2 color buffers (640x480x2) for double-buffering + 1 depth buffer (640x480x2) = 1,843,200 bytes). Later models with 4MiB FBI RAM allowed up to 800x600 (2x800x600x2 + 800x600x2 = 2,880,000 bytes).
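The buffer arithmetic above is easy to check in a few lines of C. This is only a sketch: the helper names `fbi_bytes_needed` and `fbi_mode_fits` are made up for illustration, and it assumes nothing else competes for FBI memory.

```c
#include <stdint.h>

/* Both the color buffers and the z-buffer store 16 bits (2 bytes) per pixel. */
#define BYTES_PER_PIXEL 2u

/* Hypothetical helper: total FBI memory needed for a given video mode. */
static uint32_t fbi_bytes_needed(uint32_t width, uint32_t height,
                                 uint32_t color_buffers, uint32_t depth_buffers)
{
    return width * height * BYTES_PER_PIXEL * (color_buffers + depth_buffers);
}

/* Returns 1 if a double-buffered mode with one z-buffer fits in the
 * given amount of FBI RAM, 0 otherwise. */
static int fbi_mode_fits(uint32_t width, uint32_t height, uint32_t fbi_ram_bytes)
{
    return fbi_bytes_needed(width, height, 2u, 1u) <= fbi_ram_bytes;
}
```

Calling `fbi_mode_fits(800, 600, 2u * 1024u * 1024u)` returns 0, which matches the 640x480 ceiling of boards with 2MiB of FBI RAM.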

The SST1 Rendering pipeline

The pipeline is not explicitly detailed in the specs. My interpretation is that the life of a triangle is made of five stages.

A triangle is created and transformed on the computer's main CPU (usually a Pentium). These operations include modelview/projection matrix multiplication, culling, per-vertex perspective-divide, homogeneous coordinate clipping, and viewport transform. At the end of the process, all that remains are visible screenspace triangles (because of clipping, one triangle can result in two sub-triangles).
Triangles are sent to the Frame Buffer Interface (a.k.a. FBI) via a triangleCMD command over the PCI bus. They are converted to scanline requests issued to the Texture Mapping Unit. For each scanline element (called a fragment), the TMU performs up to four texture lookups per pixel when bilinear filtering is requested by the developer. Per-fragment perspective-divide is also done in the TMU.
The TMU sends fragments to the FBI as a textured 16-bit RGBA color value + a 16-bit z-value.
The FBI performs fragment z-buffer tests against its dedicated RAM storing the framebuffer RGBA and z values.
Lastly, a fragment is lit via its color attribute and a 64-entry fog lookup. If blending is requested, the FBI combines the incoming fragment with what was already in the color buffer.
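To make the CPU side of the first stage concrete, here is a minimal sketch of the per-vertex perspective-divide followed by the viewport transform. It is a textbook formulation, not code from the specs or from any driver: the `Vec4` and `ScreenVertex` types and the `clip_to_screen` name are invented for this example, and it assumes the vertex survived clipping (w > 0) with a lower-left screen origin.

```c
/* Clip-space position, as produced by the modelview/projection multiply. */
typedef struct { float x, y, z, w; } Vec4;

/* Screen-space vertex: pixel coordinates, depth in [0, 1], and 1/w kept
 * around for the per-fragment perspective correction done later on. */
typedef struct { float x, y, z, oow; } ScreenVertex;

/* Perspective-divide then viewport transform (assumes clip.w > 0). */
static ScreenVertex clip_to_screen(Vec4 clip, float width, float height)
{
    ScreenVertex out;
    float oow = 1.0f / clip.w;      /* one-over-w */

    /* Normalized device coordinates, each in [-1, 1]. */
    float ndc_x = clip.x * oow;
    float ndc_y = clip.y * oow;
    float ndc_z = clip.z * oow;

    /* Map [-1, 1] to [0, width], [0, height] and [0, 1]. */
    out.x   = (ndc_x * 0.5f + 0.5f) * width;
    out.y   = (ndc_y * 0.5f + 0.5f) * height;
    out.z   = ndc_z * 0.5f + 0.5f;
    out.oow = oow;
    return out;
}
```

A vertex at the center of clip space, (0, 0, 0, 1), lands at pixel (320, 240) on a 640x480 screen.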

Trivia: If you are a 3D enthusiast you probably know about the fast inverse square root code popularized by Quake 3 source code:

float Q_rsqrt(float number) {
    long i;
    float x2, y;
    const float threehalfs = 1.5f;

    x2 = number * 0.5f;
    y  = number;
    i  = * (long*) &y;    // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );                // what the fuck? 
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );     // 1st iteration
    return y;
}

In his quest^[5] to find out the origin of Q_rsqrt, Rys Sommefeldt got in touch with Gary Tarolli who mentioned having used the code since his days at SGI. It is fair to assume the trick was also used in the SST1 pipeline.
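A side note for anyone wanting to try the snippet on a modern compiler: the listing assumes `long` is 32 bits wide and reads a float through an integer pointer, neither of which is safe on today's 64-bit platforms. The variant below is my own portable rewording (the name `rsqrt_portable` is made up), using `uint32_t` and `memcpy` for the bit reinterpretation. It keeps the same magic constant and the same single Newton iteration, and it assumes a positive, finite input.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(number) for positive, finite inputs. */
static float rsqrt_portable(float number)
{
    uint32_t i;
    float y = number;
    const float x2 = number * 0.5f;

    memcpy(&i, &y, sizeof i);           /* reinterpret the float bits as an integer */
    i = 0x5f3759dfu - (i >> 1);         /* initial guess from the magic constant */
    memcpy(&y, &i, sizeof y);           /* reinterpret the bits back as a float */
    y = y * (1.5f - (x2 * y * y));      /* one Newton-Raphson iteration */
    return y;
}
```

With a single iteration the result is only approximate: `rsqrt_portable(4.0f)` comes out close to, but not exactly, 0.5.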

Does not compute

With the pipeline in mind and knowing that every component (TMU, FBI, EDO RAM) was clocked at 50MHz, it is obvious that the math does not add up to reach 50 MPixel/second. There were two problems to solve here.

First, the TMU had to read four texels to perform bilinear filtering on a texture. This means four round trips to the RAM, which would have resulted in TMU starvation and a 50/4 = 12.5 MPixel/s fillrate.

There is a second choke point at the FBI level. If z-buffer testing is enabled, an incoming fragment's z value needs to be compared to what is already in the z-buffer before the fragment is either written or discarded. If the test succeeds, the value needs to be written. These are two RAM operations which should have resulted in a halved fillrate of 50/2 = 25 MPixel/s.
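The two figures follow from the same back-of-the-envelope model: one RAM access per clock, so a stage needing N sequential accesses per pixel runs at clock/N. The sketch below encodes that naive model; the function names are mine, and taking the slower of the two stages as the pipeline bound is my own reading rather than something stated in the specs.

```c
/* Naive fillrate, in MPixels/s, of a stage that needs `accesses_per_pixel`
 * sequential RAM accesses at one access per clock. */
static double naive_fillrate(double clock_mhz, unsigned accesses_per_pixel)
{
    return clock_mhz / (double)accesses_per_pixel;
}

/* A pipeline cannot go faster than its slowest stage. */
static double pipeline_fillrate(double tmu_mpix, double fbi_mpix)
{
    return tmu_mpix < fbi_mpix ? tmu_mpix : fbi_mpix;
}
```

Without any interleaving, the model puts the TMU at 12.5 MPixel/s and the FBI at 25 MPixel/s, so the card as a whole would be stuck at a quarter of the advertised number.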

TMU 4-way interleave

The solution to the four samples problem in the TMU stage is mentioned in the SST1 specs.

The texture memory datapath is fully interleaved, which allows an individual bank to access data irrespective of the address used to access data in other banks.

- SST1 specs

It is not specified if the bus used address multiplexing or if the data and address lines were shared. Drawing it non-multiplex and non-shared makes things easier to understand.

Regardless of the details, the TMU architecture allowed it to retrieve 4 x 16-bit texels per clock. With the inputs coming at the correct rate, the TMU was able to perform per-fragment w-divide then generate a fragment z value (16-bit) and a fragment color (16-bit) which were passed down to the FBI.

To be able to query four banks at a time, texture dimensions had to be multiples of two and no adjacent texels could be stored in the same bank. The following 16x16 smiley shows the texel layout.
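The specs do not spell out how a texel is assigned to a bank. One mapping that satisfies the constraint, and the one assumed in this sketch, is to pick the bank from the parity of the texel's x and y coordinates. The names `texel_bank` and `footprint_hits_all_banks` are hypothetical.

```c
/* Assumed mapping: bank index (0..3) built from the low bit of x and y. */
static unsigned texel_bank(unsigned x, unsigned y)
{
    return (x & 1u) | ((y & 1u) << 1);
}

/* Returns 1 when the 2x2 bilinear footprint whose top-left texel is (x, y)
 * touches each of the four banks exactly once, 0 otherwise. */
static int footprint_hits_all_banks(unsigned x, unsigned y)
{
    unsigned seen = 0u;

    seen |= 1u << texel_bank(x,      y);
    seen |= 1u << texel_bank(x + 1u, y);
    seen |= 1u << texel_bank(x,      y + 1u);
    seen |= 1u << texel_bank(x + 1u, y + 1u);
    return seen == 0xFu;
}
```

Under this mapping any 2x2 neighborhood lands in four different banks, so the four reads can be issued in the same clock. An even texture width and height also keep the parity alternating when coordinates wrap around the edge, which fits the dimension requirement.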

FBI 2-way interleave

The solution to the two RAM accesses in the FBI stage is not explicitly mentioned in the specs. However, the document does mention a fillrate of 100 MPixels/second on glClear thanks to the ability to write two pixels/clock, which suggests a 2-way interleave was used there.

The FBI read and wrote pixels two at a time (2 x 1 pixel made of 16-bit color and 16-bit z = 64 bits). To do that, the 21-bit address generated two 20-bit addresses, with the least significant bit discarded, to read/write two consecutive pixels. Since the scanline algorithm read and wrote along horizontal lines going from left to right, accessing two consecutive pixels at a time worked very well.
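Here is how I picture the address split, as a hedged sketch rather than a description of the actual circuitry. The `FbiLocation` type and `fbi_locate` function are invented, and the choice of putting even pixels in bank 0 and odd pixels in bank 1 is an assumption.

```c
#include <stdint.h>

typedef struct {
    uint32_t bank_address;  /* 20-bit address presented to both banks */
    unsigned bank;          /* assumed: 0 holds even pixels, 1 holds odd pixels */
} FbiLocation;

/* Split a 21-bit pixel address into a shared 20-bit bank address
 * (least significant bit dropped) and a bank selector (that same bit). */
static FbiLocation fbi_locate(uint32_t pixel_address)
{
    FbiLocation loc;

    pixel_address &= 0x1FFFFFu;             /* keep the low 21 bits */
    loc.bank_address = pixel_address >> 1;  /* same for both pixels of a pair */
    loc.bank = pixel_address & 1u;
    return loc;
}
```

Pixels 6 and 7 of a scanline both resolve to bank address 3, one in each bank, so a single access cycle can serve the pair.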

TMU->FBI 64-bit bus

The last piece of the puzzle is the 64-bit TMU-to-FBI bus. There is next to nothing in the specs about it, but the behavior can be inferred from what the FBI consumes. Since the FBI worked two pixels at a time, it is reasonable to assume the TMU did not send fragments one by one as fast as possible but instead batched them two at a time, as 2 x (16-bit color + 16-bit z-value) = 64 bits.
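To visualize the batching, this sketch packs two fragments into one 64-bit word. The bit layout (first fragment in the low 32 bits, color below z) is purely an assumption made for the example, as are the `Fragment` type and the function names.

```c
#include <stdint.h>

typedef struct {
    uint16_t color;  /* 16-bit textured color */
    uint16_t z;      /* 16-bit depth value */
} Fragment;

/* Pack two fragments into one 64-bit bus word (assumed layout:
 * first fragment in bits 0-31, second fragment in bits 32-63). */
static uint64_t pack_fragment_pair(Fragment first, Fragment second)
{
    return  (uint64_t)first.color
         | ((uint64_t)first.z      << 16)
         | ((uint64_t)second.color << 32)
         | ((uint64_t)second.z     << 48);
}

/* Extract fragment 0 or 1 from a packed bus word. */
static Fragment unpack_fragment(uint64_t word, unsigned index)
{
    Fragment f;
    unsigned shift = (index & 1u) * 32u;

    f.color = (uint16_t)(word >> shift);
    f.z     = (uint16_t)(word >> (shift + 16u));
    return f;
}
```

One such word per clock carries exactly the two pixels the FBI consumes per access.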

Programming the Voodoo1

At its lowest level, programming the Voodoo1 was done using memory-mapped registers. The API is made of a surprisingly small set of five command registers, TRIANGLECMD (fixed-point), FTRIANGLECMD (floating-point), NOPCMD (no-op), FASTFILLCMD (buffer clear), and SWAPBUFFERCMD, associated with a load of data registers to configure blending, z-test, upload a fog color, and many more. Texture upload to VRAM was done via 8MiB of PCI write-only memory-mapped RAM.

To mitigate pipeline starvation, the card featured a memory backed command FIFO which allowed it to push thousands of commands without waiting for completion. According to John Carmack this mechanism was key to sustained performance^[6].
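The general idea of a command FIFO can be sketched with a plain ring buffer: the producer (the CPU) only stalls when the queue is full, and the consumer (the card) drains it at its own pace. This is a generic software illustration, not the SST1's actual mechanism. The `CommandFifo` type, its capacity, and the function names are all made up.

```c
#include <stdint.h>

#define FIFO_CAPACITY 4096u  /* hypothetical size, must be a power of two */

typedef struct {
    uint32_t entries[FIFO_CAPACITY];
    uint32_t head;  /* total number of commands pushed by the producer */
    uint32_t tail;  /* total number of commands popped by the consumer */
} CommandFifo;

/* Number of commands waiting (unsigned wraparound keeps this correct). */
static uint32_t fifo_count(const CommandFifo *f)
{
    return f->head - f->tail;
}

/* Producer side: returns 1 on success, 0 when full (the CPU would stall). */
static int fifo_push(CommandFifo *f, uint32_t cmd)
{
    if (fifo_count(f) == FIFO_CAPACITY)
        return 0;
    f->entries[f->head & (FIFO_CAPACITY - 1u)] = cmd;
    f->head++;
    return 1;
}

/* Consumer side: returns 1 and stores the oldest command, 0 when empty. */
static int fifo_pop(CommandFifo *f, uint32_t *cmd)
{
    if (fifo_count(f) == 0u)
        return 0;
    *cmd = f->entries[f->tail & (FIFO_CAPACITY - 1u)];
    f->tail++;
    return 1;
}
```

As long as the queue never fills up, `fifo_push` returns immediately, which is the property that lets the CPU run ahead of the rasterizer.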

Programming the Voodoo1 (for real)

For developers, programming the Voodoo1 was done through its Glide API^[7]. The design of the API was logically inspired by IRIS GL/OpenGL, with a state machine and prefixes for everything (except that it used "gr" instead of "gl" and programmers had to manage VRAM like in Vulkan).

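#include <glide.h>

int main( void ) {
   GrHwConfiguration hwconfig;

   grGlideInit();
   grSstQueryHardware(&hwconfig);   // Detect the installed Voodoo boards
   grSstSelect( 0 );                // Use the first board

   // First argument is the window handle (0 in fullscreen), then two
   // color buffers (double-buffering) and no auxiliary buffer.
   grSstWinOpen(0, GR_RESOLUTION_640x480, GR_REFRESH_60Hz,
     GR_COLORFORMAT_RGBA, GR_ORIGIN_LOWER_LEFT, 2, 0);
   grBufferClear(0, 0, 0);          // color, alpha, depth

   GrVertex A, B, C;
   // ... Init A, B, and C.

   guColorCombineFunction( GR_COLORCOMBINE_ITRGB );   // Gouraud shading
   grDrawTriangle(&A, &B, &C);

   grBufferSwap( 1 );               // Swap on the next vertical retrace

   grGlideShutdown();
   return 0;
}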

MiniGL "standard"

Even though MiniGL was a subset of the OpenGL 1.1 standard, there was never a spec for it. MiniGL was "whatever functions Quake uses". Running the objdump utility against the glquake.exe binary makes it easy to build an "official" list.

$ objdump -p glquake.exe | grep " gl"

glAlphaFunc      glDepthMask        glLoadIdentity      glShadeModel
glBegin          glDepthRange       glLoadMatrixf       glTexCoord2f
glBlendFunc      glDisable          glMatrixMode        glTexEnvf
glClear          glDrawBuffer       glOrtho             glTexImage2D
glClearColor     glEnable           glPolygonMode       glTexParameterf
glColor3f        glEnd              glPopMatrix         glTexSubImage2D
glColor3ubv      glFinish           glPushMatrix        glTranslatef
glColor4f        glFrustum          glReadBuffer        glVertex2f
glColor4fv       glGetFloatv        glReadPixels        glVertex3f
glCullFace       glGetString        glRotatef           glVertex3fv
glDepthFunc      glHint             glScalef            glViewport

If you learned OpenGL recently, you may be intrigued by function names such as glColor3f, glTexCoord2f, glVertex3f, glTranslatef, glBegin, and glEnd. These were used for something called "Immediate mode" where vertex coordinates, texture coordinates, matrix manipulation and color were specified one function call at a time.

Here is how one textured and Gouraud-shaded triangle was drawn "back in the day".

void Render(void) {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    glEnable(GL_TEXTURE_2D);
    glShadeModel(GL_SMOOTH);
    glBindTexture(GL_TEXTURE_2D, 1);  // Assume a texture was loaded in textureId=1

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(-1.0, 1.0, -1.0, 1.0, -1.0, 1.0);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
      
    glBegin(GL_TRIANGLES);
      glColor3f(1.0f, 1.0f, 1.0f);
      glTexCoord2f(0.0f, 0.0f);
      glVertex3f(-1.0f,-0.25f,0.0f);

      glColor3f(0.0f, 0.0f, 0.0f);
      glTexCoord2f(1.0f, 0.0f);
      glVertex3f(-0.5f,-0.25f,0.0f);

      glColor3f(0.5f, 0.5f, 0.5f);
      glTexCoord2f(0.0f, 1.0f);
      glVertex3f(-0.75f,0.25f,0.0f);
    glEnd();
}

GLQuake

The theoretical maximum of 50 MPixels/second fillrate should have delivered close to 50 frames per second in 640x480^[8]. However, because Quake combined two texture layers per surface (one for the color and one for the lightmap), the SST1 had to draw each frame twice, with additional blending on the second pass. As a result, paired with a Pentium 166MHz, Quake ran at 26 fps.

Lowering the resolution to 512x384 on the same machine provided a butter smooth 41 fps^[9] which was beyond anything competitors of the time could achieve.

Trivia: The SST1 was not for everybody. Some people loved their pixels and found bilinear filtering "blurry". Others were annoyed by the loss of gamma correction.

Glquake looks like shit. I know a few of you may argue here, but really, lets face it - it looks terrible, especially on NVidia cards. Its not too bad on a 3dfx board... the colors are washed out though. On a TNT2 its disgraceful; way too dark and murky.

- @Frib, Unofficial Glquake & QW Guide^[10]

3dfx Voodoo²

To say that 3dfx ruled from 1996 to 1998 would be an understatement. After the SST1, the Voodoo² doubled down on performance with 100MHz EDO RAM, ASICs clocked at 90MHz, and not one but two TMUs allowing it to draw a Quake multitextured frame (color + lighting) in a single pass^[11]. The thing was a beast and even the hardware looked gorgeous.

The fill-rate of a Voodoo² was nearly doubled to reach 90 MPixel/s. Quake benchmarks took off to an astonishing 80 fps on a Pentium II 266 MMX (compared to 56 fps with a Voodoo1) effectively maxing out both the game logic and display monitors.

Super Voodoo 2 12MB, courtesy of vgamuseum.info.

Unfortunately, the history of 3dfx took a cruel turn with the release of the Voodoo3 in 1999. As it was attempting to build its own all-in-one cards and stopped providing OEMs with its designs, the company had to face increasing competition.

The transition did not go as well as expected and Voodoo3 performance was judged disappointing compared to the GeForce 256 from NVidia which was capable of hardware T&L (the part done by the Pentium in the pipeline).

To counter NVidia, 3dfx canceled the Voodoo4 to work directly on the Voodoo5 featuring a VSA-100 (Voodoo Scalable Architecture). The plan did not go as expected when the "Napalm", as it was nicknamed, found itself facing NVidia's more powerful GeForce 2 and ATI's Radeon cards upon release. Ultimately, in December 2000, 3dfx announced the sale of its assets to NVidia; the company filed for bankruptcy in 2002.

To whoever lived through the late 90s and had the pleasure of running a Voodoo1 or Voodoo2, 3dfx remains an icon symbolizing excellence. An ode to a much deserved success achieved through audacity, exceptional talent and hard work. Thanks guys :)!

References

^ [ 1] The story of the Rendition Vérité 1000
^ [ 2] John Carmack .plan. Feb 12, 1998
^ [ 3] SST-1, HIGH PERFORMANCE GRAPHICS ENGINE FOR 3D GAME ACCELERATION
^ [ 4] 3dfx Oral History Panel
^ [ 5] Origin of Quake3's Fast InvSqrt()
^ [ 6] Tweet by John Carmack on Apr 4, 2019
^ [ 7] Glide Programming Guide
^ [ 8] Don't forget to account for the depth buffer clear (3.45ms).
^ [ 9] Comparison of Frame-rates in GLQuake Using Voodoo & Voodoo 2 3D Cards
^ [10] Frib, Unofficial Glquake & QW Guide
^ [11] VOODOO2 GRAPHICS HIGH PERFORMANCE GRAPHICS ENGINE FOR 3D GAME ACCELERATION
