FABIEN SANGLARD'S WEBSITE



Mar 04, 2025
Why fastDOOM is fast

During the winter of 2024, I restored an IBM PS/1 486-DX2 66 MHz, "Mini-Tower", model 2168. It was the computer I always wanted as a teenager but could never afford. Words cannot do justice to the joy I felt while working on this machine.

As soon as I got something able to boot, I benchmarked the one piece of software I wanted to run.

C:\DOOM>doom.exe -timedemo demo1
timed 1710 gametics in 2783 realtics  

Doom doesn't give the fps right away. You have to do a bit of math to get the framerate: the engine runs at 35 tics per second, so in this instance that's 1710/2783*35 = 21.5 fps. An honorable performance for the best machine money could (reasonably) buy in Dec 1993 (specs, chipset, video, disk1, disk2, speedsys).
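
The math above is easy to script when there are many runs to compare. Here is a minimal Python sketch of it. The function name timedemo_fps is mine and not part of DOOM, and it assumes that realtics counts elapsed time in 1/35th of a second.

```python
TICRATE = 35  # DOOM's game logic advances 35 tics per second


def timedemo_fps(gametics: int, realtics: int) -> float:
    """Convert a 'timed X gametics in Y realtics' line into frames per second.

    Assumption: realtics is elapsed time measured in 1/35 s units, so
    gametics / realtics is the speed relative to real time.
    """
    return gametics / realtics * TICRATE


if __name__ == "__main__":
    # Numbers taken from the DOOM.EXE run above.
    print(f"{timedemo_fps(1710, 2783):.1f} fps")
```

Feeding it the fastDOOM line (1710 gametics in 1988 realtics) gives the same 30.1 that FDOOM.EXE prints by itself.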

I was resigned to playing under Ibuprofen until I heard of fastDOOM. I am usually not a fan of ports because they tend to add features without cohesion (except for the dreamy Chocolate DOOM) but I gave it a try out of curiosity.

C:\DOOM>fdoom.exe -timedemo demo1
Timed 1710 gametics in 1988 realtics. FPS: 30.1

40% faster (21.5 fps to 30.1 fps) without cutting any features[1]! On a demanding map like doom2's demo1, the gain is even higher, from 16.8 fps to 24.9 fps. That is 48% faster!

I did not suspect that DOOM had left that much on the table. Obviously shipping within one year left little time to optimize. I had to understand how this magic trick happened.

A byte of history

Before digging into fastDOOM, let's understand where the code comes from. DOOM was originally developed on NeXT workstations. The game was structured to be easy to port, with most of the code in a core surrounded by small sub-systems performing I/O.

Source: Game Engine Black Book: DOOM

During development, DOS I/Os were written by id Software. This became the commercial release of DOOM. But that version could not be open sourced in 1997 because it relied on a proprietary sound library called DMX.

What ended up being open sourced was the Linux version, cleaned up by Bernd Kreimeier when he was working on a book project to explain the engine.

A DOS version of DOOM was reconstructed by using the Linux core, Heretic I/O, and APODMX (Apogee Sound wrapper) to emulate DMX. Because Heretic used video mode 13h while DOOM used video mode Y, the graphic I/O (i_ibm.c) was reverse-engineered from DOOM.EXE disassembly. That is how the community got PCDOOM v2[2].

fastDOOM's starting point was PCDOOM v2.

                    ┌───────────────┐                              
                    │ NeXTStep DOOM │                              
                    └─────┬────┬────┘                              
                          │    │                                 
                          │    │                                 
                          │    │                                 
          ┌────────────┐  │    │  ┌──────┐      ┌─────────┐      
          │ Linux DOOM │◄─┘    └─►│ DOOM ├─────►│ Heretic │      
          └──────┬─────┘          └──────┘      └────┬────┘      
                 │                    ⁞              │           
                 │                    ▼              │           
                 │              ┌──────────┐         │           
                 └─────────────►│ PCDOOMv2 │◄────────┘           
                                └─────┬────┘                     
                                      ▼                          
                                ┌──────────┐                     
                                │ fastDOOM │                     
fastDoom genealogy              └──────────┘
──────────────────
The big performance picture

Victor "Viti95" Nieto wrote release notes to describe the performance improvement of each version, but he seemed more interested in making FDOOM.EXE awesome than in detailing how he did it.

To get the big picture of performance evolution over time, I downloaded all 52 releases of fastDOOM, PCDOOMv2, and the original DOOM.EXE, wrote a Go program to generate a RUN.BAT running -timedemo demo1 on all of them, and mounted it all with mTCP's NETDRIVE.
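
For illustration, here is roughly what such a generator looks like. This is a Python sketch rather than the Go program I actually used. The directory layout, the executable names, and the make_run_bat function are all hypothetical, and it leaves out how the timedemo results get collected.

```python
def make_run_bat(builds, demo="demo1"):
    """Return the text of a DOS batch file that timedemos every build.

    builds is a list of (directory, executable) pairs. Both are
    placeholders: the real layout of the mounted drive is not shown here.
    """
    lines = ["@echo off"]
    for directory, exe in builds:
        lines.append(f"cd {directory}")              # enter the build's folder
        lines.append(f"{exe} -timedemo {demo}")      # same flag as used above
    # DOS batch files expect CRLF line endings.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Hypothetical folders, one per executable under test.
    builds = [("D:\\DOOM", "doom.exe"), ("D:\\FD01", "fdoom.exe")]
    with open("RUN.BAT", "w", newline="") as f:
        f.write(make_run_bat(builds))
```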

I chose to timedemo DOOM.WAD with sound on and screen size = 10 (fullscreen with status bar). After several hours of shotguns and imps agony, I had run the whole suite five times and graphed the average fps with chart.js.

The first thing this graph rules out is the idea that fastDOOM's improvements were mostly due to using a modern compiler. PCDOOMv2 is built with OpenWatcom 2 but only gets a marginal improvement over DOOM.EXE.

git archeology

On top of releasing often, Viti95 displayed outstanding git discipline, where one commit does one thing and each release is tagged. fastDOOM's git history is made of 3,042 commits, which makes it possible to benchmark each feature.

I wrote another Go program to build every single commit. I will pass on the gory details of handling the many build system changes (especially from DOS to Linux). After an hour I had the ugliest program I ever wrote and 3,042 DOOM.EXE. I was pleased to see the build was almost never broken.

Graphing the file sizes shows that the early effort was to be lean by cleaning and deleting code. There are major drops with bf0e983 (build 239, where sound recording was removed), 5f38323 (build 340, where error code strings were deleted), and 8b9cac5 (build 1105, where TASM was replaced with NASM).

Going deep

Timedemoing all builds would have taken a very long time (3042 builds x 1.5 min / 60 / 24 x 3 passes ≈ 9.5 days) so I focused on the releases where most of the speed was gained. I wrote yet another Go program to generate a .BAT file running timedemo for all commits in v0.1, v0.6, v0.8, v0.9.2, and v0.9.7. I mounted 1.4 GiB of FDOOM.EXE with mTCP and ran it. It took a while because a version with 200+ commits needed about 8 hours per pass.

fastDOOM v0.1

This release featured 220 commits.

$ git log --reverse --oneline "0.1" | wc -l
     220

The MVP patch of v0.1 is without a doubt build 36 (e16bab8). The "Crispy optimization" turns status bar percentage rendering into a no-op if the values have not changed. This avoids rendering to a scrap buffer and blitting to the screen, for a total boost of 2 fps. At first I could not believe it. I assumed my toolchain had a bug. But cherry-picking this patch on PCDOOMv2 confirmed the tremendous speed gain.
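
The idea behind that patch is a plain "remember what you drew last" check. The sketch below is not the fastDOOM code (which is C); the PercentWidget class and its method names are made up for illustration.

```python
class PercentWidget:
    """A status bar number that only redraws when its value changes."""

    def __init__(self):
        self.last_drawn = None  # nothing has been drawn yet
        self.draw_calls = 0     # counts the expensive redraws

    def _draw(self, value):
        # Stand-in for the costly part: render the digits into a scrap
        # buffer, then blit that buffer to the screen.
        self.draw_calls += 1

    def update(self, value):
        """Called every frame. Returns True only if a redraw happened."""
        if value == self.last_drawn:
            return False        # unchanged: the whole update is a no-op
        self._draw(value)
        self.last_drawn = value
        return True


if __name__ == "__main__":
    health = PercentWidget()
    for value in (100, 100, 100, 95, 95):  # five frames, two distinct values
        health.update(value)
    print(health.draw_calls)
```

Health, armor, and ammo rarely change from one frame to the next, so almost every frame takes the early return.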

Next is build 167 (a9359d5) which inlines FixedDiv via macro.
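
For readers who have never met FixedDiv: DOOM stores fractional numbers as 16.16 fixed-point integers, and FixedDiv divides two of them. The Python sketch below reproduces that arithmetic as I remember it from the released Linux source (an overflow guard, then a division scaled by 65536), so treat it as an approximation. Python has no macros, so it only shows what gets computed. Turning the function into a macro saves one function call for each division the renderer performs.

```python
FRACBITS = 16
FRACUNIT = 1 << FRACBITS   # 1.0 in 16.16 fixed point
MAXINT = 0x7FFFFFFF
MININT = -0x80000000


def fixed_div(a: int, b: int) -> int:
    """Divide two 16.16 fixed-point numbers.

    Assumption: mirrors the guard used by the Linux release, which
    saturates instead of overflowing (this also covers b == 0).
    """
    if (abs(a) >> 14) >= abs(b):
        # The result would not fit in 32 bits: clamp, keeping the sign.
        return MININT if (a ^ b) < 0 else MAXINT
    # int() truncates toward zero, like a C cast from double to int.
    return int(a / b * FRACUNIT)


if __name__ == "__main__":
    three, two = 3 * FRACUNIT, 2 * FRACUNIT
    print(fixed_div(three, two) / FRACUNIT)  # 3.0 / 2.0 back in floating point
```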

Near the end, we see a series of optimizations granting 0.5 fps.

Build 207 (9bd3f20): A PSX Doom optimization which speeds up the way the BSP is traversed.
Build 212 (dc0f48e): "Inlined R_MakeSpans", which renders horizontal surfaces.

Overall this version saw a lot of code being deleted (nearly half of the commits mention a removal or deletion) which probably helped to cuddle the 486 cachelines of my machine.

$ git log --reverse --oneline "0.1" | grep -i -E "remove|delete" | wc -l
     100

Somehow, one of my patches made it to fastDOOM. Probably when I was writing the Black Book? I have zero recollection of writing this!

fastDOOM v0.6

This release featured 33 commits.

$ git log --reverse --oneline "0.5"^.."0.6" | wc -l 
      33

Among many small optimizations (hello GbaDOOM 341) are the MVP ones:

Build 342 (22819fd): Skips rendering unneeded visplanes.
Build 359 (40e0d4b): Removes a level of player pointer indirection.
Build 360 (ccd296f): Doubles down on indirection removal.
Build 369 (f29e665): Inlines the screenspace line splitter.

fastDOOM v0.8

This release featured 282 commits.

$ git log --reverse --oneline "0.7"^.."0.8" | wc -l
     282

The sound system was a bit unstable so I had to timedemo without sound and then normalize the fps. Moreover, the focus of v0.8 seems to have been the text-mode renderer, so two regressions happened at build 670 (a92c67f) and build 730 (c3f5f50), where the Crispy optimization went away.


MVPs:

Build 792 (f279b7d): One executable per renderer (FDOOM.EXE, FDOOM13H.EXE, and so on).
Build 793 (1874ee8): Disable debugging for compiler.
Build 794 (1366ebf): Compile less code whenever possible.
Build 796 (6aae724): Bring back Crispy opt.

fastDOOM v0.9.2

This release featured 110 commits.

$ git log --reverse --oneline "0.9.1"^.."0.9.2" | wc -l
     110

MVPs:

Build 1639 (ae2a951): Optimize skyflatnum comparison.
Build 1645 (0730cdc): Optimize R_DrawColumn for Mode Y.
Build 1646 (17c9e83): Cleanup R_DrawSpan code.

fastDOOM v0.9.7

This release featured 294 commits.

$ git log --reverse --oneline "0.9.6"^.."0.9.7" | wc -l
     294

Despite running the benchmark several times, I was unable to reduce the noise of this release.


MVPs:

Build 1941 (0688235): Testing x86 ASM changes.
Build 1943 (f326e73): Add CPU selection + CR2 optimization for 386SX.
Build 1944 (a836abb): Add ESP optimization for R_DrawSpan386SX.
Build 2000 (3432590): Add basecode for rendering fuzz columns in ASM.
Build 2031 (0edab46): Remove a CMP comparison each loop (Ken Silverman's optimization?).

Mode 13h vs Mode Y

fastDOOM explored many ways to make things faster, for a broad range of CPUs (386, 486, Pentium, Cyrix) and video buses (ISA, VLB, PCI). One optimization that did not work on my machine was to use video mode 13h instead of mode Y.

In mode 13h, the dispatch of data toward the four VRAM banks of the VGA is done in hardware. To the CPU, the VRAM appears as a single linear 320x200 framebuffer. The inconvenience is that you can't double buffer in VRAM, so you have to do it in RAM, which means bytes are written twice: first into the framebuffer in RAM, then a second time when sent to VRAM. Also, the engine must block on VSYNC.

Mode 13h                                                                                               
────────             RAM                   VRAM (VGA card)                 SCREEN              
             ┌───────────────────┐      ┌───────────────────┐       ┌───────────────────┐      
             │ ┌───────────────┐ │      │                   │       │                   │      
             │ │ framebuffer 1 │ │      │                   │       │                   │      
             │ └───────────────┘ │      │                   │       │                   │      
             │ ┌───────────────┐ │      │ ┌───────────────┐ │       │                   │      
    CPU ────►│ │ framebuffer 2 │ ├────► │ │framebuffer(fb)│ ├──────►│                   │      
             │ └───────────────┘ │      │ └───────────────┘ │       │                   │      
             │ ┌───────────────┐ │      │                   │       │                   │      
             │ │ framebuffer 3 │ │      │                   │       │                   │      
             │ └───────────────┘ │      │                   │       │                   │      
             └───────────────────┘      └───────────────────┘       └───────────────────┘      

Mode Y lets programmers access the VGA banks individually. This allows triple-buffering in VRAM. Moreover, it has the advantage of writing bytes once, directly into VRAM. The target bank must be manually selected by the developer via very slow OUT instructions, but that also allows duplicating pixels horizontally (which gives low-detail mode for free) by writing to two VGA banks at once via latches[3]. Another inconvenience is that it makes drawing the invisible Spectre much slower since it requires reading back from VRAM.

Mode Y                                                                                               
───────                                    VRAM (VGA card)                 SCREEN         
                                        ┌───────────────────┐       ┌───────────────────┐ 
                                        │ ┌───────────────┐ │       │                   │ 
                                        │ │fb1 | fb2 | fb3│ │       │                   │ 
                                        │ └───────────────┘ │       │                   │ 
                                        │ ┌───────────────┐ │       │                   │ 
                                        │ │fb1 | fb2 | fb3│ │       │                   │ 
    CPU ──────────────────────────────► │ └───────────────┘ ├──────►│                   │ 
                                        │ ┌───────────────┐ │       │                   │ 
                                        │ │fb1 | fb2 | fb3│ │       │                   │ 
                                        │ └───────────────┘ │       │                   │ 
                                        │ ┌───────────────┐ │       │                   │ 
                                        │ │fb1 | fb2 | fb3│ │       │                   │ 
                                        │ └───────────────┘ │       │                   │ 
                                        └───────────────────┘       └───────────────────┘    

For machines with a fast CPU and bus (100+ MHz Pentium with VLB/PCI), where video cards are less likely to handle OUT instructions well, mode 13h is better. For "slow CPUs", it is faster to write data once to VRAM via mode Y.

Anyway, Doom used mode Y.

DOOM uses 320*200*256 VGA mode, which is slightly different from MCGA mode (it would NOT run on an MCGA equiped machine). I access the frame buffer in an interleaved planar mode similar to Michael Abrash's "Mode X", but still at 200 scan lines instead of 240 (less pixels == faster update rate).

DOOM cycles between three display pages. If only two were used, it would have to sync to the VBL to avoid possible display flicker. If you look carefully at a HOM effect, you should see three distinct images being cycled between.

- John Carmack[4] (mirror)
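
The page cycling John describes boils down to a counter wrapping at three. Here is a Python sketch of it. The 0x4000 page stride is my assumption (a Mode Y page needs 320*200/4 = 16000 bytes per plane, so 0x4000 is the next convenient boundary) and the function names are made up.

```python
NUM_PAGES = 3
PAGE_STRIDE = 0x4000               # assumed spacing between pages, per plane
PAGE_BYTES = 320 * 200 // 4        # bytes one page needs in each plane


def page_offset(page: int) -> int:
    """Offset of a display page from the start of video memory."""
    return page * PAGE_STRIDE


def next_page(page: int) -> int:
    """Page to draw into after the current one has been shown."""
    return (page + 1) % NUM_PAGES


if __name__ == "__main__":
    page = 0
    for frame in range(6):
        print(frame, hex(page_offset(page)))
        page = next_page(page)     # cycles 0, 1, 2, 0, 1, 2
```

With three pages, the one being drawn is never the one being scanned out, which is why no wait on the vertical blank is needed.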

Another reason John gave me for using Mode Y back in the day is that the tools used by the graphics team (Deluxe Paint) only supported 320x200 (whereas Mode X is 320x240).

e...@agora.rdrop.com (Ed Hurtley) wrote:

>Check, please... In case you haven't hit ESC ever, the Options menu
>has a Low/High resolution toggle... Low is 320x200, High is
>640x400, with the border graphics (the score bar, menu, etc...) are
>still 320x200... (Just the same graphics files)

Low detail is 160*200 in the view screen. This is done by setting two bits in the mapmask register whenever the texturing functions are writing to video memory, causing two pixels to be set for each byte written.

ui...@freenet.Victoria.BC.CA (Ben Morris) wrote:

>John,

>You're using a planar graphics system for a bitmapped game that
>updates the entire screen at a respectable framrate on a 486/66?


Its planar, but not bit planar (THAT would stink). Pixels 0,4,8 are in plane 0, pixels 1,5,9 are in plane 1, etc.

>That's pretty incredible. I would have thought all the over-
>head for programming the VGA registers would kill that
>possibility.

The registers don't need to be programed all that much. The map mask register only needs to be set once for each vertical column, and four times for each horizontal row (I step by four pixels in the inner loop to stay on the same plane, then increment the start pixel and move to the next plane).

It is still a lot of grief, and it polutes the program quite a bit, but texture mapping directly to the video memory gives you a fair amount of extra speed (10% - 15%) on most video cards because the video writes are interleaved with main memory accesses and texture calculations, giving the write time to complete without stalling.

Going to that trouble also gets a perfect page flip, rather than the tearing you get with main memory buffering.

- John Carmack[5] (mirror)
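
The layout John describes (pixels 0,4,8 in plane 0, pixels 1,5,9 in plane 1, and two map mask bits for low detail) can be written down in a few lines. This is a Python sketch with made-up function names. It only computes addresses and mask values and touches no hardware.

```python
PLANE_WIDTH = 320 // 4  # each plane holds every fourth pixel: 80 bytes per row


def mode_y_address(x: int, y: int):
    """Return (plane, byte offset inside that plane) for screen pixel (x, y)."""
    plane = x & 3                        # pixels 0,4,8... -> plane 0, and so on
    offset = y * PLANE_WIDTH + (x >> 2)
    return plane, offset


def map_mask_high_detail(x: int) -> int:
    """Map mask value selecting the single plane that holds column x."""
    return 1 << (x & 3)


def map_mask_low_detail(column: int) -> int:
    """Map mask value for low detail, where column runs from 0 to 159.

    Column c covers screen pixels 2c and 2c+1, which sit in planes 0 and 1
    or in planes 2 and 3, so two adjacent bits are set and one byte
    written lands in both planes.
    """
    return 0b11 << ((column & 1) * 2)


if __name__ == "__main__":
    print(mode_y_address(5, 0))              # pixel 5 lives in plane 1
    print(bin(map_mask_low_detail(1)))       # planes 2 and 3 enabled
```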

Heretic was released in 1994. Hardware had evolved to make mode 13h[6][7] more appealing so Raven modified the DOOM engine to this effect. PCDOOM v2 used Heretic I/O but re-implemented the video I/O with mode Y. Finally, fastDOOM gives users the choice by providing several executables: FDOOM.EXE, FDOOM13H.EXE, and FDOOMVBD.EXE.

The DOOM press release beta (October '93) used Mode 13h, so I assume they switched to Mode Y to improve performance on slower machines (low-detail). I wonder why they didn’t also implement the so-called "potato mode", which writes four pixels with a single 8-bit write to VRAM.

In FastDoom, I reintroduced Mode 13h because Heretic/Hexen had better-optimized ASM rendering code for this mode. Later, I was able to partially port this approach to column rendering in Mode Y, which resulted in a 5% to 7% performance improvement.

Based on my testing, the best mode for 486 CPUs is the VESA direct mode (FDOOMVBD.EXE for 320x200). This mode combines the advantages of Mode Y with the optimized rendering code from Heretic while avoiding any OUT instructions—except for one to switch buffers, which executes only once per rendered frame. The only downside is that it requires a VLB or PCI graphics card with LFB enabled and has slower performance in low-detail and potato-detail modes.
- Conversation with Viti95

Viti95 elaborated further on fastDOOM mode 13h during proof-reading.

In FastDoom, Mode 13h uses a single framebuffer in RAM, which is copied to VRAM after the entire scene is rendered. Vsync is not enforced, which may result in flickering. There are two methods for copying the backbuffer to VRAM, optimized for different bus speeds. For slow buses (8-bit ISA), a differential copy method is used, transferring only modified pixels.

This approach involves many branches but is faster overall because branching is less expensive than excessive bus transfers. For faster buses (16-bit ISA, VLB, PCI, etc.), a full backbuffer copy is performed using REP MOVS instructions, which is efficient when the bus bandwidth is sufficient.
- Conversation with Viti95
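
To make the trade-off concrete, here is a Python sketch of the two copy strategies, counting how many bytes cross the bus. It is not Viti95's code. In particular, detecting modified pixels by comparing against a RAM copy of the previous frame (shadow below) is my assumption about how a differential copy can be done.

```python
def full_copy(backbuffer: bytearray, vram: bytearray) -> int:
    """Copy everything, like a REP MOVS would. Returns bytes sent over the bus."""
    vram[:] = backbuffer
    return len(backbuffer)


def differential_copy(backbuffer: bytearray, shadow: bytearray,
                      vram: bytearray) -> int:
    """Send only the pixels that differ from the previous frame.

    shadow is a RAM copy of what VRAM currently holds (an assumption of
    this sketch). Every pixel costs a compare and a branch, but only the
    changed ones cost a bus write.
    """
    writes = 0
    for i, pixel in enumerate(backbuffer):
        if shadow[i] != pixel:
            vram[i] = pixel
            shadow[i] = pixel
            writes += 1
    return writes


if __name__ == "__main__":
    size = 320 * 200
    back, shadow, vram = bytearray(size), bytearray(size), bytearray(size)
    back[1000] = 42                                # one pixel changed this frame
    print(differential_copy(back, shadow, vram))   # bus writes, differential
    print(full_copy(back, vram))                   # bus writes, full copy
```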

More optimizations which did not work

Another avenue I appreciated seeing explored is OpenWatcom's processor-specific flags (4r/4s vs 3r/3s)[8]. Both the 386 and 486 flags of wcc386 were tried, but the experiment was ultimately discontinued because the 386 version always seemed faster.

One of my goals for FastDoom is to switch the compiler from OpenWatcom v2 to DJGPP (GCC), which has been shown to produce faster code with the same source. Alternatively, it would be great if someone could improve OpenWatcom v2 to close the performance gap.
- Conversation with Viti95

Overall impression

What splendid work by Victor Nieto! If software can die from a thousand cuts, Viti95 made fastDOOM awesome with three thousand optimizations! Not only did he leverage existing improvements (crispy, psx, gba, Lee Killough), he also came up with many new ones and generated so much hype that even Ken Silverman (author of Duke3D's Build engine) came to participate[9].

I tip my beret to you Victor!

References

^ [1]Note from Viti95: Joystick and network gameplay support have been removed, so it's not a completely feature-intact port ^^ (People are still trying to convince me to bring network gameplay back).
^ [2]DOOM engine: gamesrc-ver-recreation
^ [3]Game Engine Black Book: Wolfenstein 3D
^ [4]Doom graphics modes usenet
^ [5]Doom graphics modes usenet
^ [6]Doom vs Heretic VGA performance difference
^ [7]Doom in DOS: Original vs Source Ports
^ [8]OpenWatcom documentation
^ [9]Note from Viti95: Some of Ken Silverman’s ideas and code made their way into the rendering functions for UMC Green CPUs, resulting in a significant speed boost on that hardware.

