Doom3 BFG Source Code Review: Multi-threading (Part 2 of 4)
The Doom III engine was written from 2000 to 2004, at a time when most PCs had only one CPU. Even though the idTech4 architecture and design accounted for SMP machines, it ended up being a mono-threaded game at the last minute (see my Q&A with John Carmack). Since then, things have changed a lot. This is well summarized by the Microsoft article "Coding For Multiple Cores":
For years the performance of processors has increased steadily, and games and other programs have reaped the benefits of this increasing power without having to do anything special. The rules have changed. The performance of single processor cores is now increasing very slowly, if at all. However, the computing power available in a typical computer or console continues to grow. The difference is that most of this performance gain now comes from having multiple processor cores in a single machine, often in a single chip. The increases in available processing power are just as dramatic as in the past, but now developers have to write multi-threaded code in order to use this power.
The target platforms of Doom III BFG all feature multiple cores:
- Xbox 360 hardware features one triple-core Xenon CPU. With simultaneous multi-threading, the platform provides six logical cores.
- PS3 hardware features one PPE and six SPEs.
- PC hardware commonly features quad-core CPUs. With hyper-threading, those platforms provide eight logical cores.
As a result, idTech4 has been boosted not only with multiple threads but also with idTech5's "Job Processing System", which harnesses multi-core machines.
Trivia: The Xbox One and PS4 hardware specs have recently been announced: both will feature eight cores. Another strong indicator that, in order to remain relevant, any game developer will have to be good at multi-threaded programming.
Doom 3 BFG Threading Model
Doom3 BFG is heavily multi-threaded. On PC the game starts with three threads:
- A thread dedicated to the Renderer Backend (sends draw commands to the GPU).
- A thread dedicated to the Game Logic and Renderer Frontend (generates draw commands).
- A thread dedicated to high-frequency (250Hz) joystick sampling.
Additionally, idTech4 spawns two Worker Threads. They are there to assist any of the three sovereign threads, and they are managed by an in-house scheduler when possible.
Core idea
id Software's solution to the multi-core challenge was made public in the 2009 presentation "Beyond Programming Shaders". The two core concepts are:
- Decompose processing tasks into "jobs" that are performed by "worker" threads.
- Avoid delegating synchronization to the operating system: do it yourself with atomic operations.
Building blocks
The system relies on three items:
- Jobs
- Workers
- Synchronization
A "Job" is exactly what one would expect:
struct job_t {
    void (* function)( void * );    // Job instructions
    void *  data;                   // Job parameters
    int     executed;               // Job end marker... Not used.
};
Note: According to the comments in the code, "A job should be at least a couple of 1000 clock cycles in order to outweigh any job switching overhead. On the other hand a job should consume no more than a couple of 100,000 clock cycles to maintain a good load balance over multiple processing units."
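To make this concrete, here is a minimal, made-up job (it is not taken from the engine): the function receives the opaque data pointer and casts it back to the parameter structure it expects.

// Hypothetical job, for illustration only.
struct blurParms_t {
    float * pixels;    // buffer to process
    int     count;     // number of elements
};

void BlurJob( void * data ) {
    blurParms_t * parms = (blurParms_t *)data;
    for ( int i = 0; i < parms->count; i++ ) {
        parms->pixels[i] *= 0.5f;    // stand-in for real work
    }
}

// The job system ultimately just calls: job.function( job.data );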
A "Worker" is a thread that will remain idle waiting for a signal. When it is awoken it tries to find jobs. The workers try to avoid synchronization by
using atomic operations while trying to fetch a job from a JobList.
"Synchronization" is performed via three primitives: Signals, Mutexes and Atomic operations. The latter are favored
since they allow the engine to retain CPU focus. Those three mechanisms implementation are detailed at the bottom of this page.
Architecture
The brain of that subsystem is the ParallelJobManager. It is responsible for spawning the worker threads and creating the queues where jobs are stored.
That is the first way synchronization is avoided: divide the engine's job posting system into multiple sections that are each accessed by only one thread and therefore require no synchronization. In the engine, these queues are called idParallelJobList. In Doom III BFG, only three sections are present.
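A sketch of how those sections could be identified (the names follow idTech conventions, but the exact identifiers and values are an assumption, not a quote from the source):

// Sketch only: one identifier per job list "section". Each section is filled
// by a single thread, so adding jobs to it requires no synchronization.
enum jobListId_t {
    JOBLIST_RENDERER_FRONTEND = 0,
    JOBLIST_RENDERER_BACKEND  = 1,
    JOBLIST_UTILITY           = 2,     // miscellaneous jobs (value is illustrative)
    MAX_JOBLISTS              = 32
};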
On PC, two Worker Threads are created at startup, but more are probably created on Xbox 360 and PS3.
According to the 2009 presentation, idTech5 features many more sections:
- Collision detection
- Animation blend
- Obstacle avoidance
- Virtual texturing
- Transparency processing (foliage, particles)
- Cloth simulation
- Water surface simulation
- Detail model generation (rocks, pebbles etc.)
Note: The PowerPoint also mentions the concept of one frame of latency, but this portion of the code is not in Doom III BFG.
Job consumption:
Workers run continuously and try to "find a job". This process requires no mutexes or monitors: an atomically incremented integer distributes jobs with no overlap.
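A standalone sketch of that idea, using std::atomic instead of the engine's idSysInterlockedInteger (everything below is illustrative, not engine code):

#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Each worker atomically claims the next job index: no two workers
// can receive the same index, and no mutex is involved.
static std::atomic<int> currentJob( 0 );

void WorkerLoop( const std::vector<int> & jobs ) {
    for ( ;; ) {
        int index = currentJob.fetch_add( 1 );    // claim the next job index
        if ( index >= (int)jobs.size() ) {
            return;                               // no jobs left
        }
        std::printf( "processing job %d\n", jobs[index] );
    }
}

int main() {
    std::vector<int> jobs( 100 );
    for ( int i = 0; i < (int)jobs.size(); i++ ) {
        jobs[i] = i;
    }
    std::thread a( WorkerLoop, std::cref( jobs ) );
    std::thread b( WorkerLoop, std::cref( jobs ) );
    a.join();
    b.join();
    return 0;
}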
Usage
Since jobs are segregated into sections accessed by only one thread, no synchronization is required to add a job. However, submitting the job list to the worker system does involve a mutex. Here is an example where the renderer finds which lights generate interactions:
// tr.frontEndJobList is an idParallelJobList object.
for ( viewLight_t * vLight = tr.viewDef->viewLights; vLight != NULL; vLight = vLight->next ) {
    tr.frontEndJobList->AddJob( (jobRun_t)R_AddSingleLight, vLight );
}
tr.frontEndJobList->Submit();
tr.frontEndJobList->Wait();
Three parts:
- AddJob: No synchronization necessary; the job is added to a vector.
- Submit: Mutex synchronization; each worker thread adds the JobList to its own local ring buffer of JobLists (see the sketch after this list).
- Wait: Signal synchronization (delegated to the OS). Lets the Worker threads complete.
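A rough sketch of what the Submit step could look like, with one ring buffer per worker guarded by a single mutex (the names, types and signaling mechanism are assumptions for illustration, not the engine's actual code):

#include <mutex>
#include <condition_variable>
#include <vector>

struct JobList;                            // stands in for idParallelJobList

struct WorkerQueue {
    static const int RING_SIZE = 32;
    JobList * ring[RING_SIZE];             // ring buffer of job list pointers
    int       first = 0;                   // next slot the worker will read
    int       last  = 0;                   // next slot Submit will write
};

struct JobSystem {
    std::vector<WorkerQueue> queues;       // one queue per worker thread
    std::mutex               submitMutex;
    std::condition_variable  workAvailable;

    void Submit( JobList * list ) {
        {
            std::lock_guard<std::mutex> lock( submitMutex );
            for ( WorkerQueue & q : queues ) {
                q.ring[q.last % WorkerQueue::RING_SIZE] = list;
                q.last++;
            }
        }
        workAvailable.notify_all();        // wake any sleeping workers
    }
};

The condition variable here merely stands in for the signal the engine uses to wake its workers.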
How a Worker works
Workers are infinite loops. On each iteration, the loop checks whether more JobLists have been added to the ring buffer and, if so, copies their references to the local stack.
Local stack: The thread stack is used to store JobList addresses as an anti-stalling mechanism. If a thread fails to "lock" a JobList, it falls into RUN_STALLED mode. The stall can be recovered from by walking the local list of JobLists in order to find another JobList to visit. This way, "yielding" is avoided.
The interesting part is that everything is done without mutexes: only atomic operations are used.
Note: Avoiding mutexes is pushed far: sometimes they are not used even though they should have been for "correctness". Example: the copy from heap to stack uses lastJobList and firstJobList with no mutex. This means the copy can miss a JobList added concurrently to the ring buffer. It is wrong, but it is OK: the worker will simply stall and wait for a signal once the ring buffer operation is completed.
The infinite loop:

int idJobThread::Run() {
    threadJobListState_t threadJobListState[MAX_JOBLISTS];

    while ( !IsTerminating() ) {
        int currentJobList = 0;

        // fetch any new job lists and add them to the local list in threadJobListState
        {}

        if ( lastStalledJobList < 0 ) {
            // find the job list with the highest priority
        } else {
            // try to hide the stall with a job from a list that has equal or higher priority
        }
        currentJobList = X;

        // try running one or more jobs from the current job list
        int result = threadJobListState[currentJobList].jobList->RunJobs( threadNum, threadJobListState[currentJobList], singleJob );

        // Analyze how job running went
        if ( ( result & idParallelJobList_Threads::RUN_DONE ) != 0 ) {
            // done with this job list so remove it from the local list (threadJobListState[currentJobList])
        } else if ( ( result & idParallelJobList_Threads::RUN_STALLED ) != 0 ) {
            lastStalledJobList = currentJobList;
        } else {
            lastStalledJobList = -1;
        }
    }
}
Where jobs are run:

int idParallelJobList::RunJobs( unsigned int threadNum, threadJobListState_t & state, bool singleJob ) {
    // try to lock to fetch a new job
    if ( fetchLock.Increment() == 1 ) {
        // grab a new job
        state.nextJobIndex = currentJob.Increment() - 1;
        // release the fetch lock
        fetchLock.Decrement();
    } else {
        // release the fetch lock
        fetchLock.Decrement();
        // another thread is fetching right now so consider stalled
        return ( result | RUN_STALLED );
    }

    // Run job
    jobList[state.nextJobIndex].function( jobList[state.nextJobIndex].data );

    // if at the end of the job list we're done
    if ( state.nextJobIndex >= jobList.Num() ) {
        return ( result | RUN_DONE );
    }
    return ( result | RUN_PROGRESS );
}
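The fetchLock dance above is essentially a non-blocking try-lock built from an atomic increment: the first thread to move the counter from 0 to 1 owns the fetch, everyone else backs off and reports a stall. A minimal sketch of the same idiom with std::atomic (not the engine's code):

#include <atomic>

std::atomic<int> fetchLock( 0 );

// Returns true if this thread won the right to fetch the next job,
// false if another thread was already fetching (the caller treats this as a stall).
bool TryFetch() {
    if ( fetchLock.fetch_add( 1 ) == 0 ) {    // we moved the counter from 0 to 1
        // ... claim the next job index here ...
        fetchLock.fetch_sub( 1 );             // release the fetch lock
        return true;
    }
    fetchLock.fetch_sub( 1 );                 // back off: someone else owns the fetch
    return false;
}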
id Software Synchronization tools
id Software uses three types of synchronization mechanisms:
1. Monitors (idSysSignal):

idSysSignal is implemented with Win32 Event Objects:

Operation | Implemented with | Details
Raise | SetEvent | Sets the specified event object to the signaled state.
Clear | ResetEvent | Sets the specified event object to the nonsignaled state.
Wait | WaitForSingleObject | Waits until the specified object is in the signaled state or the time-out interval elapses.
Signals are mainly used to put a thread to sleep. Workers use idSysSignal.Wait() to remove themselves from the operating system scheduler until jobs are available.
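For reference, a minimal sketch of such a signal built directly on Win32 Event Objects (same three operations as the table above; this is not the engine's idSysSignal source):

#include <windows.h>

// Thin wrapper over a manual-reset Win32 event.
class Signal {
public:
    Signal()  { handle = CreateEvent( NULL, TRUE, FALSE, NULL ); }   // manual reset, initially non-signaled
    ~Signal() { CloseHandle( handle ); }

    void Raise() { SetEvent( handle ); }                             // wake any waiting threads
    void Clear() { ResetEvent( handle ); }                           // back to the non-signaled state
    void Wait()  { WaitForSingleObject( handle, INFINITE ); }        // sleep until raised

private:
    HANDLE handle;
};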
2. Mutexes (idSysMutex):

idSysMutex is implemented with Win32 Critical Section Objects:

Operation | Implemented with | Details
Lock | EnterCriticalSection | Waits for ownership of the specified critical section object. The function returns when the calling thread is granted ownership.
Unlock | LeaveCriticalSection | Releases ownership of the specified critical section object.
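Likewise, a minimal sketch of a mutex wrapper over a Win32 critical section (not the engine's idSysMutex source):

#include <windows.h>

// Thin wrapper over a Win32 critical section.
class Mutex {
public:
    Mutex()  { InitializeCriticalSection( &section ); }
    ~Mutex() { DeleteCriticalSection( &section ); }

    void Lock()   { EnterCriticalSection( &section ); }    // blocks until ownership is granted
    void Unlock() { LeaveCriticalSection( &section ); }    // releases ownership

private:
    CRITICAL_SECTION section;
};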
3. Atomic operations (idSysInterlockedInteger):

idSysInterlockedInteger is implemented with Win32 Interlocked Variables:

Operation | Implemented with | Details
Increment | InterlockedIncrementAcquire | Increments (increases by one) the value of the specified 32-bit variable as an atomic operation, using acquire memory ordering semantics.
Decrement | InterlockedDecrementRelease | Decrements (decreases by one) the value of the specified 32-bit variable as an atomic operation, using release memory ordering semantics.
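And a minimal sketch of an interlocked counter exposing the same two operations (not the engine's idSysInterlockedInteger source):

#include <windows.h>

// Thin wrapper over Win32 interlocked variables. On x86 the Acquire/Release
// variants compile to the same full-barrier instruction as the plain versions.
class InterlockedInteger {
public:
    InterlockedInteger() : value( 0 ) {}

    int Increment() { return InterlockedIncrementAcquire( &value ); }   // returns the new value
    int Decrement() { return InterlockedDecrementRelease( &value ); }   // returns the new value

private:
    LONG volatile value;
};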