Fabien Sanglard's non-blog
dEngine: An iOS 3D renderer source code
April 28th, 2011
Source Code Released
I've decided to release the source code of the OpenGL ES 1.0/2.0 renderers I wrote in the summer of 2009, nicknamed "dEngine". It was the first renderer to feature Shadow Mapping and Bump Mapping on iPhone at the time. Note that shadow mapping was achieved by packing the depth information in a color texture, but now that you have access to GL_OES_depth_texture you should be able to gain some speed.
The overall code quality is far from exceptional (support for only two object types, WTF?!?), but I consider it a good tutorial for OpenGL ES 2.0: you can read about bump mapping and shadow mapping with a fun example from a Doom 3 level.
The OpenGL ES 2.0 renderer features uber-shaders: depending on the material currently rendered, a shader is recompiled on the fly in order to avoid branching.
Enjoy:
A few videos
The following videos show characters from Doom 3 that I used to test the engine: the HellKnight is 2200 polygons, the rest of the visible room is 1000.
The materials all feature a diffuse map, a normal map and a specular map (up to 512x512). The shadow is generated via a shadow map: because rendering to a depth texture (GL_OES_depth_texture) is not supported on iPhone, depth values are packed in an R8G8B8A8 color texture twice the size of the screen.
Video: iPhone 3GS programmable pipeline, running at 27fps.
Video: iPhone 2G/3G fixed pipeline, running at 45fps.
Polymorphism
The rendering path is abstracted via a C struct containing function pointers a la Quake 2.
typedef struct renderer_t
{
    uchar type;                                      // which rendering path this is
    void (*Set3D)(void);
    void (*StartRendition )(void);
    void (*StopRendition )(void);
    void (*SetTexture)(unsigned int);
    void (*RenderEntities)(void);
    void (*UpLoadTextureToGpu)(texture_t* texture);
    void (*Set2D)(void);
    //...
} renderer_t;
// renderer_fixed.h
void initFixedRenderer(renderer_t* renderer);
// renderer_progr.h
void initProgrRenderer(renderer_t* renderer);
The "implementation" of every function is hidden in the .c file of each renderer; initFixedRenderer and initProgrRenderer only expose the function addresses via the pointers.
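To make the pattern concrete, here is a minimal sketch of how two renderers could fill the same struct. The struct is a trimmed-down copy, and the `Fixed_*` / `Progr_*` functions, the `RENDERER_*` constants, the call counters and `drawFrame` are all hypothetical names made up for the illustration; they are not taken from the dEngine source.

```c
#include <assert.h>
#include <stdio.h>

/* Trimmed-down, hypothetical version of the renderer interface. */
typedef struct renderer_t {
    unsigned char type;
    void (*Set3D)(void);
    void (*RenderEntities)(void);
} renderer_t;

enum { RENDERER_FIXED = 0, RENDERER_PROGR = 1 };

/* Call counters, only here so the sketch has something observable. */
static int fixedDrawCalls = 0;
static int progrDrawCalls = 0;

/* What would live in renderer_fixed.c: static functions are invisible
   outside the file, only their addresses escape through the struct. */
static void Fixed_Set3D(void)          { puts("fixed: load projection matrix"); }
static void Fixed_RenderEntities(void) { fixedDrawCalls++; }

void initFixedRenderer(renderer_t *renderer)
{
    assert(renderer != NULL);
    renderer->type           = RENDERER_FIXED;
    renderer->Set3D          = Fixed_Set3D;
    renderer->RenderEntities = Fixed_RenderEntities;
}

/* What would live in renderer_progr.c. */
static void Progr_Set3D(void)          { puts("progr: upload projection uniform"); }
static void Progr_RenderEntities(void) { progrDrawCalls++; }

void initProgrRenderer(renderer_t *renderer)
{
    assert(renderer != NULL);
    renderer->type           = RENDERER_PROGR;
    renderer->Set3D          = Progr_Set3D;
    renderer->RenderEntities = Progr_RenderEntities;
}

/* Engine code only sees the struct and never knows which path is active. */
void drawFrame(const renderer_t *renderer)
{
    renderer->Set3D();
    renderer->RenderEntities();
}
```

A caller would pick one init function once at startup, then call `drawFrame(&renderer)` every frame without any further branching on the device type.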
A few optimizations...
Texture compression is a big win, as a 32-bit-per-texel RGBA texture is a pig with no real reason to exist when working with a small display. OpenGL ES 1.1 and 2.0 do not require the GPU to support any texture compression, but the good guys at Imagination Technologies provided support for PVRTC, which brings consumption down to as low as 2 bits per pixel, with alpha support!
Vertex metadata can be slimmed down as well:
A "regular" vertex is:
Vertex elementary unit:
position     3 floats
normal       3 floats
tangent      3 floats
textureCoo   2 floats
-------------------
             44 bytes
By packing the components in "shorts" instead of "floats" via normalization, you end up having:
Vertex elementary unit:
position     3 floats
normal       3 shorts
tangent      3 shorts
textureCoo   2 shorts
-------------------
             28 bytes
It's almost as if we "compress" the data on the CPU and send it to the GPU, where it is "decompressed". Abusing normalization cuts each vertex from 44 to 28 bytes, about 36% less bandwidth, and helps to slightly improve performance.
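As a rough sketch, the two layouts can be written as C structs and the "compression" as a float-to-short conversion. The struct and function names are hypothetical, and the `s / 32767` decode is a simplification: the exact rule the GPU applies to a normalized GL_SHORT attribute depends on the OpenGL ES version, so treat `unpackSNorm16` as an approximation only.

```c
#include <assert.h>
#include <stdint.h>

/* Uncompressed layout: 11 floats = 44 bytes. */
typedef struct {
    float position[3];
    float normal[3];
    float tangent[3];
    float texCoord[2];
} vertex_full_t;

/* Packed layout: 3 floats + 8 shorts = 28 bytes. */
typedef struct {
    float   position[3];
    int16_t normal[3];
    int16_t tangent[3];
    int16_t texCoord[2];
} vertex_packed_t;

/* CPU side: map a value in [-1, 1] to a signed 16-bit integer,
   rounding to the nearest representable step. */
int16_t packSNorm16(float v)
{
    assert(v >= -1.0f && v <= 1.0f);
    return (int16_t)(v * 32767.0f + (v >= 0.0f ? 0.5f : -0.5f));
}

/* Approximation of the GPU-side expansion back to a float
   (assumed mapping, see the note above). */
float unpackSNorm16(int16_t s)
{
    return (float)s / 32767.0f;
}
```

The attribute would then be declared as a normalized short when the vertex array is set up, so the shader still receives floats in [-1, 1].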
Compiler tuning is also important. Xcode is set up by default to generate ARM binaries using the Thumb instruction set, which is 16 bits wide instead of 32. This reduces the size of the binary and the cost for Apple, but it's bad for 3D as Thumb instructions have to be translated to 32 bits.
Uncheck this option for an instant performance gain.
Framebuffer refresh can also be improved a lot with the 3.1 firmware. This is an issue I mentioned in my article about Wolfenstein for iPhone: NSTimer is an abomination and I was thrilled to find we can now use CADisplayLink to perform vsync and get an adaptive framerate (although I'm experiencing some nasty touchesMoved issues on non-2G v3.X devices; if you have any info about this, email me!).
Reducing the framebuffer color space is another way to improve performance by reducing the amount of written data. Moving from a 24-bit color space to 16 bits provides some good improvements.
CAEAGLLayer *eaglLayer = (CAEAGLLayer *)self.layer;
eaglLayer.opaque = YES;
eaglLayer.drawableProperties = [NSDictionary dictionaryWithObjectsAndKeys:
[NSNumber numberWithBool:YES],
kEAGLDrawablePropertyRetainedBacking,
//FTW
//kEAGLColorFormatRGBA8,
kEAGLColorFormatRGB565,
kEAGLDrawablePropertyColorFormat, nil];
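To put numbers on the saving, here is a small sketch of what RGB565 means per pixel and per frame. Both helper names are hypothetical. Note that RGBA8 actually stores 32 bits per pixel (24 bits of color plus 8 of alpha), so on a 320x480 screen the switch halves the bytes written per frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit channels into 5 bits of red, 6 of green and 5 of blue
   by dropping the low-order bits of each channel. */
uint16_t packRGB565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Size in bytes of one full color buffer. */
size_t framebufferBytes(size_t width, size_t height, size_t bitsPerPixel)
{
    assert(bitsPerPixel % 8 == 0);
    return width * height * (bitsPerPixel / 8);
}
```

With these helpers, `framebufferBytes(320, 480, 32)` gives 614400 bytes for RGBA8 and `framebufferBytes(320, 480, 16)` gives 307200 bytes for RGB565.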
Stating the obvious here, but reducing texture and blending mode switches is very important (forget about good perf if you do more than 60 texture changes). The material approach of the engine can be very handy in this regard.
Reducing blending of your polygons is PARAMOUNT: PowerVR performs TBDR (tile-based deferred rendering), which means that each pixel is rendered only once via hidden surface removal; blending defeats the purpose. My take is that a blended polygon is rendered regardless of the culling outcome, and it destroys perf.
And last but not least, optimize the vertex indices so GPU fetches hit the cache as much as possible.
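One common way to cut texture switches is to sort the draw list by texture before submitting it. The sketch below assumes a hypothetical `draw_item_t` with a `textureId` field; it is not how dEngine stores its materials, just an illustration of the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical draw list entry: one mesh and the texture it needs. */
typedef struct {
    unsigned int textureId;
    int          meshId;
} draw_item_t;

static int compareByTexture(const void *a, const void *b)
{
    const draw_item_t *x = a;
    const draw_item_t *y = b;
    return (x->textureId > y->textureId) - (x->textureId < y->textureId);
}

/* Group items sharing a texture so each texture is bound only once. */
void sortByTexture(draw_item_t *items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count == 0)
        return;
    qsort(items, count, sizeof *items, compareByTexture);
}

/* Number of texture binds needed to draw the list in its current order
   (the very first bind is counted too). */
size_t countTextureBinds(const draw_item_t *items, size_t count)
{
    size_t binds = 0;
    for (size_t i = 0; i < count; i++)
        if (i == 0 || items[i].textureId != items[i - 1].textureId)
            binds++;
    return binds;
}
```

For a list using textures 3, 1, 3, 1, 2 in that order, the bind count drops from 5 to 3 after sorting. Opaque geometry can be reordered freely like this; blended geometry usually cannot, since its draw order matters.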
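A simple way to judge an index ordering is to replay it through a simulated vertex cache and count the misses. The sketch below assumes a 16-entry FIFO cache, which is a made-up model chosen for illustration and not a description of the actual PowerVR hardware; `CACHE_SIZE` and `countCacheMisses` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 16 /* assumed size, not a hardware figure */

/* Replay an index list through a FIFO cache and count how many
   vertices have to be fetched and processed again. */
size_t countCacheMisses(const unsigned short *indices, size_t count)
{
    unsigned short cache[CACHE_SIZE];
    size_t used = 0;   /* number of valid entries */
    size_t next = 0;   /* slot that will be overwritten next */
    size_t misses = 0;

    assert(indices != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int hit = 0;
        for (size_t j = 0; j < used; j++) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            misses++;
            cache[next] = indices[i];
            next = (next + 1) % CACHE_SIZE;
            if (used < CACHE_SIZE)
                used++;
        }
    }
    return misses;
}
```

Two triangles sharing an edge, indexed as 0,1,2, 2,1,3, cost 4 misses instead of 6. Comparing the miss count of the same mesh before and after reordering its triangles gives a quick estimate of whether the reordering is worth it.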
Uber shader
Depending on the material properties used in a scene, the shader is recompiled at runtime and then cached. This approach reduces branching operations in the shader. I was very pleased with the result: if I stay below 10 to 15 shader switches per frame, there is no significant performance drop.
// snippet of the fragment shader
#ifdef BUMP_MAPPING
bump = texture2D(s_bumpMap, v_texcoord).rgb * 2.0 - 1.0;
lamberFactor = max(0.0,dot(lightVec, bump) );
specularFactor = max(0.0,pow(dot(halfVec,bump),materialShininess)) ;
#else
lamberFactor = max(0.0,dot(lightVec, v_normal) );
specularFactor = max(0.0,pow(dot(halfVec,v_normal),materialShininess)) ;
#endif
#ifdef SPEC_MAPPING
vec3 matTextColor = texture2D(s_specularMap, v_texcoord).rgb;
#else
vec3 matTextColor = matColorSpecular;
#endif
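The "recompile then cache" part can be sketched on the CPU side as follows. This is an assumption about how such a cache might look, not the dEngine implementation: the `MAT_*` flags, `shader_variant_t` and `getShaderVariant` are hypothetical, and the actual GL compile and link calls are replaced by a counter so the sketch runs without a GL context.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical material feature flags, one bit per #ifdef in the shader. */
enum {
    MAT_BUMP_MAPPING = 1 << 0,
    MAT_SPEC_MAPPING = 1 << 1,
    MAT_FLAG_COUNT   = 2
};

#define MAX_VARIANTS (1 << MAT_FLAG_COUNT)

typedef struct {
    int  compiled;     /* non-zero once this variant has been built */
    char source[512];  /* #define prefix followed by the shared body */
} shader_variant_t;

static shader_variant_t variantCache[MAX_VARIANTS];
static int compileCount = 0;

/* Stand-in for the shared fragment shader text. */
static const char *fragmentBody = "void main(void) { /* ... */ }\n";

/* Return the shader variant for a material, building it on first use. */
const shader_variant_t *getShaderVariant(unsigned int materialFlags)
{
    assert(materialFlags < MAX_VARIANTS);
    shader_variant_t *v = &variantCache[materialFlags];
    if (!v->compiled) {
        snprintf(v->source, sizeof v->source, "%s%s%s",
                 (materialFlags & MAT_BUMP_MAPPING) ? "#define BUMP_MAPPING\n" : "",
                 (materialFlags & MAT_SPEC_MAPPING) ? "#define SPEC_MAPPING\n" : "",
                 fragmentBody);
        /* A real renderer would compile and link v->source here. */
        v->compiled = 1;
        compileCount++;
    }
    return v;
}
```

Because the flags index the cache directly, a material that was already seen costs one array lookup, and the number of distinct programs is bounded by the number of flag combinations.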
The now obsolete depth packing into a color buffer.
I love shadow effects; I think the realism and ambiance you get totally justify the cycles and bandwidth cost. It doesn't come for free in OpenGL and it's quite ugly to do with the fixed pipeline, but I was thrilled to have it working with mobile shaders. Unfortunately, as of today, iPhones don't support GL_OES_depth_texture, which means you cannot render directly to a depth texture. The workaround is to pack a 32-bit floating point value into a 4x8-bit (RGBA) color texture:
// This is the shadowmap generator shader
const vec4 packFactors = vec4( 256.0 * 256.0 * 256.0,256.0 * 256.0,256.0,1.0);
const vec4 bitMask = vec4(0.0,1.0/256.0,1.0/256.0,1.0/256.0);
void main(void)
{
float normalizedDistance = position.z / position.w;
normalizedDistance = (normalizedDistance + 1.0) / 2.0;
vec4 packedValue = vec4(fract(packFactors*normalizedDistance));
packedValue -= packedValue.xxyz * bitMask;
gl_FragColor = packedValue;
}
This method of packing a float into bytes is pretty clever (not mine) because it accounts for the internal accuracy of any GPU (via the subtraction line) and hence can be used on any kind of GPU (PowerVR, ATI, NVidia). Gratz to whoever came up with this.
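To see what the subtraction line does, the shader arithmetic can be replayed on the CPU. The sketch below is an emulation in C with hypothetical names (`vec4_t`, `packDepth`, `unpackDepth`). The unpack step is not shown in the article; it is assumed here to be a dot product with the reciprocal factors. The 8-bit quantization done when the value is written to the texture is deliberately left out, since its rounding behavior varies between GPUs.

```c
#include <assert.h>

typedef struct { double x, y, z, w; } vec4_t;

/* Fractional part of a non-negative value, like GLSL fract(). */
static double fractPart(double v)
{
    return v - (double)(long long)v;
}

/* CPU replay of the shadowmap generator shader for depth in [0, 1). */
vec4_t packDepth(double depth)
{
    vec4_t p, q;
    assert(depth >= 0.0 && depth < 1.0);

    /* fract(packFactors * normalizedDistance) */
    p.x = fractPart(depth * 256.0 * 256.0 * 256.0);
    p.y = fractPart(depth * 256.0 * 256.0);
    p.z = fractPart(depth * 256.0);
    p.w = fractPart(depth);

    /* packedValue -= packedValue.xxyz * bitMask: each channel drops the
       part that the next, finer channel already carries. */
    q.x = p.x;
    q.y = p.y - p.x / 256.0;
    q.z = p.z - p.y / 256.0;
    q.w = p.w - p.z / 256.0;
    return q;
}

/* Assumed inverse: weight each channel by the reciprocal pack factor. */
double unpackDepth(vec4_t p)
{
    return p.x / (256.0 * 256.0 * 256.0)
         + p.y / (256.0 * 256.0)
         + p.z / 256.0
         + p.w;
}
```

After the subtraction, the w, z and y channels each hold a whole number of 1/256 steps, like the digits of the depth written in base 256, so no information is stored twice across channels. For a depth of 0.3, for example, the w channel comes out as 76/256.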
Comments (18)
When you're talking about the Ubershader, it is not a real Ubershader because you don't use "if" instructions. So, you need one glUseProgram per material. We can't batch many drawCalls with your code, but sort by material and do one drawCall per material.
Ellis,
Now, depending on the drivers, you can get some weird behaviour if you use uniforms for branching (like shader recompilation). By using #defines and caching I was sure to avoid this.
-t
your posts are always great, and this contribution is valuable. Thank you!
Daniele.
Thank you.
First of all, the way your engine is constructed is really charming; simple, robust, and elegant. ...I wish I could write something of that quality myself.
Just a question. You said somewhere you don't use NEON/VFP because they don't yield much of a performance gain in the end. Last year, I tried to build a render batch for GLES 1.1 which contains all the visible polygons, and found that massaging the polygons in the batch with SIMD yields some performance gain (I haven't tested with 2.0).
You said dEngine is for a SHMUP, but after seeing your video in the previous post, I'm curious whether you plan to implement SIMD to squeeze more power out of your engine?
Thanks ;) !
As for the SIMD part, my engine was not CPU bound but GPU bound, which is why I didn't see much improvement during my testing. During your tests the CPU must have been the bottleneck. Bottom line: if you have the time and energy to implement SIMD you should obviously do it... and benchmark ;) !
Is dEngine available under a commercial license or LGPL type of license? I would like to use the engine to create applications for Android and other devices.
Thank you,
Jamey
The license is GPL: if you use the source code you will have to publish the source code of your game.
If you want to license it under different terms, we can discuss it depending on your project.
Fabien
I admit i am a super bad programmer but i would appreciate an answer to these questions! Also do you have any knowledge of where i could learn to program some NEON, when i add more models according to instruments it seems the application is cpu bound with the skinning taking up alot of time, so i wanted to try vertex skinning or using NEON to skin objects (I read that unity managed to speed up skinning by x4 and that was over vertex shader skinning) just for the fun!
Thanks very much and sorry for the questions, i tried using google to find answers but i am losing faith in google :(!!!
Hello Rufus,
1. I am unsure I understand the question. Interleaved vertices are faster because they take advantage of the ARM RAM pre-caching: every time a float/int is read from RAM into a register, the neighboring values are also pulled into the L1/L2 cache. Since the GPU processes vertices one by one, requesting the normal/color/texture coordinate for each, the moment the vertex 3D spatial coordinate is accessed, L1/L2 is also populated with the rest of the vertex attributes (normal, texture coordinate, ...). If separate arrays were used instead, the moment the vertex 3D spatial coordinate is accessed, L1/L2 would be populated with the 3D spatial coordinates of other vertices, which is not what the GPU will use right after: the RAM pre-caching is wasted in this case.
2. Macros do indeed guarantee inlining.
3. You can find a lot of references on ARM/NEON programming (with a C library all ready to use) at http://monkeystylegames.com:
http://monkeystylegames.com/?p=89
I hope I replied to everything.
Fabien
Just click on the blue XCode icon :) !