February 23, 2012

SHMUP on Android: A tale of an eventful port

Six months ago I released the source code of "SHMUP": a modest indie 3D shoot'em up designed for iOS. Since it did honorably on the Apple App Store, I offered a 50/50 revenue split to anyone willing to port it to the Android Market. Two developers took on the challenge one after the other, only to give up later.

So I downloaded the Native Development Kit from Google and I did it myself. I completed the port this weekend and even released a free version.

The codebase runs on Windows, iOS, MacOS X and Android in one click. You can download the code on GitHub.


Android app on Google Play


Original version

The SHMUP engine was designed for iPhone 1 in 2009. Given the CPU/GPU limitations and the mandatory 60 frames per second, I came up with a polygon codec architecture that reduced bandwidth consumption to a minimum and granted instant visible surface determination.


Abstraction

Growing up reading id Software code was beneficial. It influenced me to have an abstraction layer between the engine and any I/O. In Shmup the following were abstracted:

Before starting the Android port this is how the kernel was linked for the three supported platforms:



For the Android version I was able to re-use the OpenGL ES renderer and the libpng texture loader, but I had to write a new sound system based on OpenSL ES and a new file system module (on Android, assets are packed in a zip file and can only be accessed via android/asset_manager.h).
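As a rough illustration of this kind of abstraction (a sketch, not the engine's actual code), the engine can talk to the file system through a table of function pointers while each platform supplies its own implementation. All names here (`filesystem_t`, `FS_Stdio`, `FS_LoadFile`) are hypothetical; an Android backend would fill the same table with functions built on android/asset_manager.h instead of stdio.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical I/O interface: the engine only ever sees this table. */
typedef struct filesystem_s {
    void* (*open)(const char* path);
    long  (*size)(void* handle);
    long  (*read)(void* handle, void* dst, long numBytes);
    void  (*close)(void* handle);
} filesystem_t;

/* stdio-backed implementation (desktop platforms). */
static void* stdio_open(const char* path) { return fopen(path, "rb"); }

static long stdio_size(void* handle)
{
    FILE* f = handle;
    long size;
    fseek(f, 0, SEEK_END);
    size = ftell(f);
    fseek(f, 0, SEEK_SET);
    return size;
}

static long stdio_read(void* handle, void* dst, long numBytes)
{
    return (long)fread(dst, 1, (size_t)numBytes, handle);
}

static void stdio_close(void* handle) { fclose(handle); }

static const filesystem_t FS_Stdio = { stdio_open, stdio_size, stdio_read, stdio_close };

/* Engine-side helper: loads a whole file through whatever backend is given.
   Returns a malloc'ed, NUL-terminated buffer (caller frees) or NULL on failure. */
static char* FS_LoadFile(const filesystem_t* fs, const char* path, long* outSize)
{
    void* handle;
    char* buffer;
    long  size;

    assert(fs != NULL && path != NULL && outSize != NULL);

    handle = fs->open(path);
    if (handle == NULL)
        return NULL;

    size = fs->size(handle);
    buffer = (size >= 0) ? malloc((size_t)size + 1) : NULL;
    if (buffer != NULL) {
        *outSize = fs->read(handle, buffer, size);
        buffer[*outSize] = '\0';
    }
    fs->close(handle);
    return buffer;
}
```

With this shape the engine calls `FS_LoadFile(&FS_Stdio, path, &size)` on desktop, and porting to a new platform means writing four small functions rather than touching the engine.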


Heap corruption

Two years of running on thousands of iOS devices led me to think that the SHMUP engine was stable and bug-free. But as Landon Dyer points out in one of his posts:

"The dirty truth of software is: It's buggy. [..] Most of the time we don't see them."

A program that does not crash or exhibit unwanted behavior is not necessarily free of faults. It is only when a fault leads to an error that we call it a "bug". When I compiled and ran the code on Android it crashed randomly: because the compiler and loader behaved differently, the faults had become errors.

Corrupting the heap means that at some point during execution your code writes to a zone it was not supposed to. This is nasty because the program only starts behaving erratically when the corrupted zone is used... usually tens of thousands of instructions later.

SHMUP had been corrupting the heap for years on thousands of devices around the world, but now it was a real problem.


Since the engine was running on Windows I had access to one of the best heap corruption trackers in the industry: Application Verifier. One of its great features is pageheap.exe. PageHeap is detailed on Microsoft's website; in full-page heap mode it changes the way malloc/calloc behave: each allocation gets its own virtual memory page, the end of the block is aligned with the end of the page, and the page that follows is inaccessible.


Hence the following code:

			
    void* block1 = malloc(1024);
    void* block2 = malloc(1024);
    void* block3 = malloc(1024);
			
			

would normally have resulted in the following memory layout, with all three blocks in the same virtual page:




With this regular layout, if a pointer starts writing in block2 but overflows and writes into block3, there is no way to track it: the heap is corrupted. But with pageheap.exe the layout looks like this:



The benefit of this new layout is that writing past a block allocated via malloc raises a page fault. This page fault can be detected and an interrupt 3 raised, which triggers the debugger to stop program execution and point immediately to the faulty code.
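The same idea can be sketched on a POSIX system with mmap and mprotect. This only illustrates the layout described above and is not how PageHeap itself is implemented; `guarded_alloc` and `guarded_free` are hypothetical names, and the sketch assumes a Linux-like system where MAP_ANONYMOUS is available.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocates `size` bytes so that the block ends exactly at a page boundary
   and the page right after it is inaccessible: writing one byte past the
   block faults immediately instead of silently corrupting a neighbour.
   Note: the returned pointer is only byte-aligned. */
static void* guarded_alloc(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t dataPages = (size + page - 1) / page;
    size_t total = (dataPages + 1) * page;   /* +1 for the guard page */
    unsigned char* base;

    assert(size > 0);

    base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Turn the last page into the guard page. */
    if (mprotect(base + dataPages * page, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }

    /* Push the block against the guard page. */
    return base + dataPages * page - size;
}

/* Releases a block from guarded_alloc; `size` must match the allocation. */
static void guarded_free(void* block, size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t dataPages = (size + page - 1) / page;
    uintptr_t guard = (uintptr_t)block + size;   /* start of the guard page */

    assert(block != NULL);
    munmap((void*)(guard - dataPages * page), (dataPages + 1) * page);
}
```

The price is at least two pages per allocation, which is why this kind of allocator is a debugging aid rather than something to ship.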

PageHeap brought two sections of code to my attention:


The first one was in the parser while reading the configuration file:

	
    LE_readToken();			
					
    if (!strcmp("impactTextureName", LE_getCurrentToken()))
    {
	    
	    LE_readToken();
	    
	    explosionTexture.path = calloc(strlen(LE_getCurrentToken()+1), sizeof(char));
	    
	    strcpy(explosionTexture.path, LE_getCurrentToken());
    }

			
			


The second one was in the event system:


			
    event_t *event ;						
    
    event = calloc(1, sizeof(event));
    
    event->type = EV_REQUEST_SCENE;
    
    event->time = simulationTime + 5000;
			
	
			

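Both are typos that static code analysis (LLVM and PVS-Studio) failed to report. In the first, the +1 sits inside the call to strlen instead of outside it, so the buffer is too small for the string and its terminator. In the second, sizeof(event) is the size of a pointer rather than the size of the event_t structure. Sometimes the simplest part of the code can be at fault.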
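Here is a sketch of both allocations with the typos corrected; the comments mark what changed. `copy_token`, `new_scene_request` and the fields of `event_t` are hypothetical stand-ins, not the engine's actual definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the engine's event structure. */
typedef struct event_s {
    int type;
    int time;
} event_t;

enum { EV_REQUEST_SCENE = 1 };   /* value chosen arbitrarily for the sketch */

/* Copies a token into a freshly allocated buffer.
   The +1 is outside strlen() so there is room for the terminating '\0'. */
static char* copy_token(const char* token)
{
    char* copy;

    assert(token != NULL);

    copy = calloc(strlen(token) + 1, sizeof(char));
    if (copy != NULL)
        strcpy(copy, token);
    return copy;
}

/* Allocates an event. sizeof(*event) is the size of the structure,
   whereas sizeof(event) would only be the size of a pointer. */
static event_t* new_scene_request(int simulationTime)
{
    event_t* event = calloc(1, sizeof(*event));

    if (event != NULL) {
        event->type = EV_REQUEST_SCENE;
        event->time = simulationTime + 5000;
    }
    return event;
}
```

Writing `sizeof(*event)` rather than `sizeof(event_t)` also keeps the allocation correct if the variable's type changes later.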


Bad char

According to the C specification, char can be either signed or unsigned depending on the platform and the compiler. Most compilers targeting x86 (Linux, Windows and Mac OS X) treat char as signed. ARM (iPhone/Android) and PowerPC systems usually make char equivalent to unsigned char.

Xcode seems to have a special flag that makes char signed on ARM iOS in order to match x86 Mac OS X, but the NDK compiler makes it unsigned!

This caused an issue in the following code, which is called when a player dies:

			
			
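    typedef struct player_s{

        char numLives;   // plain char: signedness depends on the compiler

    } player_t;

    void P_KillPlayer(player_t* player)
    {
        player->numLives--;

        // Never true when char is unsigned: 0 wraps around to 255.
        if (player->numLives < 0)
        {
            Game_Terminate();
            return;
        }

    }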
			
    
			

On Android numLives never reached -1 but instead wrapped around to 255. The graphics renderer never expected such a high value and the game crashed from within the GPU drivers.


The bottom line is that you should never ever use plain char: always be specific and use either signed char or unsigned char.


Sound system with OpenSL ES

The only decent API available for the sound system is OpenSL ES, and it is only available starting with Android 2.3.3 (currently 65% of the Android market according to Google statistics). OpenAL is not available and probably never will be (OpenMAX AL, released with Android 4.0, has nothing to do with OpenAL).
It gets the job done but the design of the API is surprising, to say the least. They decided to take an OOP approach using a non-OOP language, and the resulting code looks as follows:


Initialize the OpenSL ES engine. Notice the pointer to pointer to a struct containing a function pointer....



    SLObjectItf engineObject;
    slCreateEngine(&engineObject, 0, NULL, 0, NULL, NULL);


    SLEngineItf engineInterface;
    (*engineObject)->Realize(engineObject, SL_BOOLEAN_FALSE);
    (*engineObject)->GetInterface(engineObject, SL_IID_ENGINE, &engineInterface);

    
				



How to play a sound. Again, a pointer to a pointer to a struct containing function pointers. Every aspect of the player object is controlled via various interfaces (SL_IID_BUFFERQUEUE, SL_IID_PLAY).



    SLDataFormat_PCM pFormat = [...];
    SLDataLocator_AndroidSimpleBufferQueue pLocator = {SL_DATALOCATOR_ANDROIDSIMPLEBUFFERQUEUE, 2};
    SLDataSource audioSource = {&pLocator, &pFormat};

    SLDataSink audioSink = [..] ;
    SLObjectItf player;

    (*engineInterface)->CreateAudioPlayer(engineInterface,&player , &audioSource, &audioSink,0, NULL,NULL);
    (*player)->Realize(player, SL_BOOLEAN_FALSE);

    
    SLBufferQueueItf bufferQueueItf;
    (*player)->GetInterface(player, SL_IID_BUFFERQUEUE, &bufferQueueItf);
    (*bufferQueueItf)->Enqueue(bufferQueueItf, context->pDataBase, sizeToEnqueue);


    SLPlayItf playerInterface;
    (*player)->GetInterface(player, SL_IID_PLAY, &playerInterface);
    (*playerInterface)->SetPlayState(playerInterface,SL_PLAYSTATE_PLAYING);


				

My take is that if you design an OOP API you should use an OOP language such as C++, not C.
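To see where the `(*object)->Method(object, ...)` shape comes from, here is a stripped-down sketch of the same pattern in plain C. It does not use OpenSL ES at all; `CounterItf`, `Counter_Create` and the other names are hypothetical. An interface handle is a pointer to a pointer to a table of function pointers, and every call passes the handle back as an explicit `self`.

```c
#include <assert.h>
#include <stdlib.h>

/* An interface is a table of function pointers,
   reached through a pointer to a pointer. */
struct CounterItf_;
typedef const struct CounterItf_* const* CounterItf;

struct CounterItf_ {
    void (*Increment)(CounterItf self);
    int  (*GetValue)(CounterItf self);
};

/* Concrete object: the pointer to the table must be the first member,
   so that a CounterItf handle can be cast back to the object. */
typedef struct {
    const struct CounterItf_* vtable;
    int value;
} counter_t;

static void Counter_Increment(CounterItf self)
{
    assert(self != NULL);
    ((counter_t*)self)->value++;
}

static int Counter_GetValue(CounterItf self)
{
    assert(self != NULL);
    return ((const counter_t*)self)->value;
}

static const struct CounterItf_ counterVtable = { Counter_Increment, Counter_GetValue };

/* Creates a counter and returns its interface handle (NULL on failure). */
static CounterItf Counter_Create(void)
{
    counter_t* c = calloc(1, sizeof(*c));

    if (c == NULL)
        return NULL;
    c->vtable = &counterVtable;
    return (CounterItf)&c->vtable;
}

/* The handle points at the first member, hence at the object itself. */
static void Counter_Destroy(CounterItf self)
{
    free((void*)self);
}
```

A caller then writes `(*counter)->Increment(counter);`, which is what a C++ compiler generates behind `counter->Increment()`, except that here the indirection and the `self` argument are spelled out by hand.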


JNI Call from a NativeActivity

Even though you can go full C/C++ with the NDK, there are a few things that cannot be done from native code (like opening the Android browser with a specific URL). In this case you need to go back to the virtual machine... and it is a little bit tricky!

When creating the Native Activity you get a JNIEnv* and a JavaVM* pointer. These are supposed to allow you to call any method in the Dalvik virtual machine from C/C++. In practice it takes a lot more effort than that because:

  1. The NDK thread is not attached to the Virtual Machine.
  2. Even if you attach it to the VM, the class loader has no knowledge of your Java classes and packages.


Here is what has to be done in order to make a JNI call from a native activity:



    
    
    // className: fully qualified name of the Java class that owns the static method.
    jmethodID findMethod(ANativeActivity* activity, const char* className, const char* methodName, const char* methodSignature){

        JavaVM* vm = activity->vm;
        JNIEnv *jni;

        // Attach the native thread to the VM to obtain a usable JNIEnv.
        (**vm).AttachCurrentThread ( activity->vm , &jni , NULL ) ;

        jclass nativeActivityClass = (*jni)->FindClass(jni,"android/app/NativeActivity");

        jmethodID getClassLoader = (*jni)->GetMethodID(jni,nativeActivityClass,"getClassLoader", "()Ljava/lang/ClassLoader;");

        // Ask the activity for its class loader: it knows the application's classes.
        jobject cls = (*jni)->CallObjectMethod(jni,activity->clazz, getClassLoader);

        jclass classLoader = (*jni)->FindClass(jni,"java/lang/ClassLoader");

        jmethodID findClass = (*jni)->GetMethodID(jni,classLoader, "loadClass", "(Ljava/lang/String;)Ljava/lang/Class;");

        jstring strClassName = (*jni)->NewStringUTF(jni,className);

        jclass activityClass = (jclass)(*jni)->CallObjectMethod(jni,cls, findClass, strClassName);

        return (*jni)->GetStaticMethodID(jni, activityClass, methodName, methodSignature);

    }



I hope the next release of the NDK will have the JNIEnv attached by default and a class loader aware of the APK's Java classes. Thanks to Dennis Forbes from TewDew Software for the trick.

EDIT : Martins Mozeiko pointed out that there is an easier way:


    jmethodID findMethod(ANativeActivity* activity, const char* methodName, const char* methodSignature)
    {
        JavaVM* vm = activity->vm;
        JNIEnv *jni;

        (**vm).AttachCurrentThread ( activity->vm , &jni , NULL ) ;

        jclass activityClass = (*jni)->GetObjectClass(jni, activity->clazz);
        return (*jni)->GetStaticMethodID(jni, activityClass, methodName, methodSignature);
    }


Recompile goes undetected

The NDK is a command-line tool, which means that we use a makefile to compile the C/C++ code into a shared library but then Eclipse to deploy. The problem is that most of the time Eclipse won't detect that the C/C++ code has been recompiled, even when Eclipse is configured to use native hooks (Preferences > General > Workspace > "Refresh using native hooks").

This was frustrating, and a better approach would be to build and install the APK from the command line. I would love to see a built-in tool to do that in the next NDK release.


Overall feeling

No question about it: the Android NDK will get the job done. There is still room for improvement but overall it is a very nice framework to work with.

The real big issue is the Android wall of fragmentation:

As a result it is very hard to deliver a unified user experience.


