By Fabien Sanglard
May 3rd, 2023
In this chapter, we no longer focus on the compiler driver. Instead we take a look at the first stage of the compilation pipeline, the preprocessor.
The goal of the preprocessor is to ingest one source file, resolve all its header file dependencies, resolve all macros, and output a translation unit that will be consumed by the compiler in the next stage. The preprocessor usually takes care of a source file (.c/.cc/.cpp/.m/.mm) but it is language agnostic. It can process anything, even text files, as long as it detects "directives" (commands starting with the # character).
cpp or -E?
In the early days, the preprocessor was a separate executable called cpp (C PreProcessor). Lucky us, developers have maintained it all these years, and we can still invoke it.
$ cpp hello.c -o hello.tu
Just kidding. Using verbose mode shows that it is once again the compiler driver which uses argv[0] to detect that it should invoke itself with the -E parameter to behave like a preprocessor.
$ cpp -v hello.c -o hello.tu
clang -cc1 -E -o hello.tu hello.c
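To make the argv[0] trick concrete, here is a minimal sketch of how a single binary could decide which personality to adopt. This is not clang's actual code; the function names and the rule "the program is called cpp" are assumptions made for the illustration.

```c
#include <string.h>

/* Returns the last path component of argv[0], e.g. "/usr/bin/cpp" -> "cpp". */
static const char *program_name(const char *argv0) {
    const char *slash = strrchr(argv0, '/');
    return slash ? slash + 1 : argv0;
}

/* Hypothetical dispatch rule: behave like a preprocessor when invoked as "cpp". */
int runs_as_preprocessor(const char *argv0) {
    return strcmp(program_name(argv0), "cpp") == 0;
}
```

A driver built this way would call runs_as_preprocessor(argv[0]) at startup and, when it returns 1, act as if -E had been passed.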
Preprocessed output is conventionally stored with the extension .i, .ii, .mi, or even .mii. For simplicity, we always use .tu.
A lot of work is done by cpp! So much in fact that the preprocessor is a bottleneck in big projects. Just to get an idea, see how the six lines in hello.c become a behemoth 748-line translation unit.
$ wc -l hello.c
6 hello.c
$ cpp hello.c > hello.tu
$ wc -l hello.tu
748 hello.tu
C++ modules should solve this problem. They are not available yet but we should have them soon, right after hell freezes over.
Let's look inside a translation unit.
$ cpp hello.c > hello.tu
$ cat hello.tu
# 328 "/usr/include/stdio.h" 3 4
extern int printf (const char *__restrict __format, ...);
... // many hundred more lines
# 2 "hello.c" 2
main() {
  printf("hello, world\n");
}
Each fragment of code is preceded by a linemarker of the form # linenum filename flags, which makes it possible to trace which file the fragment came from. This allows the compiler to issue error messages with accurate line numbers.
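The standard #line directive has the same effect as a linemarker and can be tried by hand. In the sketch below, the file name virtual_header.h and the two helper functions are invented for the demonstration.

```c
/* #line tells the compiler which line and file the NEXT line came from,
   just like the "# linenum filename" markers emitted by cpp. */
#line 100 "virtual_header.h"
int reported_line(void) { return __LINE__; }         /* the compiler sees line 100 */
const char *reported_file(void) { return __FILE__; } /* line 101, same file name */
```

Any diagnostic issued for these two functions would point at virtual_header.h, lines 100 and 101, regardless of where the text physically lives.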
All directives aimed at the preprocessor are prefixed with #. The names are usually self-explanatory. Among many other features, you can include files, declare macros, and perform conditional compilation.
#include <iostream>
#define System S s;s
#define public
#define static
#define void int
#define main(x) main()
struct F{void println(char* s){std::cout << s << std::endl;}};
struct S{F out;};

public static void main(String[] args) {
  System.out.println("Hello World!");
}
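On a less mischievous note, here is a small sketch of the two features used most in day-to-day code: conditional compilation and function-like macros. ENABLE_LOGGING, LOG_LEVEL and SQUARE are hypothetical names chosen for the example.

```c
/* ENABLE_LOGGING is a hypothetical switch: define it before this point
   to make the preprocessor keep the first branch instead of the second. */
#ifdef ENABLE_LOGGING
#define LOG_LEVEL 2
#else
#define LOG_LEVEL 0
#endif

/* Function-like macro, expanded textually before the compiler runs. */
#define SQUARE(x) ((x) * (x))

int log_level(void) { return LOG_LEVEL; }
int square_of(int n) { return SQUARE(n); }
```

Only one of the two LOG_LEVEL definitions ever reaches the compiler; the other branch is dropped from the translation unit.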
A handy feature of the pre-processor is the ability to receive command-line parameters to define values via the -D flag. Let's build a modified hello world.
// defined_return_code.c
int main() {
return RETURN_VALUE;
}
Let's compile it while defining RETURN_VALUE with -DRETURN_VALUE=3.
$ clang -DRETURN_VALUE=3 defined_return_code.c
$ ./a.out ; echo $?
3
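Without -D, the file above does not compile since RETURN_VALUE is unknown. A common way to make such a parameter optional is to provide a fallback, as in this sketch (the default of 0 and the function name return_code are arbitrary choices).

```c
/* Fall back to 0 when the build does not pass -DRETURN_VALUE=... */
#ifndef RETURN_VALUE
#define RETURN_VALUE 0
#endif

int return_code(void) { return RETURN_VALUE; }
```

Compiled as-is, return_code() yields 0; compiled with -DRETURN_VALUE=3, the #ifndef block is skipped and it yields 3.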
Macros can also be used to emulate struct inheritance.
/* VLC_COMMON_MEMBERS : members common to all basic vlc objects */
#define VLC_COMMON_MEMBERS        \
  const char *psz_object_type;    \
                                  \
  /* Messages header */           \
  char *psz_header;               \
  int i_flags;

struct libvlc_int_t {
  VLC_COMMON_MEMBERS
  /* Everything Else */
};
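The same idea, reduced to a self-contained sketch: two unrelated structs share their first fields through a macro, and a second macro reads the shared field from either of them. All names here (COMMON_MEMBERS, decoder, window, OBJECT_TYPE) are made up and are not taken from VLC.

```c
/* Hypothetical macro in the spirit of VLC_COMMON_MEMBERS:
   fields shared by every object. */
#define COMMON_MEMBERS       \
    const char *object_type; \
    int flags;

struct decoder { COMMON_MEMBERS int bitrate; };
struct window  { COMMON_MEMBERS int width; int height; };

/* Works on a pointer to any struct that contains COMMON_MEMBERS. */
#define OBJECT_TYPE(obj) ((obj)->object_type)

const char *decoder_type(const struct decoder *d) { return OBJECT_TYPE(d); }
const char *window_type(const struct window *w)   { return OBJECT_TYPE(w); }
```

Adding a field to COMMON_MEMBERS adds it to every "derived" struct at once, which is the whole point of the trick.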
When C was created, in the 70s, memory was severely limited. So much so that it constrained compilers to emit instructions after a single pass over a source file. To achieve one pass emission, the language designers pushed the constraint on the programmer.
All functions and variables must be declared before using them. Their definition could come later.
int mul(int x, int y); // This is a declaration
int mul(int x, int y) { return x * y;} // This is a definition (and also a declaration)
extern int i; // This is a declaration
int i = 0 ; // This is a definition (and also a declaration)
class A; // This is a (forward) declaration
class A {}; // This is a definition (and also a declaration)
Let's see what happens when we disregard this constraint.
// var_err.c
int main() {
  return v;
}

int v = 0;
In this example, despite variable v being defined three lines later, the compiler will emit an error when main attempts to use it.
$ clang var_err.c
var_err.c:2:10: error: use of undeclared identifier 'v'
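One way to satisfy the compiler without moving the definition is to declare the variable first. A minimal sketch follows; the accessor read_v and the value 42 are arbitrary.

```c
extern int v;                  /* declaration: v exists somewhere and is an int */

int read_v(void) { return v; } /* fine: v has been declared above */

int v = 42;                    /* definition, placed after the use */
```

The compiler can emit the code for read_v in one pass because the type of v is already known when it is used.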
In the case of function invocation, the compiler not only needs to know the return type but also the parameters a function expects (a.k.a. its signature). It does not matter if the actual function body (definition) comes later, as long as the parameters and their types are known when the call site must be emitted.
// bad_fibonacci.c

int main() {
    return fibonacci(10);
}

int fibonacci(int n) {
    if (n <= 1)
        return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}
$ clang -c bad_fibonacci.c
bad_fibonacci.c:4:12: error: implicit declaration of function 'fibonacci' is invalid in C99
bad_fibonacci.c did not declare the function fibonacci before using it, which resulted in an error.
// good_fibonacci.c

int fibonacci(int n); // declaration

int main() {
    return fibonacci(10);
}

int fibonacci(int n) {
    if (n <= 1)
        return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}
Adding the declaration allows the compiler to work in one pass.
$ clang -c good_fibonacci.c // It worked
Defining everything before its first use is simple but inconvenient to follow. Sometimes it is plainly impossible, as when two functions call each other.
int function1(int x) {
  if (x)
    return function2(x - 1);
  return 1;
}

int function2(int x) {
  if (x)
    return function1(x - 1);
  return 2;
}
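Within a single file, the cycle can be broken with one forward declaration. The sketch below uses a hypothetical is_even/is_odd pair rather than the functions above.

```c
int is_odd(unsigned n);   /* forward declaration breaks the cycle */

int is_even(unsigned n) { return n == 0 ? 1 : is_odd(n - 1); }
int is_odd(unsigned n)  { return n == 0 ? 0 : is_even(n - 1); }
```

When the compiler reaches the call to is_odd inside is_even, it already knows the signature, even though the body comes one line later.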
The solution is to adopt a convention where declarations are put in header files and the definitions in the source files. This way, programmers are free to organize their source code as they please.
// foo.h
int mul(int x, int y);
int sub(int x, int y);

// bar.h
int div(int x, int y);
int add(int x, int y);

// foo.c
#include "foo.h"
#include "bar.h"

int mul(int x, int y) {
  int c = 0;
  while(y--) // mul with add!
    c = add(c, x);
  return c;
}

int sub(int x, int y) {
  return x - y;
}

// bar.c
#include "bar.h"
#include "foo.h"

int div(int x, int y) {
  int c = 0;
  while(x >= y) { // div with sub!
    x = sub(x, y);
    c++;
  }
  return c;
}

int add(int x, int y) {
  return x + y;
}
We can verify that this technique makes sense by looking at cpp outputs.
$ cpp foo.c > foo.tu
$ cpp bar.c > bar.tu
Notice how all comments have been removed and the macros resolved. All that remains is pure code. And of course, all functions are declared before being used.
// foo.tu
int mul(int x, int y);
int sub(int x, int y);
int div(int x, int y);
int add(int x, int y);

int mul(int x, int y) {
  int c = 0;
  while(y--)
    c = add(c, x);
  return c;
}

int sub(int x, int y) {
  return x - y;
}

// bar.tu
int div(int x, int y);
int add(int x, int y);
int mul(int x, int y);
int sub(int x, int y);

int div(int x, int y) {
  int c = 0;
  while(x >= y) {
    x = sub(x, y);
    c++;
  }
  return c;
}

int add(int x, int y) {
  return x + y;
}
So far, it looks like the header system works well. Each source file becomes a translation unit with all declarations at the top. But this technique has a flaw if a header ends up being included more than once. Let's take the example of a mini game-engine project.
// engine.h

struct World {
};

// ai.h

#include "engine.h"

void think(World& world);

// render.h

#include "engine.h"

void render(World& world);

// engine.cc

#include "engine.h"
#include "render.h"
#include "ai.h"

void hostFrame(World& world) {
  think(world);
  render(world);
}

// ai.cc

#include "ai.h"

void think(World& world) {
}

// render.cc

#include "render.h"

void render(World& world) {
}
If we attempt to generate each object file, the sub-systems ai.cc and render.cc compile fine but engine.cc throws an error.
$ clang -c -o render.o render.cc
$ clang -c -o ai.o ai.cc
$ clang -c -o engine.o engine.cc
In file included from engine.cc:4:
In file included from ./render.h:3:
./engine.h:3:8: error: redefinition of 'World'
struct World {
       ^
engine.cc:3:10: note: './engine.h' included multiple times, additional include site here
#include "engine.h"
         ^
./render.h:3:10: note: './engine.h' included multiple times, additional include site here
#include "engine.h"
         ^
./engine.h:3:8: note: unguarded header; consider using #ifdef guards or #pragma once
struct World {
       ^
In file included from engine.cc:5:
In file included from ./ai.h:3:
./engine.h:3:8: error: redefinition of 'World'
struct World {
       ^
engine.cc:3:10: note: './engine.h' included multiple times, additional include site here
#include "engine.h"
         ^
./ai.h:3:10: note: './engine.h' included multiple times, additional include site here
#include "engine.h"
         ^
./engine.h:3:8: note: unguarded header; consider using #ifdef guards or #pragma once
struct World {
       ^
2 errors generated.
Inspecting the resulting TUs with cpp shows the problem.
$ cpp engine.cc
struct World {
};
struct World {
};
void render(World& world);
struct World {
};
void think(World& world);
void hostFrame(World& world) {
  think(world);
  render(world);
}

$ cpp ai.cc
struct World {
};
void think(World& world);
void think(World& world) {
}

$ cpp render.cc
struct World {
};
void render(World& world);
void render(World &world) {
}
engine.cc includes engine.h. However, engine.cc also includes render.h and ai.h, which in turn each include engine.h. In the final cpp-ed translation unit, engine.h is included three times and the struct World is defined three times as well.
The solution to multiple inclusion and inclusion cycles is to use include guards or a pragma guard. The difference between the two is that #pragma once is not part of the standard (although widely supported).
// engine.h
#pragma once // Pragma guard

struct World {
};

// ai.h
#ifndef AI_H // Header guard
#define AI_H

#include "engine.h"

void think(World& world);

#endif // AI_H

// render.h
#ifndef RENDER_H // Header guard
#define RENDER_H

#include "engine.h"

void render(World& world);

#endif // RENDER_H

// engine.cc
#include "engine.h"
#include "render.h"
#include "ai.h"

void hostFrame(World& world) {
  think(world);
  render(world);
}

// ai.cc
#include "ai.h"

void think(World& world) {
}

// render.cc
#include "render.h"

void render(World& world) {
}
Since we now prevent multiple inclusion of a header in the same TU, we can compile the whole project.
$ clang -c -o render.o render.cc
$ clang -c -o ai.o ai.cc
$ clang -c -o engine.o engine.cc
$
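To see a guard at work without juggling several files, the sketch below pastes the same guarded header twice into one file, which is what the preprocessor effectively does on a double #include. The names WORLD_H, entity_count and count_entities are invented for the illustration.

```c
/* First "inclusion": WORLD_H is not defined yet, so the body is kept. */
#ifndef WORLD_H
#define WORLD_H
struct World { int entity_count; };
#endif

/* Second "inclusion": WORLD_H is now defined, so the body is skipped
   and struct World is not redefined. */
#ifndef WORLD_H
#define WORLD_H
struct World { int entity_count; };
#endif

int count_entities(const struct World *world) { return world->entity_count; }
```

Removing the #ifndef/#define/#endif lines brings back the redefinition error shown earlier.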
As we alluded to earlier while wc-ing the outputs of cpp, the volume resulting from #include is huge. It is even worse in C++, where the 6 lines of hello.cc turned into 44,065 lines, a whopping 7,344-fold increase.
It is a non-trivial amount of work to parse all this text, even with modern Threadripper CPUs. Build time can be reduced by using pre-compiled headers.
// all_header.h
#include "engine.h"
#include "ai.h"
#include "render.h"
A precompiled header is a super header which includes all other headers and is stored in binary form.
$ clang -cc1 all_header.h -emit-pch -o all_header.pch
With this approach, the source code does not need to #include anything anymore.
// engine.cc
void hostFrame(World& world) {
  think(world);
  render(world);
}

// ai.cc
void think(World& world) {
}

// render.cc
void render(World &world) {
}
Compiling requires passing the path of the precompiled header to the driver.
$ clang -v -include-pch all_header.pch -c render.cc ai.cc engine.cc
By default, the preprocessor first attempts to locate the target of #include directives in the same directory as the source file. If that fails, the preprocessor goes through the "header search path". Let's take the example of a simple hello world project which uses an include for the string value passed to printf.
// hello_with_include.c
#include "stdio.h"
#include "hello_with_include.h"

int main() {
  printf(MESSAGE);
  return 0;
}

// include_folder/hello_with_include.h
#define MESSAGE "Hello World!\n"
If the header is not in the same directory, it cannot be found by the pre-processor. We get an error.
$ find .
hello_with_include.c
include_folder/hello_with_include.h
$ clang hello_with_include.c
hello_with_include.c:2:10: fatal error: 'hello_with_include.h' file not found
#include "hello_with_include.h"
         ^~~~~~~~~~~~~~~~~~~~~~
The algorithm used to walk the header search path is quite elaborate but well described on the GNU cpp documentation page. To fix our example, we can add a directory to the search path via -I.
$ clang -Iinclude_folder hello_with_include.c
There are many more flags that can be passed to the driver to impact the header search path. Among them are -iquote and -isystem, whose impact varies depending on whether an #include directive uses quotes (") or angled brackets (< >). There is even a --sysroot parameter which defines a whole toolchain root, featuring both the header search path and the library linker search path.
Some headers are not provided via flags. These come with the compiler and are automatically added to the header search path. It is the case of stddef.h which provides, among other things, the size_t definition and the NULL macro.
$ cat someheader.h
#include "stddef.h"
$ gcc -E someheader.h
# 1 "/usr/lib/gcc/aarch64-linux-gnu/11/include/stddef.h" 1 3 4
$ clang -E someheader.h
# 1 "/usr/lib/llvm-14/lib/clang/14.0.0/include/stddef.h" 1 3
$
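As a quick illustration of what stddef.h brings in, here is a small sketch that relies on nothing else. The function text_length is a made-up helper, not a library function.

```c
#include <stddef.h>  /* size_t and NULL come from the compiler-provided header */

/* Counts characters up to the terminator; returns 0 for a NULL pointer. */
size_t text_length(const char *s) {
    size_t n = 0;
    if (s == NULL)
        return 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

Because the header ships with the compiler itself, this file preprocesses even on a system where no C library headers are installed.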
Build systems compile source files over and over again. An obvious optimization is to re-use outputs from previous runs if their inputs did not change. But because of the header search path and the undeclared preprocessor dependencies, it is hard to build a reliable dependency graph.
This problem can be solved by asking the pre-processor which files were accessed while preprocessing. With clang, for example, the flag -MD requests the preprocessor to output the dependencies required to generate a translation unit. A further option, -MF, specifies the file to write the output to. Note that these options can be fed to the compiler driver, which will forward them to the pre-processor.
$ cat hello.c
#include <stdio.h>
int main() {
printf("Hello, world!\n");
return 0;
}
$ clang -MD -MF hello.d -c -o hello.o hello.c
$ cat hello.d
hello.o: hello.c /usr/include/stdio.h /usr/include/_types.h
/usr/include/sys/_types.h /usr/include/sys/cdefs.h
/usr/include/machine/_types.h /usr/include/i386/_types.h
/usr/include/secure/_stdio.h /usr/include/secure/_common.h
ninja leverages these dependency outputs to create a dependency graph. After the first compilation, the dependency output is parsed. On each subsequent build, file modification timestamps are checked to re-build only what has changed.
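To give a feel for how little is needed to consume such a file, here is a rough sketch that counts the prerequisites of a single rule. It is not how ninja does it; count_prerequisites is a hypothetical helper, and it assumes paths contain neither spaces, colons, nor backslashes.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts the prerequisites in a Makefile-style rule such as
   "hello.o: hello.c /usr/include/stdio.h". Backslash-newline
   continuations are treated as plain whitespace. */
size_t count_prerequisites(const char *rule) {
    size_t count = 0;
    int in_token = 0;

    /* Skip the target: everything up to and including the first ':'. */
    while (*rule != '\0' && *rule != ':')
        rule++;
    if (*rule == ':')
        rule++;

    for (; *rule != '\0'; rule++) {
        int separator = isspace((unsigned char)*rule) || *rule == '\\';
        if (separator) {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1;   /* first character of a new path */
            count++;
        }
    }
    return count;
}
```

A real build system would store each path and compare its modification time against the output's instead of merely counting.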
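cpp will not warn programmers if they neglect to keep their headers tidy. As much as possible, try to keep each header free of includes it does not itself need, and make each source file directly include every header it relies on.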
The following header system exhibits the problems when these two rules are not followed.
// utils.h
#pragma once
#include <stdlib.h>
char* getBuffer(int size);

// main.c
#include "utils.h"
#include <stdio.h>

int main() {
  char* buffer = getBuffer(10);
  // do stuff ...
  free(buffer);
  return 0;
}

// utils.c
#include "utils.h"

char* getBuffer(int size) {
  return (char*)calloc(size, sizeof(char));
}
Header utils.h includes stdlib.h but it is wasteful. All source files including utils.h now also have to include stdlib.h. Moreover, the fact that utils.c uses calloc is an implementation detail. Let's move this line.
// utils.h
#pragma once
char* getBuffer(int size);

// main.c
#include "utils.h"
#include <stdio.h>

int main() {
  char* buffer = getBuffer(10);
  // do stuff ...
  free(buffer);
  return 0;
}

// utils.c
#include "utils.h"
#include <stdlib.h>

char* getBuffer(int size) {
  return (char*)calloc(size, sizeof(char));
}
Moving the stdlib.h include from the .h file to the .c file keeps the implementation details private. Let's see what happens when we try to compile.
$ clang -o main main.c utils.c
main.c:2:3: warning: implicit declaration of function 'free' is invalid in C99 [-Wimplicit-function-declaration]
1 warning generated.
/usr/bin/ld: /tmp/main-3866ce.o: in function `main':
main.c:(.text+0x20): undefined reference to `free'
The problem is that main.c has a transitive dependency on stdlib.h. The program compiled because utils.h included it. As soon as that include was removed, the translation unit originating from main.c failed to build. The solution is to make all source files self-reliant, without transitive header dependencies.
// utils.h
#pragma once
char* getBuffer(int size);

// main.c
#include "utils.h"
#include <stdlib.h>
#include <stdio.h>

int main() {
  char* buffer = getBuffer(10);
  // do stuff ...
  free(buffer);
  return 0;
}

// utils.c
#include "utils.h"
#include <stdlib.h>

char* getBuffer(int size) {
  return (char*)calloc(size, sizeof(char));
}
Modern IDEs automatically suggest a header to include if it is missing. This is a double-edged sword because the project may become tied to a specific library without the programmer noticing. There are many C libraries (libc, STL versions, POSIX, Windows ...) and it is a good idea to know which header belongs to what.
It is especially important if you are developing cross-platform. Header unistd.h for example, which defines POSIX functions, does not exist on Windows.
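One way to cope with a header that may or may not exist is to ask the preprocessor. The sketch below assumes a compiler supporting __has_include (part of C23, and an extension in recent GCC and Clang); HAVE_UNISTD_H, page_size and the 4096 fallback are arbitrary choices for the example.

```c
/* Probe for the POSIX header only when the compiler can answer the question. */
#if defined(__has_include)
#  if __has_include(<unistd.h>)
#    include <unistd.h>
#    define HAVE_UNISTD_H 1
#  endif
#endif

/* Returns the memory page size, or a hypothetical 4096-byte default
   when POSIX is not available. */
long page_size(void) {
#ifdef HAVE_UNISTD_H
    long size = sysconf(_SC_PAGESIZE);
    return size > 0 ? size : 4096;
#else
    return 4096;
#endif
}
```

The nested #if matters: on a compiler without __has_include, the inner line is never evaluated, so it cannot trigger a syntax error.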