How a Compiler Builds Software

Yesterday I refreshed my memory of the various source file types and the tools that build each of them. This post summarizes my study notes.

I will use C as the sample language: it is higher level than assembly language but closer to the OS than Java, which makes it a good example.

Compiler for C

There are various C compilers, and the best known is probably the GNU Compiler Collection (GCC), a compiler system produced by the GNU Project. GCC is a key component of the GNU toolchain, a blanket term for the collection of programming tools produced by the GNU Project. When it builds a C program, the gcc command drives the following components in sequence:

1. C preprocessor – The C preprocessor implements the macro language used to transform C, C++ and other programs before they are compiled.

2. C compiler – The C compiler compiles source code into assembly language.

3. assembler – The assembler translates assembly language into a target file (an object file containing binary code).

4. linker – The linker links target files into a single executable program.

More details about GCC: http://en.wikipedia.org/wiki/GNU_Compiler_Collection

In the following examples, I will show you how to compile C code with GCC. My demo platform is:

luhuang@luhuang-VirtualBox:~/workspace/Hello$ uname -a
Linux luhuang-VirtualBox 3.0.0-32-generic-pae #51-Ubuntu SMP Thu Mar 21 16:09:48 UTC 2013 i686 i686 i386 GNU/Linux

Source code:

Let’s look at our material first:

main.c

#include "hello.h"

int main(int argc, char *argv[]){
	if (MAX(1,2) == 2){
		hello("Hello!");
	}
	return 0;
}

hello.c

#include <stdio.h>
#include "hello.h"

void hello(const char *string)
{
	printf("Greeting %s\n", string);
}

hello.h

extern void hello(const char *string);

#define MAX(a,b) ((a) > (b) ? (a) : (b))

Let’s see what these three files do:

1. In main.c, the first line tells the C preprocessor to include ‘hello.h’.

2. main.c invokes the MAX macro, which is defined in hello.h.

3. hello.c includes two header files. The first is stdio.h from the C standard library, which declares the standard printf function; the second is hello.h.

4. hello.h declares the function prototype of hello() and defines the macro MAX(a,b).

Ok, let’s compile them!

1. Go to the source directory:

luhuang@luhuang-VirtualBox:~/workspace/Hello$ ls
hello.c hello.h main.c

You can see it contains only the three source files described above.
2. Compile them. The -c option compiles source code into a target file. Behind the scenes, it invokes the C preprocessor, the C compiler and the assembler in sequence. In C, the basic unit of compilation is a C source file (.c) together with the header files (.h) it includes. The resulting target file has the same base name with a .o suffix. A header file does not produce a .o file of its own.

luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -c hello.c
luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -c main.c
luhuang@luhuang-VirtualBox:~/workspace/Hello$ ls
hello.c  hello.h  hello.o  main.c  main.o

3. If you want to invoke the preprocessor explicitly, use the -E option. With -E, GCC only runs the preprocessor, which processes #include directives and macros; it does no compilation. In the following example you can see that it replaces

	if (MAX(1,2) == 2){

with

if (((1) > (2) ? (1) : (2)) == 2){

Let’s see how it preprocesses main.c:
luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -E main.c

# 1 "main.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "main.c"
# 1 "hello.h" 1
extern void hello(const char *string);
# 2 "main.c" 2

int main(int argc, char *argv[]){
if (((1) > (2) ? (1) : (2)) == 2){
hello("Hello!");
}
return 0;
}

4. Let’s see how GCC compiles source code into assembly code with the -S option. As I said above, GCC works as a toolchain. That is to say, if you use -S, it runs the preprocessor (the -E step) first. See the example below:

luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -S hello.c
luhuang@luhuang-VirtualBox:~/workspace/Hello$ cat hello.s
	.file	"hello.c"
	.section	.rodata
.LC0:
	.string	"Greeting %s\n"
	.text
	.globl	hello
	.type	hello, @function
hello:
.LFB0:
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	subl	$24, %esp
	movl	$.LC0, %eax
	movl	8(%ebp), %edx
	movl	%edx, 4(%esp)
	movl	%eax, (%esp)
	call	printf
	leave
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc
.LFE0:
	.size	hello, .-hello
	.ident	"GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
	.section	.note.GNU-stack,"",@progbits
luhuang@luhuang-VirtualBox:~/workspace/Hello$

5. After step 4, we have the assembly code for our source. Let’s go further and generate its target file with the -c option:

luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -c hello.c
luhuang@luhuang-VirtualBox:~/workspace/Hello$ file hello.o
hello.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
luhuang@luhuang-VirtualBox:~/workspace/Hello$

Here I use Linux’s file command to see hello.o’s file type. You can see that gcc -c generates a 32-bit ELF target file for Intel x86; LSB means least significant byte first, i.e. little-endian. (Yes, the target code cannot run on other platforms 😦 )

Another way to inspect a target file is to list its symbols with the nm command. nm is very useful when tracking down an ‘undefined symbol’ build error. In the following example, you can see that hello.o defines hello() (T: the symbol is in the text section) and references printf() without defining it (U: undefined). This matches the source code.

luhuang@luhuang-VirtualBox:~/workspace/Hello$ nm hello.o
00000000 T hello
         U printf
luhuang@luhuang-VirtualBox:~/workspace/Hello$

Unix also provides the objdump command to retrieve detailed information about a target file. The following example uses the -x option to print all of hello.o’s headers, including its sections and symbol table:

luhuang@luhuang-VirtualBox:~/workspace/Hello$ objdump -x hello.o

hello.o:     file format elf32-i386
hello.o
architecture: i386, flags 0x00000011:
HAS_RELOC, HAS_SYMS
start address 0x00000000

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         0000001c  00000000  00000000  00000034  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000050  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000050  2**2
                  ALLOC
  3 .rodata       0000000d  00000000  00000000  00000050  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  4 .comment      0000002b  00000000  00000000  0000005d  2**0
                  CONTENTS, READONLY
  5 .note.GNU-stack 00000000  00000000  00000000  00000088  2**0
                  CONTENTS, READONLY
  6 .eh_frame     00000038  00000000  00000000  00000088  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
SYMBOL TABLE:
00000000 l    df *ABS*	00000000 hello.c
00000000 l    d  .text	00000000 .text
00000000 l    d  .data	00000000 .data
00000000 l    d  .bss	00000000 .bss
00000000 l    d  .rodata	00000000 .rodata
00000000 l    d  .note.GNU-stack	00000000 .note.GNU-stack
00000000 l    d  .eh_frame	00000000 .eh_frame
00000000 l    d  .comment	00000000 .comment
00000000 g     F .text	0000001c hello
00000000         *UND*	00000000 printf

6. Ok. Now we know how GCC preprocesses, compiles and assembles source code into target files. Let’s generate an executable from these target files. When gcc is given .o files it only needs to run the linker, and the -o option names the output file (so the stages we have seen are: -E, -S, -c, and finally linking):

luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -o hello hello.o main.o
luhuang@luhuang-VirtualBox:~/workspace/Hello$ file hello
hello: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, not stripped

Note ‘dynamically linked (uses shared libs)’ in the file information: it means the executable is dynamically linked.

7. Run it.

luhuang@luhuang-VirtualBox:~/workspace/Hello$ ./hello
Greeting Hello!

The ldd command prints an executable’s shared library dependencies:

luhuang@luhuang-VirtualBox:~/workspace/Hello$ ldd hello
	linux-gate.so.1 =>  (0xb7796000)
	libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb7603000)
	/lib/ld-linux.so.2 (0xb7797000)

libc.so.6 is the standard C library, which provides functions like printf.

8. Linux supports both statically linked and dynamically linked libraries. Let’s see how each of them works:

How a statically linked library works:

1. gcc -c hello.c compiles hello.c to hello.o.

2. ar -rs archives hello.o into the static library myhello.a.

3. ar -t lists the .o files in the archive.

4. gcc -c main.c compiles main.c to main.o.

5. gcc -o hello main.o fails with the complaint ‘undefined reference to `hello'’.

6. gcc -o hello main.o myhello.a links main.o with myhello.a. It works.

7. ldd shows the dynamic dependencies. It doesn’t list myhello.a, because the library code has been copied into the executable itself.

luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -c hello.c
luhuang@luhuang-VirtualBox:~/workspace/Hello$ ar -rs myhello.a hello.o
ar: creating myhello.a
luhuang@luhuang-VirtualBox:~/workspace/Hello$ ar -t myhello.a
hello.o
luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -c main.c
luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -o hello main.o
main.o: In function `main':
main.c:(.text+0x11): undefined reference to `hello'
collect2: ld returned 1 exit status
luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -o hello main.o myhello.a
luhuang@luhuang-VirtualBox:~/workspace/Hello$ ldd hello
	linux-gate.so.1 =>  (0xb7786000)
	libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb75f3000)
	/lib/ld-linux.so.2 (0xb7787000)
luhuang@luhuang-VirtualBox:~/workspace/Hello$

How a dynamically linked library works:

1. Compile hello.c with the -fPIC option to produce position-independent code (PIC), which can be loaded at any address in memory at run time.

2. Use the -shared option to link hello.o into the shared object myhellolib.so.

3. Show information about myhellolib.so. It is a shared object.

4. gcc -c main.c generates main.o.

5. Generate the executable. -L specifies a directory to search for libraries; here . means the current directory.

6. Run ldd to show the dynamic dependencies. You can see it complains ‘myhellolib.so => not found’.

7. Although -L tells the linker where to find libraries at build time, we still have to tell the OS where to load them from at run time. In Linux, we can set LD_LIBRARY_PATH to specify the location of shared libraries.

luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -c -fPIC hello.c
luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -shared -o myhellolib.so hello.o
luhuang@luhuang-VirtualBox:~/workspace/Hello$ file myhellolib.so
myhellolib.so: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, not stripped
luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -c main.c
luhuang@luhuang-VirtualBox:~/workspace/Hello$ gcc -o hello main.c -L . myhellolib.so
luhuang@luhuang-VirtualBox:~/workspace/Hello$ ldd hello
	linux-gate.so.1 =>  (0xb778c000)
	myhellolib.so => not found
	libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb75f9000)
	/lib/ld-linux.so.2 (0xb778d000)
luhuang@luhuang-VirtualBox:~/workspace/Hello$ ./hello
./hello: error while loading shared libraries: myhellolib.so: cannot open shared object file: No such file or directory
luhuang@luhuang-VirtualBox:~/workspace/Hello$
luhuang@luhuang-VirtualBox:~/workspace/Hello$ export LD_LIBRARY_PATH=.
luhuang@luhuang-VirtualBox:~/workspace/Hello$ ldd hello
	linux-gate.so.1 =>  (0xb7784000)
	myhellolib.so => ./myhellolib.so (0xb777f000)
	libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb75ee000)
	/lib/ld-linux.so.2 (0xb7785000)
luhuang@luhuang-VirtualBox:~/workspace/Hello$ ./hello
Greeting Hello!
luhuang@luhuang-VirtualBox:~/workspace/Hello$

Summary

Let me summarize how a compiler builds software. Basically, a compiler goes through steps like the following to convert source code into an executable:

1. Preprocess: expand #include directives and macros.

2. Compile into target files: translate the preprocessed source into assembly and assemble it into binary target files. Syntax errors are reported at this step.

3. Link: combine the target files and libraries into an executable, which the OS then loads into memory and runs.

You can also refer to the book Software Build Systems: Principles and Experience for more details and further study.