CS232 Compilers, Assemblers, and Linkers

Craig Zilles, some figures from Computer Organization & Design
Now that we know what assembly language is all about, and we've looked at how code primitives (e.g., loops, if, functions) from a high-level language (HLL) (e.g., C) can be mapped to assembly, we'll elaborate a little on the process of compilation.

As shown below, the process of compilation from a HLL is a three step process. First the HLL is "compiled" to assembly, then the assembly is "assembled" into object files, and finally the object files are "linked" into an executable.

We can tell a compiler to start and stop at any of these points. Assume we have a C file named "something.c":

g++ something.c (compiles, assembles, and links to produce an executable a.out)
g++ -S something.c (produces an assembly file something.s)
g++ -c something.c (produces an object file something.o)
g++ something.s (assembles and links to produce executable a.out)
g++ something.o (links to produce executable a.out)

As we are somewhat familiar with the compiler and the assembler at this point, we'll focus on the linker. The linker enables separate compilation. As is seen in the next figure, an executable can be made up of a number of source files which can be compiled and assembled independently. The linker is responsible for putting those versions together. This has a number of advantages, including: 1) it enables distributing libraries in binary form (i.e., no source), including dynamically linked libraries (DLLs) and when you change your program you only have to recompile the file that was changed.

Because the various object files will include references to each others code and/or data, these will need to be stitched up during link time. For example in the figure below, the object file that has main includes calls to functions "sub" and "printf". After concatenating all of the object files together, the linker uses the "relocation records" to find all of the addresses that need two be filled in.

Since assembling to machine code removes all traces of labels from the code, the object file format has to keep these around in a different place; the symbol table is a list of names and their corresponding offsets in the text and data segments. An abstract UNIX object file format is shown below.

A disassembler provides support for translating back from an object file or executable.

objdump -disassemble a.out (displays assembly for program a.out)

Loaders

Before we can run an executable, we first have to load it into memory. This is done by the loader, which is generally part of the operating system. The loader does the following things:

Allocates memory for the program's execution.
Copies the text and data segments from the executable into memory.
Copies program arguments (e.g., command line arguments) onto the stack.
Initializes registers: sets $sp to point to top of stack, clears the rest.
Jumps to start routine, which: 1) copies main's arguments off of the stack, and 2) jumps to main.

MIPS specifies how our programs will get laid out in memory. The layout, consisting of 3 segments (text, data, and stack), is shown below. The "dynamic data" segment is also referred to as the "heap", the place dynamically allocated memory (from "malloc" and "new") comes from. This organization enables any division of the dynamically allocated memory between the heap and the stack. This explains why the stack grows downward.

Structs

In previous lectures, we talked about how arrays get laid out. Structures (and objects for that matter) are very similar, except they can consist of elements of different sizes. In the example below, note how padding is inserted to ensure alignment of the various components. Since in C the structure has to be laid out in the order specified in the code, bad orderings (like the one below) can lead to unnecessary padding. How would you reorder the fields to eliminate the padding?

Why's and How's of Assembly Programming

We write code in assembly in CS 232 to get an understanding of what is really going on at the machine level, not because it is a good substitute for high level languages. Actually, quite the opposite. Most programmers don't write assembly code on a regular basis, for a number of reasons, including:

Readability: HLL's are clearer than assembly.
Portability: Assembly code is ISA specific.
Productivity: One line of HLL code often takes many lines of assembly, and programmers can write a fixed number of lines of code a day.

That said, there are some good reasons to write code in assembly in the real world, including:

Expressiveness: there are some things that cannot be expressed in HLL (e.g., I/O, accesses to special registers).
Performance/code size: Humans can out code compilers in certain circumstances.

Typically, though, when assembly code is used, the whole application will not be written in assembly. Instead, a program will mostly be written in an HLL with a few calls to functions written in assembly. This makes sense for performance, because empirically most programs spend around 90% of their time in 10% of the code (the old 90/10 rule). Thus, there is little point coding the other 90% of the code, which only contributes to 10% of the performance, in assembly. Typically, it is only the "kernels" of multimedia programs (for example) that are hand coded in assembly.

In addition to linking your HLL code to functions written in assembly, some compilers provide support for "inline assembly". The __asm "function" is interpreted by the compiler, which substitutes the registers that it assigns for the variables in the HLL (e.g., a, b, and ret_val) for the place holders (e.g., %0, %1, %2) the supplied assembly code.

int
add(int a, int b) { /* return a + b */
   int ret_val;
   __asm("add %2, %0, %1", a, b, ret_val);
   return(ret_val); 
}