Computer architecture is a field with many subareas, most of which are beyond the scope of this course. One part of that field is of some importance to the average programmer: the Instruction Set Architecture, or ISA. ISAs are the interface design between hardware and software and define the set of things that the programmer can ask the hardware to do.
From the software side, an ISA is a set of instructions with two representations: machine code and assembly. Machine code is a compact binary representation of the instructions; assembly is a less-compact textual representation of the same information. The process of converting assembly to machine code is called assembling and is essentially reversible; that is, a disassembler can recover the assembly instructions (though not comments or label names) from the machine code. Because of this reversibility, the distinction between assembly and machine code is not very important in practice, and it is fairly common to use the two terms informally as if they were synonyms.
Compilation is the process of converting programming language source code into assembly and is in general a lossy, irreversible operation. For example, for-loops and while-loops are typically translated to the same assembly, making recovery of the original construct from the assembly guesswork at best.
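A minimal sketch of the loop example: the two functions below (`sum_for` and `sum_while` are names invented here for illustration) compute the same result, and a compiler will typically emit the same or nearly the same assembly for both, so the assembly alone does not reveal which loop the programmer wrote.

```c
/* Two hypothetical functions, written only for illustration. */

/* Sums 0 + 1 + ... + (n-1) with a for-loop. */
int sum_for(int n) {
    int total = 0;
    for (int i = 0; i < n; i += 1) {
        total += i;
    }
    return total;
}

/* The same computation with a while-loop. */
int sum_while(int n) {
    int total = 0;
    int i = 0;
    while (i < n) {
        total += i;
        i += 1;
    }
    return total;
}
```

Compiling a file containing these with a command such as `gcc -S` and comparing the output for the two functions is one way to check this on a particular compiler; the exact result depends on the compiler and its optimization settings.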
There are multiple ISAs in common use today. As of 2024, the two most prominent families of ISAs are x86-64 (also called amd64) and ARM (which is a collection of several loosely-related ISAs; aarch64 is a popular one). x86-64 is common in devices manufactured by Intel and AMD for servers, workstations, desktops, and notebook computers and has evolved in a backwards-compatible way since 1978. ARM is common in devices manufactured by Qualcomm, Samsung, and Apple for phones, embedded computing, and low-power portable computers. NVidia, AMD, and Intel also make commonly-used ISAs for graphics processors, which are out of scope for this class.
Many years ago a university professor wanted to propose an alternative to the then-dominant x86 ISA and dubbed x86 a complex instruction set computer (CISC), as opposed to the reduced instruction set computers (RISC) that the professor was advocating. These terms caught on, and most non-x86-64 ISAs advertise themselves as being RISC. The terms have become something of a shibboleth for distinguishing computer architecture insiders from outsiders, and you’re likely to run into them at some point during your career.
As a software developer and hardware user, I have never encountered a situation where the CISC/RISC distinction mattered to me.
In principle, an ISA could be designed in many different ways, but in practice CPU ISAs tend to have the same basic components.
Each instruction is defined by an operation it performs and typically a few operands. Three forms of operand are common:
x86-64 tends to prefer two-operand instructions, like x += y. ARM tends to prefer three-operand instructions, like x = y + z. Both have assembly language syntaxes that put the operation first, followed by the list of all operands, like add x0, x1, x2.
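The difference between the two operand styles can be sketched with a toy register file in C. Everything here (the `ToyCpu` type, its eight registers, and the function names) is hypothetical and does not model any real ISA; it only shows that a three-operand add can be expressed with two-operand instructions at the cost of an extra copy.

```c
#include <stdint.h>

/* A toy register file; the size 8 and all names here are hypothetical. */
typedef struct {
    int64_t reg[8];
} ToyCpu;

/* Two-operand style: dst = dst + src (like x += y). */
void add2(ToyCpu *cpu, int dst, int src) {
    cpu->reg[dst] += cpu->reg[src];
}

/* Three-operand style: dst = a + b (like x = y + z). */
void add3(ToyCpu *cpu, int dst, int a, int b) {
    cpu->reg[dst] = cpu->reg[a] + cpu->reg[b];
}

/* Register-to-register copy. */
void mov(ToyCpu *cpu, int dst, int src) {
    cpu->reg[dst] = cpu->reg[src];
}

/* x = y + z using only two-operand instructions: copy y, then add z.
   Assumes dst and b are different registers. */
void add3_via_add2(ToyCpu *cpu, int dst, int a, int b) {
    mov(cpu, dst, a);
    add2(cpu, dst, b);
}
```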
In machine code, the operation is encoded as an enumerated value often called an icode.
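As a rough illustration of an operation encoded as an enumerated value, the sketch below invents a tiny encoding in which the high four bits of an instruction's first byte hold the icode and the low four bits hold a register number. The icode values, the bit layout, and the mnemonics are all assumptions made for this example, not taken from any real ISA.

```c
#include <stdint.h>

/* Hypothetical operation codes for a made-up ISA. */
typedef enum {
    ICODE_HALT = 0,
    ICODE_MOVE = 1,
    ICODE_ADD  = 2,
    ICODE_JUMP = 3
} Icode;

/* In this made-up encoding, the high 4 bits of the first byte of an
   instruction hold the icode. */
Icode decode_icode(uint8_t first_byte) {
    return (Icode)(first_byte >> 4);
}

/* The low 4 bits of the same byte hold a register number. */
int decode_register(uint8_t first_byte) {
    return first_byte & 0x0F;
}

/* Maps an icode to the mnemonic an assembler might use for it. */
const char *icode_name(Icode icode) {
    switch (icode) {
        case ICODE_HALT: return "halt";
        case ICODE_MOVE: return "move";
        case ICODE_ADD:  return "add";
        case ICODE_JUMP: return "jump";
        default:         return "unknown";
    }
}
```

A disassembler for this toy encoding would call `decode_icode` on each instruction's first byte and print the corresponding name, which is the machine-code-to-assembly direction in miniature.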
Assembly tends to make heavy use of a stack data structure. It’s used to store activation records that manage call and return. It’s also used to temporarily store data that can’t fit in the available program registers. Many functions have the general structure
Most instructions have one of the following operation types:
Load and store instructions are sometimes collectively called move instructions. x86-64 has instructions that combine a load or store with a compute; ARM does not.
Function calls are a combination of pushing (storing) the address to return to onto the stack and then jumping to the address of the first instruction of the function. Function returns are a combination of popping (loading) the address to return to from the stack and then jumping to that address.
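The push-then-jump and pop-then-jump pairing can be sketched as a toy machine in C. The `ToyMachine` type, its field names, and the 64-entry stack are hypothetical; the stack here grows upward and has no overflow checks purely to keep the sketch short.

```c
#include <stdint.h>

/* A toy machine state; sizes and field names are hypothetical. */
typedef struct {
    uint64_t pc;         /* address of the next instruction to run */
    uint64_t stack[64];  /* the stack, growing upward for simplicity */
    int      sp;         /* number of values currently on the stack */
} ToyMachine;

/* Call: push the return address, then jump to the function. */
void toy_call(ToyMachine *m, uint64_t function_address,
              uint64_t return_address) {
    m->stack[m->sp] = return_address;  /* push (store) */
    m->sp += 1;
    m->pc = function_address;          /* jump */
}

/* Return: pop the return address, then jump to it. */
void toy_return(ToyMachine *m) {
    m->sp -= 1;
    m->pc = m->stack[m->sp];           /* pop (load), then jump */
}
```

Because return addresses are pushed and popped in last-in, first-out order, nested calls unwind correctly: the most recent call returns first.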
Any instruction might be conditional, checking some condition and only doing its operation if the condition is true. Both x86-64 and ARM store conditions in a special register, often called the flags or condition codes, updating them in (some) compute instructions and checking them in subsequent conditional instructions. Thus it is common to see instructions like if negative, jump to address 0x1234 with no explicit indication of what we are checking for negativity.
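A small C model may make the flags mechanism more concrete. The `FlagCpu` type, its two flags, and the function names are hypothetical simplifications; real ISAs have more flags and more precise rules about which instructions update them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition codes recorded by the most recent compute instruction. */
typedef struct {
    bool zero;      /* last result was 0 */
    bool negative;  /* last result was less than 0 */
} Flags;

/* A toy CPU; the register count and names are hypothetical. */
typedef struct {
    int64_t  reg[4];
    uint64_t pc;
    Flags    flags;
} FlagCpu;

/* Compute instruction: dst = dst - src, updating the flags as a side effect. */
void sub(FlagCpu *cpu, int dst, int src) {
    cpu->reg[dst] -= cpu->reg[src];
    cpu->flags.zero = (cpu->reg[dst] == 0);
    cpu->flags.negative = (cpu->reg[dst] < 0);
}

/* Conditional jump: names no register, it only consults the flags.
   If the negative flag is clear, execution continues at fallthrough. */
void jump_if_negative(FlagCpu *cpu, uint64_t target, uint64_t fallthrough) {
    cpu->pc = cpu->flags.negative ? target : fallthrough;
}
```

Note that `jump_if_negative` takes no register argument: which value was negative is determined entirely by whichever compute instruction last set the flags.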
All ISAs have conditional jumps, as these are needed to implement programming language control constructs; other conditional instructions are also commonly provided and used by compilers to optimize certain common code patterns.
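To see how conditional jumps can implement a control construct, the sketch below writes one loop twice in C: once with `while`, and once using only labels, `if ... goto` (standing in for a conditional jump), and `goto` (standing in for an unconditional jump). The function names are invented for this example, and the jump version is only one plausible shape; a real compiler may arrange the jumps differently.

```c
/* Counts how many times n can be halved before reaching 0,
   written with a structured loop. */
int halvings_structured(int n) {
    int count = 0;
    while (n > 0) {
        n /= 2;
        count += 1;
    }
    return count;
}

/* The same logic using only labels, a conditional jump (if ... goto),
   and an unconditional jump (goto). */
int halvings_jumps(int n) {
    int count = 0;
loop_top:
    if (!(n > 0)) goto loop_done;  /* conditional jump out of the loop */
    n /= 2;
    count += 1;
    goto loop_top;                 /* unconditional jump back to the test */
loop_done:
    return count;
}
```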
Most ISAs operate as part of a system that allows multiple processors to access the same memory at the same time. For many years, ISAs left synchronizing memory accesses to an operating system, but modern ISAs also tend to have a small number of atomic operations that load, compute, and store with a guarantee that no other chip can modify the memory that was loaded before the store is completed.
Atomic operations are considerably more expensive to implement and execute than other operations, and as such are limited both in the set of atomic operations within the ISA and in when they are used.
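From C, atomic operations are usually reached through the standard `<stdatomic.h>` header (C11 and later) rather than written in assembly directly. The sketch below wraps two of them in functions whose names are invented here; whether each compiles to a single atomic instruction depends on the compiler and the target ISA.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically performs load, add 1, store on *counter and returns the
   value that was loaded. No other thread can modify *counter between
   the load and the store. */
int increment(atomic_int *counter) {
    return atomic_fetch_add(counter, 1);
}

/* Atomically: if *value still equals expected, store desired and return
   true; otherwise leave *value unchanged and return false. */
bool replace_if_unchanged(atomic_int *value, int expected, int desired) {
    return atomic_compare_exchange_strong(value, &expected, desired);
}
```

The compare-and-exchange form is a common building block: code can load a value, compute a replacement non-atomically, and then commit it only if no other thread changed the value in the meantime.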
The following intentionally do not use any real ISA, but instead describe the kinds of operations that ISAs would use to implement the given code.
C | Instructions |
---|---|
You’ll learn many more such translations in CS 421. Key ideas to remember for now:
There are various reasons why you might be called on to interact with assembly.
Debugging pre-compiled binaries.
Compiled executables are typically delivered without source code but can be disassembled to assembly. This might be necessary if your code is interacting with third-party libraries in unexpected ways. Most debuggers will disassemble automatically; the command-line command objdump -d executableFile will also output the assembly.
Binary security exploits.
Many security exploits work at the machine code level. Common security operations (both offense and defense) work above this level, but those working in security research often need to interact with assembly or even the specific bits of machine code.
Performance tweaks.
Compilers are very good at optimizing your code, but there’s a big gap between very good and perfect. If you really have to eke every cycle you can out of an application, you’re likely to either implement part of it in assembly or implement it in C but then look at what assembly the compiler is generating. The -S flag, like gcc -S myfile.c, will cause most compilers to output an assembly file instead of an executable.
Operating system implementation.
ISAs have some instructions that only the operating system is allowed to execute and which are not generated by most compilers because of that. If you ever work on an operating system, or on a bare-metal project designed to run on an embedded system without an operating system, you are likely to put in a few lines of assembly to use those operating-system-only instructions.