MP4
Custom Memory Allocator
Due dates may be found on the schedule.

Different programming languages have different ways of providing programmers with access to memory.

A Hierarchy of Memory Abstractions
  1. Hardware implements individual bits that can be read or written.
  2. Memory chips arrange the bits to be accessed as an array (rows) of arrays (blocks) of arrays (bytes) of bits.
  3. Caches wrap small-and-fast and large-and-slow memories in a redundant system that has high capacity and, on average, high speed.
  4. Memory management units break memory into pages (typically 4KiB) and allow the kernel to allocate pages to different applications and change what addresses reach each one. This also ensures that distinct applications cannot accidentally change one another’s memory and that the memory used by applications is returned to the system after they exit.
  5. Operating systems abstract address-contiguous sequences of pages into a few adjustable-size segments (stack, heap, code, globals, and so on), exposing assembly instructions like push and system calls like sbrk to find which addresses lie within a given segment and adjust segment size.
  6. Low-level libraries like malloc abstract managing memory within segments into simple allocate/deallocate pairs. C operates here.
  7. Language-level allocators like new merge allocating and initializing memory, a key to the Resource Acquisition Is Initialization (RAII) programming idiom. C++ mostly operates here.
  8. Language-enforced deallocation, usually through garbage collection but sometimes through compiler rules, removes most of the need to deallocate manually, eliminating the possibility of a wide range of memory-related errors. Java operates here.
  9. Language-managed allocations remove the need for the user to think about memory at all. Python operates here.

When a program wants to run code from an untrusted source, controlling that code’s access to memory is a key part of doing that safely. Often this means providing a simple, coarse-grained memory pool and requiring the untrusted code to manage that pool itself. Because of this, it is relatively common to have to implement your own memory allocation tools on top of such an underlying pool. Some applications also opt for no-malloc code to improve performance or to fit on resource-constrained embedded systems.

In this MP, we give you a big pool of memory to manage on your own. Your goal is to support all of the behaviors of malloc, free, and realloc without using those tools themselves, while minimizing memory used and maximizing runtime efficiency.

In the past, this course has had you implement malloc using sbrk. That was cool in that once you were done you had built the tool you use in your other C code. However, the debugger uses malloc internally, so if your malloc is buggy then the debugger doesn’t work. We want you to use debuggers, which are available for virtually every other programming task you’ll ever do, so we’ve changed it this semester to a version that lets you use the debugger.

1 Initial Files

mp4.zip contains a very simple, space-inefficient solution as well as VS Code project files and tests. You will modify allocator.c.

2 Naive implementation

We provide the simplest functionally-correct implementation we know of. It treats the heap like a stack; on allocation it pushes a new block on top of the stack and simply ignores all deallocations.

#include "allocator.h"
#include <string.h> // for memcpy

static void *base;  /// the smallest usable address
static size_t used; /// how many bytes have been used


/** Called once before any other function here.  *
 * Argument is the smallest usable address.      */
void allocator_init(void *newbase) {
  base = newbase;
  used = 0;
}

/** Called once before each test case.               *
 * Free any used memory and reset for the next test. */
void allocator_reset() { used = 0; }


/** Like malloc but using the memory provided to allocator_init */
void *mymalloc(size_t size) {
  // simplest version pushes each block on a stack
  void *ans = base + used;
  used += size;
  return ans;
}

/** Like free but using the memory provided to allocator_init */
void myfree(void *ptr) {
  // simplest version does nothing
}

/** Like realloc but using the memory provided to allocator_init */
void *myrealloc(void *ptr, size_t size) {
  // simplest version is malloc-copy-free
  if (!size) { myfree(ptr); return NULL; }
  void *ans = mymalloc(size);
  if (ptr && ans) {
    memcpy(ans, ptr, size); // should be min(size, oldsize) but oldsize unknown
    myfree(ptr);
  }
  return ans;
}

3 What you should do

Keep the functional correctness of the provided allocator but make it use less memory without adding too much time. That will probably involve at least the first six of the following steps. The steps in the following subsections are ordered so that earlier items are easier to implement than later ones and either have more impact on memory use than later ones or are required components of later ones. Going in order is strongly recommended.

Small testable steps, a professional practice

As you walk through these steps, you will find yourself doing something in step n and then partially undoing it as part of some later step m > n. This is part of a general pattern of effective software development: we don’t write large code all at once, we write it as a series of small changes, each individually testable.

It is possible to read all the steps, then try to do them all at once. Technically this could result in writing less code overall. Practically it would actually result in writing buggy code and then having no idea which part is incorrect. If you possibly can, you should always code in small testable steps, even if that means some of those steps involve a bit of extra work that you’ll undo later to get them into a testable configuration.

Making your own tests

The provided test cases in the workloads folder are intended to test both the functionality and performance of your code. Being useful cases for debugging your code was not a design goal of these workloads.

If your code crashes, you might want to write your own test file with its own main.c and so on that calls your functions. This can help you not get distracted by all the overhead of the memory tracking system in testharness.c.

If your code gives an allocator correctness error (such as allocated illegal address or allocated already-used memory), the test harness might be useful, but you should probably minimize the test case before debugging. This means copying the failing test into mytest.c and then removing as much of it as you can without making the test pass. In other words,

  1. Make mytest.c (when compiled to mytest.so and run using the test harness) fail.
  2. Repeat several times:
    1. Remove something from mytest.c (delete a line, reduce a loop count, etc).
    2. If mytest now passes, reinsert what you just removed.
  3. Debug the single mytest.so workload, using the Debug one workload run option, not Debug all workloads or Run all tests.

3.1 Metadata

Add metadata to each allocation. The usual way to do this is: when asked for an n-byte block, allocate an (n + sizeof(metadata))-byte block; store the metadata in the first sizeof(metadata) bytes and return a pointer to the first byte after that. [Figure: the metadata occupies the block’s smallest addresses; the returned void *result points to the n bytes that follow.]

If given a non-NULL ptr, look sizeof(metadata) before it to read the metadata for that block.

You’ll want to change your metadata as you add more features, but initially store the size of the block in it.

This will make your code take more time and need more space (except perhaps making size-increasing reallocs faster), but it will also unlock later changes.

With size metadata it becomes possible to display your entire memory: you know where the first block is, from its size you can find the next, and so on. You might find this useful in some debugging situations, and for doing step 5 below.
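
As a concrete sketch of this layout (the names metadata and dump_heap are ours, not anything the assignment requires; the pointer arithmetic used here is explained in the next subsection):

#include <stddef.h>
#include <stdio.h>

typedef struct metadata {
  size_t size; /// payload bytes, not counting this header
} metadata;

static void *base;  /// the smallest usable address
static size_t used; /// how many bytes have been used

void *mymalloc(size_t size) {
  metadata *header = (metadata *)(base + used);
  header->size = size; // record the size just before the payload
  used += sizeof(metadata) + size;
  return (void *)(header + 1); // first byte after the metadata
}

/** Walk every block from the start of the pool; handy when debugging. */
static void dump_heap(void) {
  void *p = base;
  while (p < base + used) {
    metadata *m = (metadata *)p;
    fprintf(stderr, "block at %p holds %zu bytes\n", (void *)(m + 1), m->size);
    p += sizeof(metadata) + m->size;
  }
}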

Pointer Arithmetic

The functions you write take void * arguments and return void * values. But they will be using some kind of metadata * pointers internally, computed from those void * by some kind of offset.

C and C++ both define ptr + offset to be synonymous with &(ptr[offset]): that is, both find a pointer that points offset values later in the hypothetical array pointed to by ptr. Thus, the byte offset depends on the pointer type:

  • Given int *a = 0x10000, a+1 == 0x10004
  • Given char *a = 0x10000, a+1 == 0x10001
  • Given int **a = 0x10000, a+1 == 0x10008

In pointer arithmetic, void * acts like a pointer to single bytes. Thus, for any type A, ((A *)ptr)-1 == ((void *)ptr) - sizeof(A).
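
Applied to the metadata layout sketched above, recovering a block’s header from the pointer the caller holds is one such conversion:

/** Find the header that sits sizeof(metadata) bytes before the payload. *
 * (A sketch; assumes the metadata struct from the previous sketch.)     */
static metadata *header_of(void *ptr) {
  return ((metadata *)ptr) - 1; // same address as (void *)ptr - sizeof(metadata)
}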

3.2 Shrink efficiently

Don’t copy data or change the pointer when decreasing the size of a block. Also don’t change its size: simply ignore shrink requests.

This will improve your code’s runtime and space.

This will cause the realloc_jitter test to use under 5,000 bytes instead of over 50,000.
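
One way to ignore shrinks, sketched here assuming the size-only metadata from step 1 and the includes from the naive version, is an early return in myrealloc; knowing the old size also fixes the over-long memcpy noted in the naive code:

void *myrealloc(void *ptr, size_t size) {
  if (!size) { myfree(ptr); return NULL; }
  if (!ptr) return mymalloc(size);
  metadata *m = ((metadata *)ptr) - 1; // the block’s stored size
  if (size <= m->size) return ptr;     // shrink: keep pointer and size as-is
  void *ans = mymalloc(size);          // grow: malloc-copy-free as before
  if (ans) {
    memcpy(ans, ptr, m->size); // copy only the old, smaller size
    myfree(ptr);
  }
  return ans;
}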

3.3 Grow at end of heap

Don’t copy data or change the pointer when increasing the size of the block at the end of the allocated memory.

This will improve your code’s runtime and space.

This will cause the small_resize test to use under 1,000 bytes instead of over 10,000.
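
A sketch of the in-place grow, again assuming the size-only metadata (a full version should also confirm the pool has room left):

/** If m is the last block on the heap, grow it in place; else NULL. *
 * The caller ensures size > m->size before calling this.            */
static void *grow_in_place(metadata *m, size_t size) {
  if ((void *)(m + 1) + m->size == base + used) { // nothing after this block
    used += size - m->size; // consume more of the pool, no copying
    m->size = size;
    return (void *)(m + 1);
  }
  return NULL; // not last: caller falls back to malloc-copy-free
}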

3.4 Shrink at end of heap

When deallocating (with myfree) or decreasing (with myrealloc) the size of the block at the end of the allocated memory, also return its memory to the unused memory set.

This will improve your code’s space.

This will cause the backstep test to use under 200,000 bytes instead of over 1,000,000 and mergesort_like to use under 200,000 bytes instead of over 700,000.
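
A sketch of the myfree half, assuming the size-only metadata:

void myfree(void *ptr) {
  if (!ptr) return;
  metadata *m = ((metadata *)ptr) - 1;
  if ((void *)(m + 1) + m->size == base + used) {
    used -= sizeof(metadata) + m->size; // last block: pop it off the stack
  }
  // other blocks are still leaked until step 5 starts reusing them
}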

3.5 Reuse free memory

When deallocating a block that has another block after it, mark it in some way (likely in the metadata) as unused. [Figure: three adjacent blocks starting at void *base, from small to big addresses: used, size=40 (40 bytes); free, size=20 (20 bytes); used, size=60 (60 bytes).] If a later allocation wants a block no larger than some current unused block, return the unused block instead of allocating a new block.

If you know block sizes and where the first block is, you can use that to walk the entire set of blocks looking for a large-enough unused block.

This will improve your code’s space but harm its runtime.

This will cause the chaos_reuse test to use under 1,000 bytes instead of over 20,000 and chaos_reuse_2 use under 10,000 bytes instead of over 100,000. However, it will slow down most tests, especially linked_list; on my computers it slows that test down to take several seconds, but I’ve had reports that on some machines it slows them to several minutes instead.
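
One first-fit sketch of this reuse, assuming a metadata struct extended with an in_use flag (our own name for it); myfree now flips the flag instead of doing nothing:

typedef struct metadata {
  size_t size;
  int in_use; /// 0 = free, nonzero = allocated (new in this step)
} metadata;

void myfree(void *ptr) {
  if (ptr) (((metadata *)ptr) - 1)->in_use = 0; // mark reusable
}

void *mymalloc(size_t size) {
  void *p = base; // first-fit: walk every block looking for a free one
  while (p < base + used) {
    metadata *m = (metadata *)p;
    if (!m->in_use && m->size >= size) {
      m->in_use = 1;
      return (void *)(m + 1); // reuse; its size stays as-is until step 8
    }
    p += sizeof(metadata) + m->size;
  }
  metadata *fresh = (metadata *)(base + used); // no fit: push a new block
  fresh->size = size;
  fresh->in_use = 1;
  used += sizeof(metadata) + size;
  return (void *)(fresh + 1);
}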

Skipping a test

You might want to skip long-running tests. There are at least three ways to do that.

  1. Run tests selectively, using the Test one workload run action or ./tester workloads/chaos_reuse.so command-line invocation.
  2. Temporarily edit the tests you want to skip, commenting out all lines in their mytest functions other than return 0;. To re-enable them, uncomment those lines.
  3. Move the tests (both .c and .so) out of the workloads/ folder. To re-enable them, move them back into that folder.

3.6 Free list

Avoid scanning through used blocks for unused blocks. Large linked data structures may have millions of small used blocks; scanning through them looking for unused blocks can be time-prohibitive.

You can avoid ever looking at a used block when hunting for unused blocks by creating some kind of linked data structure (linked list, tree, etc.) of unused blocks, with nodes stored inside the unused blocks (or their metadata) and a root/head/tail pointer as a global variable, then walking that structure on each allocation.

When creating this structure, it’s good practice to try to serve allocations with the most-recently-freed block first. This will tend to improve cache performance and is optimized for the common case where a free of an n-byte block is followed by an allocation of another n-byte block.

Remember to reset the structure when allocator_reset is called.

This will improve your code’s runtime, but may worsen its space slightly if it makes your metadata larger.

This should result in runtimes similar to step 4 and space utilization comparable to step 5.
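
A LIFO free-list sketch, storing each node inside the unused block’s own payload (the names are ours; note this requires every payload to be at least sizeof(freenode) bytes, so small requests must be rounded up):

typedef struct freenode { /// lives inside the payload of an unused block
  struct freenode *next;
} freenode;

static freenode *freelist_head; /// set back to NULL in allocator_reset

void myfree(void *ptr) {
  if (!ptr) return;
  metadata *m = ((metadata *)ptr) - 1;
  m->in_use = 0;
  freenode *n = (freenode *)ptr; // the freed payload stores the node itself
  n->next = freelist_head;
  freelist_head = n; // LIFO: the most-recently-freed block is tried first
}

void *mymalloc(size_t size) {
  if (size < sizeof(freenode)) size = sizeof(freenode);
  for (freenode **p = &freelist_head; *p; p = &(*p)->next) {
    metadata *m = ((metadata *)*p) - 1; // header of this free block
    if (m->size >= size) {
      *p = (*p)->next; // unlink the node we are about to reuse
      m->in_use = 1;
      return (void *)(m + 1);
    }
  }
  metadata *fresh = (metadata *)(base + used); // no fit: push a new block
  fresh->size = size;
  fresh->in_use = 1;
  used += sizeof(metadata) + size;
  return (void *)(fresh + 1);
}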

3.7 Block Merging

When marking a block as unused, check whether the block before and/or after it in memory is also unused; if so, merge them into one large unused block instead of leaving several smaller ones. Do this before the end-of-memory deallocation optimization (step 4).

Make sure this doesn’t mess up the links between unused blocks added in step 6. Also note that adjacent-in-memory blocks are rarely adjacent in the list of free blocks.

This is mostly a pair with splitting blocks, and in most cases the two will only result in space savings if they are both implemented correctly. However, they are sufficiently distinct operations that they are usefully implemented and tested independently.

This will cause small_resize_2 to use under 3,000 bytes instead of over 20,000; but it will cause several previous tests to lose their previous gains (for example, chaos_reuse and chaos_reuse_2 will use 10× more memory), and your code may stop meeting thresholds from earlier steps that it used to meet. Without the next step, this is a mixed win at best.
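
A sketch of merging with the physically-next block, assuming the metadata above (the symmetric merge with the block before needs a way to find that block, such as a heap walk or a size footer at the end of each block):

/** Absorb the free block physically after m, if any. The caller must *
 * first unlink that block from the free list built in step 6.        */
static void merge_with_next(metadata *m) {
  void *end = (void *)(m + 1) + m->size;
  if (end < base + used) {
    metadata *next = (metadata *)end;
    if (!next->in_use) {
      m->size += sizeof(metadata) + next->size; // one big block now
    }
  }
}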

3.8 Block splitting

When allocating into a large unused block, split that block in two: one part used for the new allocation, the other left unused. Do the same when shrinking a block during reallocation.

Make sure you include space for the metadata needed by both blocks when splitting; if there’s not room to split, don’t.

Make sure this doesn’t mess up the links between unused blocks. Also note that adjacent-in-memory blocks are rarely adjacent in the list of free blocks.

This will improve your code’s space.

This will keep the gains to small_resize and small_resize_2 from step 7 and return chaos_reuse and chaos_reuse_2 to under 1,000 and under 10,000 bytes, respectively. It will also cause backstep to use under 200,000 bytes instead of over 1,000,000 and backstep_2 to use under 1,500,000 bytes instead of more than that.
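
A splitting sketch under the same assumptions, splitting only when the remainder can hold its own header plus a free-list node:

static void maybe_split(metadata *m, size_t size) {
  if (m->size >= size + sizeof(metadata) + sizeof(freenode)) {
    metadata *rest = (metadata *)((void *)(m + 1) + size);
    rest->size = m->size - size - sizeof(metadata);
    rest->in_use = 0; // remember to also push rest onto the free list
    m->size = size;
  }
}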

3.9 Size-aware free list

Avoid looking at small unused blocks when you need a large one. Linked data structures with many adds and removes might result in millions of small unused blocks; when asking for a new large block it can take a very long time to scan past the small ones.

The usual fix for this is to have the linked data structure of unused blocks sort or group them by size. Note that this can make splitting and merging blocks much more complicated.

Because blocks are arranged by size, you have to pick which size of block to use for each allocation. Intuitively, picking the smallest big-enough free block makes sense, but this can result in many unused slivers if not carefully tuned with block splitting. Always picking (and splitting) the largest free block is much easier to get to work well.

This will improve your code’s runtime for some workloads but harm it for others. It will also adjust space used, sometimes up and sometimes down.

To get credit for step 9, you need to still meet all of the previous step’s space and time thresholds and also have fragmented_growing run in under 20ms and use less than 5MB.
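
One possible shape for this (a sketch with our own names; many groupings work) segregates free blocks into bins by power-of-two size class, so an allocation only visits bins whose blocks could possibly be big enough:

#define NUM_BINS 32
static freenode *bins[NUM_BINS]; /// bins[k] holds sizes in [2^k, 2^(k+1))

static int bin_index(size_t size) {
  int k = 0;
  while ((size >> (k + 1)) != 0 && k < NUM_BINS - 1) k += 1;
  return k;
}

// on free: push the block onto bins[bin_index(m->size)]
// on malloc: search bins[bin_index(size)] and then every larger bin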

3.10 Anticipate resizes

You don’t have to give the exact amount of memory requested: you can give more instead. If you do give more, then subsequent requests to increase the size might not require any work.

Giving extra memory can happen naturally when an unused block is larger than needed but too small to split. However, there are many other places you could choose to allocate more than requested, which can make different workloads use more or less memory overall. For example, a block that has grown once may well grow again, so reserving extra space when growing can pay off (see the sketch below).

This could either improve or harm runtime or space, depending on how it is implemented and what workloads are added.
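
One hypothetical policy of this kind reserves headroom whenever a grow forces a copy, so the next few grows of the same block succeed in place:

static size_t grow_reserve(size_t requested) {
  return requested + requested / 2; // 1.5x; the factor is a tuning knob
}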

3.11 Align allocations

Memory systems are built such that accessing a 4-byte value is faster if the address is a multiple of 4 than if it is not. Likewise, 2-byte values are faster if their address is a multiple of 2 and 8-byte values if their address is a multiple of 8; whether this also applies to 16- and 32-byte values varies. Giving addresses that have these multiple-of-a-power-of-two properties is called memory alignment.

Other-sized values are generally faster if they are aligned like the next higher power of 2; so an 18-byte value should be aligned like a 32-byte value. Arrays are generally fast when their elements are aligned; there’s no need to align the entire array.

Many allocators align memory by rounding up allocation sizes a little, resulting in faster code using the memory at the expense of allocating a bit more memory. This won’t have much impact on the allocator’s own performance, but will improve the performance of code using the allocator.
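
A common way to do the rounding (a sketch; payload alignment also depends on the header size and the alignment of the pool itself):

static size_t align8(size_t size) {
  return (size + 7) & ~(size_t)7; // round up to a multiple of 8
}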

4 Scoring

Implementing up through step 6 (free lists) is worth 100%. Partial credit is awarded for doing a subset of the above steps. More partial credit is earned for going in order: step 3 is worth more partial credit if steps 1 and 2 are also done than if they are not.

No points are earned if the allocator you submit is nonfunctional (i.e. fails to return non-overlapping large-enough memory with each mymalloc call). We strongly recommend keeping a copy of your last points-earning implementation so that you can revert to it if you get your code into a broken state.

Steps 7 + 8 (merging and splitting) are +5% extra credit, but only if 100% earned already. Step 9 (size-aware free tracking) is another +5% extra credit, but only if 105% earned already.

4.1 Contest

Steps 10 and 11 do not award points, but may help you win bragging rights for the best implementation. If you’d like to enter a space-and-time competition, do the following:

  1. Earn at least 100%

  2. On the first line of allocator.c, add a comment // SCREENNAME: anonymous replacing anonymous with a screen name of your choice.

    Keep this clean, inoffensive, and relatively short. Do not use the screen name to impersonate another member of the class.

  3. Put a test case in mytest.c that you believe your allocator will do better on than most other allocators. This needn’t be original: you can copy existing code (with a-> in front of all the allocations) if you wish, but if you do so include an acknowledgment of its source in a comment in the file.

    Make sure your test case terminates! Currently the competition system doesn’t time out individual test cases, so if yours has an infinite loop it could break the entire competition system.

Periodically, we will run all contestant allocators against all provided and contestant tests and post the space and time results here.