1 Initial Files

mp2.zip contains header files, testers, and a Makefile. You will add to utf8lib.c (and only that file).

utf8lib.h documents the functions you should add to utf8lib.c. We recommend first copying each function declaration to utf8lib.c with a minimal function body (such as return 0;) so that the code will compile and run, failing all tests. Then start implementing the functions one at a time.

2 UTF-8 Handling

Write functions that treats a char * as if it were a string of characters.

In C, char means 1-byte integer. It was named after characters based on the 1-byte encodings that were used when C was invented for Latin-derived languages, but char and character are distinct concepts.

In Unicode, character is an abstract entity like capital R with acute accent or orange heart. Unicode gives each character a number between 0 and 1,114,111 called it’s code point, like 340 for capital R with acute accent (Ŕ) or 129,505 for orange heart (🧡). UTF-8 encodes numbers as bytes, with small numbers using fewer bytes than big numbers, using up to 4 bytes for the highest code points.

Thus 1 character is encoded in UTF-8 using 1, 2, 3, or 4 chars.

Two UTF-8-handling functions require non-trivial coding to handle UTF-8’s various cases:

int decodeCharacter(const char **utf8) reads one code point from a UTF-8-encoded string, moving the pointer past it. Repeated calls will read all code points in a string.
int encodeCharacter(char **utf8, size_t *space, int codepoint) writes one code point encoded into UTF-8 into an array, moving the pointer past it and updating the remaining space. Repeated calls will fill an entire string with UTF-8 characters.

Why char **?

A char is one byte.

An array of char is many bytes all in a row.

A char * tells is where one byte in that array is located. Adding or subtracting to the char * moves us up or down the row.

We want to have multiple function calls work cooperatively. That means each needs to change the char *: after processing a byte the pointer moves past the byte. But C is call-by-value so each function invocation gets its own copy of each argument value: if we gave them a char * then changes to each function’s char * value would be lost when the function returns.

A char ** tells us where a char * is located. Instead of changing the char **utf8 itself, we change *utf8.

As an analogy, consider a group of TAs (functions) helping a group of students (chars). There’s a list of students wanting help (an array of char) and a position on that list where help should occur (a char *). If each TA were just told the position, they’d all visit the same student. Instead, if we put that position on a whiteboard and give the TAs the position of the whiteboard (a char **) then all of them can change the same position (by modifying *whiteboard: i.e. the value stored on the whiteboard), allowing them to work together effectively.

Two other functions can be fairly simple, especially if you use decodeCharacter to implement one of them.

size_t strlen8c(const char *s) is like strlen but returns a count of characters, not chars.
size_t strlen8i(const int *s) returns how many bytes would be needed to encode the given code points in UTF-8.

In C, char has an implementation-defined signedness character. That means that on one computer a char assigned the value 0x9A has the value +154, on another it has the value −102, making comparisons complicated. If your code needs to compare char values or store them in a larger type such as an int, you should manually cast them to either signed char or unsigned char first.

It is likely that your computer and the testing computer disagree on the signedness of char. If your code works for you but not when submitted, double-check that you’re never comparing plain chars, only signed chars or unsigned chars.

If your code handles char signedness correctly, you should be able to add either -fsigned-char or -funsinged-char to the ends of the CFLAGS line of Makefile (you can’t add both at once but either without the other should be OK) without changing which tests pass.

3 String replacements

Write a structure and functions that support find-and-replace in a string. Because of how UTF-8 is designed, if you write these thinking only about chars they will also work with UTF-8 characters.

A string replacement replaces one substring with another substring. A string replacement set has multiple substring replacements. These could be used for example to remove emoji (e.g. replace "🍞" with "bread"), create rebuses (e.g. replace "hear" with "🩷 − 🍵"¹¹ rebuses are image sequences encoding word puzzels often based on spelling; in this example, heart minus tea = hear), etc. For full credit, use an algorithm that is $O(n)$ in the length of the input string²² We do not specify how efficient it needs to be in the size of the set of replacements. If several replacements are possible, resolve them using the following rules³³ These rules are common to many replace-all operations in many languages and libraries, but don’t have a single standard name. I’ve seen them called earliest first, longest first match, earliest longest match, and many other phrases that don’t fully describe them.:

If two replacement options do not overlap, apply them both.
If two replacement options do overlap, apply the one that starts earlier in the string. If they start at the same position, apply the longer one.
Only match text in the initial string against new replacement options, not the outputs of prior replacements.

Suppose we had the following replacements to make:

Old	New
`"34"`	`"㉞"`
`"340"`	`"☺"`
`"4000"`	`"𐄥"`

Then the string "34000" would become "☺00": that’s an earlier match than "3𐄥" and the same earliness but a longer match than "㉞000".

Your code will need to handle multiple distinct replacement sets. To support that, you’ll define a replacement set structure for storing the desired replacements:

typedef struct _replacement_set_t ReplacementSet; in the .h file says there will be some structure defined in a .c file; we’re going to call it a ReplacementSet here. In the .c file we define it as struct _replacement_set_t { /*...*/ }; but can use it with the name ReplacementSet everywhere else (that’s what the typedef does for us).

We do not tell you what data structure to implement. After taking CS 225, we assume you have several options you know how to use (hash table, tree map, parallel arrays, list of pairs, trie, etc) and leave you to pick one of your choosing.

Because your replacement sets may need to allocate heap memory, you’ll implement two memory-management functions:

ReplacementSet *newReplacementSet() to allocate a new ReplacementSet; this function should probably call malloc.
void deleteReplacementSet(ReplacementSet *set) to deallocate an existing ReplacementSet; this function should probably call free.

The work of string replacement is done in two functions:

void addReplacement(ReplacementSet *set, const char *from, const char *to) adds a replacement to the replacement set. You’re welcome to store from and to directly in the ReplacementSet, but will probably need to allocate some memory to hold those pointers. Depending on your data structure choice, you might also want to process the strings in some fashion.
char *replaceAll(ReplacementSet *set, const char *input) allocates a new char array (using malloc) and applies all replacement options it can while copying input into it.

4 Debugging

The provided tester.c uses a special fork instruction to run two processes, one to run tests and the other to make sure the first doesn’t run for too long. In theory VS Code can debug this kind of code, but in practice sometimes it doesn’t seem to see break points when code is designed like this.

If you comment our all lines of tester.c’s main function except allTests();, the time-out functionality will be disabled and the debugger will have an easier time understanding your code. Note that these lines won’t be commented on the testing server, so you eventually need to pass the tests with those lines uncommented.

5 Submission and Grading

Submit on the submission site. Only submit utf8lib.c; do not modify any other file.

5.1 Grading

Remove any printouts from your code before submitting.

The tester.c program runs 92 tests worth 100 points. Of these, 70 are for the UTF processing, 20 are for substring replacement functionality, and 10 are for substring replacement efficiency. No-warnings compilation is required. Valgrind will also be run, and if it gives errors tester.c’s points will be reduced by 25%.