MP2
UTF-8
Due dates may be found on the schedule.

In this MP, you will work with strings in C. While there are other character encodings, most strings today are stored in UTF-8. Working with these will allow you practice with

1 Initial Files

mp2.zip contains header files, testers, and a Makefile. You will add to utf8lib.c (and only that file).

utf8lib.h documents the functions you should add to utf8lib.c. We recommend first copying each function declaration to utf8lib.c with a minimal function body (such as return 0;) so that the code will compile and run, failing all tests. Then start implementing the functions one at a time.

2 UTF-8 Handling

Write functions that treats a char * as if it were a string of characters.

In C, char means 1-byte integer. It was named after characters based on the 1-byte encodings that were used when C was invented for Latin-derived languages, but char and character are distinct concepts.

In Unicode, character is an abstract entity like capital R with acute accent or orange heart. Unicode gives each character a number between 0 and 1,114,111 called it’s code point, like 340 for capital R with acute accent (Ŕ) or 129,505 for orange heart (🧡). UTF-8 encodes numbers as bytes, with small numbers using fewer bytes than big numbers, using up to 4 bytes for the highest code points.

Thus 1 character is encoded in UTF-8 using 1, 2, 3, or 4 chars.

Two UTF-8-handling functions require non-trivial coding to handle UTF-8’s various cases:

Why char **?

A char is one byte.

An array of char is many bytes all in a row.

A char * tells is where one byte in that array is located. Adding or subtracting to the char * moves us up or down the row.

We want to have multiple function calls work cooperatively. That means each needs to change the char *: after processing a byte the pointer moves past the byte. But C is call-by-value so each function invocation gets its own copy of each argument value: if we gave them a char * then changes to each function’s char * value would be lost when the function returns.

A char ** tells us where a char * is located. Instead of changing the char **utf8 itself, we change *utf8.

As an analogy, consider a group of TAs (functions) helping a group of students (chars). There’s a list of students wanting help (an array of char) and a position on that list where help should occur (a char *). If each TA were just told the position, they’d all visit the same student. Instead, if we put that position on a whiteboard and give the TAs the position of the whiteboard (a char **) then all of them can change the same position (by modifying *whiteboard: i.e. the value stored on the whiteboard), allowing them to work together effectively.

Two other functions can be fairly simple, especially if you use decodeCharacter to implement one of them.

In C, char has an implementation-defined signedness character. That means that on one computer a char assigned the value 0x9A has the value +154, on another it has the value −102, making comparisons complicated. If your code needs to compare char values or store them in a larger type such as an int, you should manually cast them to either signed char or unsigned char first.

It is likely that your computer and the testing computer disagree on the signedness of char. If your code works for you but not when submitted, double-check that you’re never comparing plain chars, only signed chars or unsigned chars.

If your code handles char signedness correctly, you should be able to add either -fsigned-char or -funsinged-char to the ends of the CFLAGS line of Makefile (you can’t add both at once but either without the other should be OK) without changing which tests pass.

3 String replacements

Write a structure and functions that support find-and-replace in a string. Because of how UTF-8 is designed, if you write these thinking only about chars they will also work with UTF-8 characters.

A string replacement replaces one substring with another substring. A string replacement set has multiple substring replacements. These could be used for example to remove emoji (e.g. replace "🍞" with "bread"), create rebuses (e.g. replace "hear" with "🩷 − 🍵"1 rebuses are image sequences encoding word puzzels often based on spelling; in this example, heart minus tea = hear), etc. For full credit, use an algorithm that is O(n)O(n) in the length of the input string2 We do not specify how efficient it needs to be in the size of the set of replacements. If several replacements are possible, resolve them using the following rules3 These rules are common to many replace-all operations in many languages and libraries, but don’t have a single standard name. I’ve seen them called earliest first, longest first match, earliest longest match, and many other phrases that don’t fully describe them.:

Suppose we had the following replacements to make:

Old New
"34" "㉞"
"340" "☺"
"4000" "𐄥"

Then the string "34000" would become "☺00": that’s an earlier match than "3𐄥" and the same earliness but a longer match than "㉞000".

Your code will need to handle multiple distinct replacement sets. To support that, you’ll define a replacement set structure for storing the desired replacements:

Because your replacement sets may need to allocate heap memory, you’ll implement two memory-management functions:

The work of string replacement is done in two functions:

4 Debugging

The provided tester.c uses a special fork instruction to run two processes, one to run tests and the other to make sure the first doesn’t run for too long. In theory VS Code can debug this kind of code, but in practice sometimes it doesn’t seem to see break points when code is designed like this.

If you comment our all lines of tester.c’s main function except allTests();, the time-out functionality will be disabled and the debugger will have an easier time understanding your code. Note that these lines won’t be commented on the testing server, so you eventually need to pass the tests with those lines uncommented.

5 Submission and Grading

Submit on the submission site. Only submit utf8lib.c; do not modify any other file.

5.1 Grading

Remove any printouts from your code before submitting.

The tester.c program runs 92 tests worth 100 points. Of these, 70 are for the UTF processing, 20 are for substring replacement functionality, and 10 are for substring replacement efficiency. No-warnings compilation is required. Valgrind will also be run, and if it gives errors tester.c’s points will be reduced by 25%.