In this MP, you will work with strings in C. While there are other character encodings, most strings today are stored in UTF-8. Working with these will allow you practice with
char *
)mp2.zip
contains header files, testers, and a Makefile. You will add to utf8lib.c
(and only that file).
utf8lib.h
documents the functions you should add to utf8lib.c
. We recommend first copying each function declaration to utf8lib.c
with a minimal function body (such as return 0;
) so that the code will compile and run, failing all tests. Then start implementing the functions one at a time.
Write functions that treats a char *
as if it were a string of characters.
In C, char
means 1-byte integer.
It was named after characters based on the 1-byte encodings that were used when C was invented for Latin-derived languages, but
and char
character
are distinct concepts.
In Unicode, character is an abstract entity like capital R with acute accent
or orange heart.
Unicode gives each character a number between 0 and 1,114,111 called it’s code point, like 340 for capital R with acute accent (Ŕ) or 129,505 for orange heart
(🧡). UTF-8 encodes numbers as bytes, with small numbers using fewer bytes than big numbers, using up to 4 bytes for the highest code points.
Thus 1 character is encoded in UTF-8 using 1, 2, 3, or 4 char
s.
Two UTF-8-handling functions require non-trivial coding to handle UTF-8’s various cases:
int decodeCharacter(const char **utf8)
reads one code points from a UTF-8-encoded string, moving the pointer past it. Repeated calls will read all code point in a string.
int encodeCharacter(char **utf8, size_t *space, int codepoint)
writes one code point encoded into UTF-8 into an array, moving the pointer past it and updating the remaining space. Repeated calls will fill an entire string with UTF-8 characters.
char **
?
A char
is one byte.
An array of char
is many bytes all in a row.
A char *
tells is where one byte in that array is located. Adding or subtracting to the char *
moves us up or down the row.
We want to have multiple function calls work cooperatively. That means each needs to change the char *
: after processing a byte the pointer moves past the byte. But C is call-by-value so each function invocation gets its own copy of each argument value: if we gave them a char *
then changes to each function’s char *
value would be lost when the function returns.
A char **
tells us where a char *
is located. Instead of changing the char **utf8
itself, we change *utf8
.
As an analogy, consider a group of TAs (functions) helping a group of students (char
s). There’s a list of students wanting help (an array of char
) and a position on that list where help should occur (a char *
). If each TA were just told the position, they’d all visit the same student. Instead, if we put that position on a whiteboard and give the TAs the position of the whiteboard (a char **
) then all of them can change the same position (by modifying *whiteboard
: i.e. the value stored on the whiteboard), allowing them to work together effectively.
Two other functions can be fairly simple, especially if you use decodeCharacter
to implement one of them.
size_t strlen8c(const char *s)
is like strlen
but returns a count of characters, not char
s.
size_t strlen8i(const int *s)
returns how many bytes would be needed to encode the given code points in UTF-8.
In C, char
has an implementation-defined signedness character. That means that on one computer a char
assigned the value 0x9A
has the value +154, on another it has the value −102, making comparisons complicated. If your code needs to compare char
values or store them in a larger type such as an int
, you should manually cast them to either signed char
or unsigned char
first.
It is likely that your computer and the testing computer disagree on the signedness of char
. If your code works for you but not when submitted, double-check that you’re never comparing plain char
s, only signed char
s or unsigned char
s.
If your code handles char signedness
correctly, you should be able to add either -fsigned-char
or -funsinged-char
to the ends of the CFLAGS
line of Makefile
(you can’t add both at once but either without the other should be OK) without changing which tests pass.
Write a structure and functions that support find-and-replace in a string. Because of how UTF-8 is designed, if you write these thinking only about char
s they will also work with UTF-8 characters.
A string replacement replaces one substring with another substring. A string replacement set has multiple substring replacements. These could be used for example to remove emoji (e.g. replace "🍞"
with "bread"
), create rebuses (e.g. replace "hear"
with "🩷 − 🍵"
1 rebuses are image sequences encoding word puzzels often based on spelling; in this example, heart minus tea = hear), etc. For full credit, use an algorithm that is in the length of the input string2 We do not specify how efficient it needs to be in the size of the set of replacements. If several replacements are possible, resolve them using the following rules3 These rules are common to many replace-all operations in many languages and libraries, but don’t have a single standard name. I’ve seen them called earliest first
, longest first match
, earliest longest match
, and many other phrases that don’t fully describe them.:
Suppose we had the following replacements to make:
Old | New |
---|---|
"34" |
"㉞" |
"340" |
"☺" |
"4000" |
"𐄥" |
Then the string "34000"
would become "☺00"
: that’s an earlier match than "3𐄥"
and the same earliness but a longer match than "㉞000"
.
Your code will need to handle multiple distinct replacement sets. To support that, you’ll define a replacement set structure for storing the desired replacements:
typedef struct _replacement_set_t ReplacementSet;
in the .h
file says there will be some structure defined in a
In the .c
file; we’re going to call it a ReplacementSet
here..c
file we define it as struct _replacement_set_t { /*...*/ };
but can use it with the name ReplacementSet
everywhere else (that’s what the typedef
does for us).
We do not tell you what data structure to implement. After taking CS 225, we assume you have several options you know how to use (hash table, tree map, parallel arrays, list of pairs, trie, etc) and leave you to pick one of your choosing.
Because your replacement sets may need to allocate heap memory, you’ll implement two memory-management functions:
ReplacementSet *newReplacementSet()
to allocate a new ReplacementSet
; this function should probably call malloc
.
void deleteReplacementSet(ReplacementSet *set)
to deallocate an existing ReplacementSet
; this function should probably call free
.
The work of string replacement is done in two functions:
void addReplacement(ReplacementSet *set, const char *from, const char *to)
adds a replacement to the replacement set. You’re welcome to store from
and to
directly in the ReplacementSet
, but will probably need to allocate some memory to hold those pointers. Depending on your data structure choice, you might also want to process the strings in some fashion.
char *replaceAll(ReplacementSet *set, const char *input)
allocates a new char
array (using malloc
) and applies all replacement options it can while copying input
into it.
The provided tester.c
uses a special fork
instruction to run two processes, one to run tests and the other to make sure the first doesn’t run for too long. In theory VS Code can debug this kind of code, but in practice sometimes it doesn’t seem to see break points when code is designed like this.
If you comment our all lines of tester.c
’s main
function except allTests();
, the time-out functionality will be disabled and the debugger will have an easier time understanding your code. Note that these lines won’t be commented on the testing server, so you eventually need to pass the tests with those lines uncommented.
Submit on the submission site. Only submit utf8lib.c
; do not modify any other file.
Remove any printouts from your code before submitting.
The tester.c
program runs 92 tests worth 100 points. Of these, 70 are for the UTF processing, 20 are for substring replacement functionality, and 10 are for substring replacement efficiency. No-warnings compilation is required. Valgrind will also be run, and if it gives errors tester.c
’s points will be reduced by 25%.