Overview
To complete this ML, make sure you are familiar with the lectures on both regular expressions and lexing.
After completing this ML, you should understand how to implement a practical lexer using a lexer generator such as Lex. Hopefully you will also come to appreciate having lexer generators available, rather than having to code a lexer entirely from scratch.
The language we are building a lexer for is called PicoML, which is basically a subset of OCaml. It includes functions, lists, integers, strings, let expressions, etc.
Overview of Lexical Analysis (Lexing)
Recall from lecture that the process of transforming program code (i.e., as ASCII or Unicode text) into an abstract syntax tree (AST) has two parts. First, the lexical analyzer (lexer) scans over the text of the program and converts it into a sequence of tokens, usually as values of a user-defined disjoint datatype. These tokens are then fed into the parser, which builds the actual AST.
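For instance, a token type for a small language might be an OCaml datatype like the following (a hypothetical sketch; the actual PicoML token type is provided for you in ml4common.cmo):

type token =
  | INT of int        (* integer literals, e.g. 42 *)
  | IDENT of string   (* identifiers, e.g. x, foo *)
  | LET               (* the keyword "let" *)
  | PLUS              (* the + operator *)
  | EOF               (* end of input *)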
Note that it is not the job of the lexer to check for correct syntax - this is done by the parser. In fact, our lexer will accept (and correctly tokenize) strings such as "if if let let if if else", which are not valid programs.
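For example, assuming the token type uses constructors named IF, LET, and ELSE for those keywords (hypothetical names; check ml4common.cmo for the real ones), lexing that string would happily produce:

# get_all_tokens "if if let let if if else";;
- : Ml4common.token list = [IF; IF; LET; LET; IF; IF; ELSE]

Deciding that this token sequence is not a well-formed program is the parser's job, not the lexer's.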
Lexer Generators
The tokens of a programming language are specified using regular expressions, and thus the lexing process involves a great deal of regular-expression matching. It would be tedious to take the specification for the tokens of our language, convert the regular expressions to a DFA, and then implement the DFA in code to actually scan the text.
Instead, most languages come with tools that automate much of the process of implementing a lexer in those languages. To implement a lexer with these tools, you simply need to define the lexing behavior in the tool's specification language. The tool will then compile your specification into source code for an actual lexer that you can use. In this ML, we will use a tool called ocamllex to build our lexer.
ocamllex Specification
What follows is the short version of the ocamllex specification; see the online documentation for the full details. You will need to become especially familiar with ocamllex's regular expression syntax.
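For reference, here are some of the most common regex forms in ocamllex (this follows the standard ocamllex syntax; consult the manual for the full list):

'c'               (* matches the single character c *)
"while"           (* matches the literal string "while" *)
['a'-'z' '_']     (* character set: any lowercase letter or underscore *)
['0'-'9']+        (* one or more digits *)
regex1 regex2     (* concatenation: regex1 followed by regex2 *)
regex1 | regex2   (* alternation: either regex1 or regex2 *)
regex *           (* zero or more repetitions *)
regex ?           (* zero or one occurrence *)
eof               (* matches the end of input *)

You can also name a regex for reuse with a let binding, e.g. let digit = ['0'-'9'].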
ocamllex's lexer specification is slightly reminiscent of an OCaml match statement:

rule myrule = parse
  | regex1 { action1 }
  | regex2 { action2 }
  ...
When this specification is compiled, it creates a recursive function called myrule that does the lexing. Whenever myrule finds something that matches regex1, it consumes that part of the input and returns the result of evaluating the expression action1. In our code, the lexing function should return the token it finds.
Here is a quick example (here the lexing rule is named mylexer):

rule mylexer = parse
  | [' ' '\t' '\n'] { mylexer lexbuf }
  | ['x' 'y' 'z']+ as thingy { ... thingy ... }
The first rule says that any whitespace character (a space, tab, or newline) should be ignored. lexbuf is a special object that represents "the rest of the input" - the stuff after the whitespace that was just matched. By writing mylexer lexbuf, we recursively call our lexing rule on the remainder of the input and return its result. Since we do nothing with the whitespace that was matched, it is effectively ignored.
The second rule shows a named regex. By naming the regex like this, whatever string matched the regex is bound to the name thingy and available inside the action code for this rule (as is lexbuf as before). Note that you can also name just parts of the regex. The return value from this action should somehow use the value of thingy.
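For instance, a rule that turns a run of digits into an integer token might look like this (a sketch, assuming a token constructor INT that carries an int):

| ['0'-'9']+ as num { INT (int_of_string num) }

Here num is bound to the matched string, so int_of_string num converts it to the integer the token carries.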
You can also define multiple lexing functions - see the online documentation for more details (they are referred to as "entrypoints"). Then from the action of one rule, you can call a different lexing function. Think of the lexer on the whole as being a big state machine, where you can change lexing behaviors based on the state you are in (and transition to a different state after seeing a certain token). This is convenient for defining different behavior when lexing inside comments, strings, etc.
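As a sketch, a lexer that skips (non-nested) comments might use two entrypoints; the rule names and the EOF token here are hypothetical:

rule token = parse
  | [' ' '\t' '\n'] { token lexbuf }    (* skip whitespace *)
  | "(*"            { comment lexbuf }  (* enter the comment state *)
  | eof             { EOF }             (* end of input *)
and comment = parse
  | "*)"            { token lexbuf }    (* comment over; resume normal lexing *)
  | _               { comment lexbuf }  (* discard anything inside the comment *)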
Provided Code
- ml4common.cmo contains the definition of the type for our tokens, as well as a type and exceptions to support the completion of Problem 2.
- ml4.mll is the skeleton for the lexer specification. token is the name of the lexing rule that is already partially defined. You may find it useful to add your own helper functions to the header section. The footer section defines the get_all_tokens function that drives the lexer, and should not be changed. The overall layout of an .mll file is sketched below. Modify ml4.mll, and commit it to svn in assignments/ml4 as usual.
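For reference, an ocamllex file has the following general shape (a schematic, not the actual contents of ml4.mll):

{ (* header: arbitrary OCaml code - opens, helper functions, etc. *) }

let digit = ['0'-'9']   (* optional named regex abbreviations *)

rule token = parse
  | digit+ as num { INT (int_of_string num) }   (* hypothetical rule *)

{ (* footer: arbitrary OCaml code, e.g. get_all_tokens *) }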
Compiling & Testing
To compile your lexer specification to OCaml code, use the command

ocamllex ml4.mll
This creates a file called ml4.ml (note the slight difference in names). Then you can run tests on your lexer in OCaml using the token function that is already included in ml4.mll. To see all the tokens producible from a string, use get_all_tokens.
# #use "ml4.ml";;
...
# get_all_tokens "some string to test";;
- : Ml4common.token list =
[IDENT "some"; IDENT "string"; IDENT "to"; IDENT "test"]