A̵d̵v̵a̵n̵c̵e̵d̵ ̵F̵u̵n̵c̵t̵i̵o̵n̵a̵l̵ Compiler Programming for OO/Java Developers, Appendix A: DSL Parser Generation Using JavaCC and Syntactic Sugar

I’ve run a series of posts on functional programming for Java developers, and had actually planned to stop it at part 4, but it turns out I still have (a bit) more to say, so here’s a small appendix (and there’s likely a fifth part coming as well sometime). The first four parts introduced various themes from the functional world that might be useful for OO/Java developers, by building and gradually expanding a simple expression language. Throughout the series, we constructed expressions by explicitly instantiating the domain classes, which was far from readable. In this appendix, we will look at a way of automatically generating such domain models from a textual description.

The “modern” way to read a domain model from a textual representation is to use one of the generic formats (XML, JSON, YAML) and a generic parser. This is very developer-friendly, but actual humans (and even most developers) do not really enjoy writing these formats (one is too verbose, one is too under-specified, and one will yell whitespace errors at you if you look at it wrong). The alternative often used is to write ad-hoc parsers using String.split or regular expressions. These parsers are typically error-prone, difficult to extend, and provide little diagnostics to help the end-user fix badly formatted input. The old-fashioned way is to use a dedicated parser generator tool to generate a custom parser. I rarely see this done, I believe, because a lot of people are not aware that such tools exist, or incorrectly believe them to be complex to use.

In this appendix, we’ll not dig into any more functional concepts; instead I’ll introduce a simple and powerful parser generator for Java, along with the underlying concepts necessary to understand how to use it. There are many different tools for this; in the past I used a variant of the Java CUP Parser Generator, which is perfectly fine and I have no negative things to say about it. More recently, though, I needed to make a small domain-specific language (DSL) for a client and stumbled upon JavaCC, which is even easier to use than CUP, so that’s what we’ll use in this post. The end goal is to write a parser that allows us to enter the expressions developed in part 1 of the series as text and automatically obtain a domain model. We will also take a quick look at how we can use a parser generator to implement syntactic sugar (see part 3 for other ways to deal with syntactic sugar).

Traditionally, a parser comprises two parts: a lexer/scanner and the parser proper. The lexer/scanner is roughly responsible for collecting individual characters into words, and the parser for collecting words into sentences. Historically, these two steps were handled by two different tools; nowadays most parser generators support both steps, though it is possible to bring your own scanner to JavaCC (and IIRC, also to CUP).
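
To make the split concrete, here is roughly what the two stages do to the input "1 + 2 * 3" (a sketch; Plus, Times, and Constant are the domain classes from part 1):

// Scanner: "1 + 2 * 3" becomes the token stream
//   NUMBER("1")  PLUS  NUMBER("2")  TIMES  NUMBER("3")
// Parser: the token stream becomes the domain model we used to build by hand:
Expression<Integer> e =
  new Plus<Integer>(new Constant<Integer>(1),
    new Times<Integer>(new Constant<Integer>(2), new Constant<Integer>(3)));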

Lexer/scanner

The lexer/scanner (henceforth just scanner) is responsible for reading characters and constructing “words,” called symbols or tokens (henceforth tokens). Tokens are things that build up our expressions, similar to how words, spacing and punctuation build up sentences. Our expressions are relatively simple: they contain numbers, variables, addition, multiplication, assignments, and optionally parentheses. We would most often define tokens for all those words. We start with the operators and parentheses:

TOKEN :
{
  < LPAREN : "(" >
| < RPAREN : ")" >
}

TOKEN :
{
  < PLUS : "+" >
| < TIMES : "*" >
| < EQUALS : ":=" >
}

All that happens here is that we assign names to different strings. We also need to define numbers and variables. While the operators are just simple strings, we’ll use regular expressions for numbers and variables:

TOKEN :
{
  < ID : ([ "a"-"z", "A"-"Z" ])+ >
| < NUMBER :
    (
      ("-")? ([ "0"-"9" ])+
    ) >
}

The regexp syntax is a bit non-standard, but relatively simple. See the JavaCC examples for more information. We just state that an ID is any sequence of one or more lowercase or uppercase letters and a NUMBER is an optional minus followed by one or more numerical digits.

Finally, we want to tell the scanner to skip any whitespace. Languages that assign meaning to whitespace are dumb. We could just create special tokens for whitespace, but in reality we just want to skip them: they are what is between what we are interested in, not what we are interested in. We therefore declare whitespace like this:

SKIP :
{
  " "
| "\t"
| "\n"
| "\r"
| "\f"
}

Note that we declare the whitespace tokens as SKIP instead of TOKEN. This means they never interfere with our parsing and are simply consumed. It does have a little impact on the tokenization, though: “hello world” is parsed as two ID tokens, while “helloworld” is parsed as a single ID. “hello123” and “hello 123” are both parsed as an ID followed by a NUMBER in our language.
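
If you want to see the token stream for yourself, the scanner can be driven on its own once the parser (which we will name ExpressionParser below) has been generated. A minimal sketch, assuming JavaCC’s standard generated class names (with JAVA_UNICODE_ESCAPE on, the character stream class is JavaCharStream):

// Printing the token stream for "hello 123".
ExpressionParserTokenManager tm = new ExpressionParserTokenManager(
    new JavaCharStream(new StringReader("hello 123")));
for (Token t = tm.getNextToken(); t.kind != ExpressionParserConstants.EOF; t = tm.getNextToken())
{
  // tokenImage maps a token kind back to a printable name, e.g., <ID>
  System.out.println(ExpressionParserConstants.tokenImage[t.kind] + ": " + t.image);
}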

This is all that is needed to get you started with the scanner. We’ll now go over some more advanced topics to also deal with comments, but if you are not interested in that, just skip ahead to the next section.

To deal with comments, we want to completely ignore the regular lexical rules. If we have a Java comment like /* 1 + 2 */ we want to ignore the 1 + 2, and if we encounter /* 1 + */ we do not want to report an error (even though “1 +” is an invalid expression). To do that, we make use of lexical states. Whenever we see the beginning of a comment, we switch to a special state where all the regular rules are ignored, and then we switch back when we reach the end of the comment. This is how to implement simple Java-style comments:

MORE :
{
  "//" : IN_SINGLE_LINE_COMMENT
| 
  < "/**" ~[ "/" ] >
  {
    input_stream.backup(1);
  }
  : IN_FORMAL_COMMENT
| 
  "/*" : IN_MULTI_LINE_COMMENT
}

< IN_SINGLE_LINE_COMMENT >
SPECIAL_TOKEN :
{
  < SINGLE_LINE_COMMENT :
    "\n"
  | "\r"
  | "\r\n" > : DEFAULT
}

< IN_FORMAL_COMMENT >
SPECIAL_TOKEN :
{
  < FORMAL_COMMENT : "*/" > : DEFAULT
}

< IN_MULTI_LINE_COMMENT >
SPECIAL_TOKEN :
{
  < MULTI_LINE_COMMENT : "*/" > : DEFAULT
}

< IN_SINGLE_LINE_COMMENT, IN_FORMAL_COMMENT, IN_MULTI_LINE_COMMENT >
MORE :
{
  < ~[ ] >
}

The first set of tokens matches the start of a single-line comment, a JavaDoc comment, and a regular block comment. The JavaDoc rule matches "/**" plus one more character that is not "/", ensuring that it does not match "/**/"; the call to input_stream.backup(1) then pushes that extra character back into the stream. For each of them, we use the token type MORE, which means the matched data is accumulated for inclusion once we match a complete token. The next set of tokens matches the end of single-line comments by matching an end-of-line character in Unix, Mac OS (before X), or DOS format. After matching this, we go back to the DEFAULT state and are ready to match expression tokens again. We do the same for JavaDoc and block comments in the following two groups. All comments produce a SPECIAL_TOKEN, which means they are not considered when parsing our expressions (but we can later get access to them). We could also just use SKIP for all the comment handling, but if we want the comments for printing later, it is useful to save them. Finally, if we are inside any kind of comment, we match any single character and append it to the token.
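
Because comments become special tokens rather than being skipped, a later pretty-printer can still recover them: every Token has a public specialToken field pointing to the special token immediately preceding it, and special tokens chain to each other the same way. A minimal sketch:

// Walking backwards from a regular token t to the comments preceding it
// (this visits them in reverse order):
for (Token s = t.specialToken; s != null; s = s.specialToken)
{
  System.out.println("Comment before '" + t.image + "': " + s.image);
}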

Having multiple lexical states is useful not only for comments, but also for parsing files that mix multiple languages, e.g., the tag-soup known from PHP, JSP/JSF, or Angular (PHP does not actually care about the HTML it is embedded in, but JSP/JSF and Angular do). Such languages could have one state for parsing HTML and one for parsing PHP/Java/TypeScript. Multiple lexical states are also useful for languages with string interpolation (e.g., Kotlin, Swift, Angular, or close to any language other than Java). Such a language would not parse a string as a single token, but would jump to a dedicated lexical state at the beginning of a string, jump back on interpolation (or to a special restricted language supported inside string interpolation), back to the string state when the interpolation ends, and finally back to the main language once the string ends.
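
As a purely hypothetical sketch (none of the following is part of our expression language), the lexical states for a string with ${...} interpolation could start out like this:

TOKEN : { < STRING_START : "\"" > : IN_STRING }

< IN_STRING >
TOKEN :
{
  < INTERP_START : "${" > : DEFAULT
| < STRING_END : "\"" > : DEFAULT
| < STRING_PART : (~[ "\"", "$" ])+ >
}

Jumping back from DEFAULT to IN_STRING when the interpolation’s closing "}" is reached is the tricky part: JavaCC has no built-in stack of lexical states, so it would have to be done from a parser action via token_source.SwitchTo(IN_STRING).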

Parser

To generate our parser, we use a syntax similar to a context-free grammar or BNF. In reality we cannot parse every context-free grammar, but unlike in the old days you almost don’t have to worry about it. In the “good” old days, most parser generators supported a language family called LALR, which meant having to take care when constructing lists. JavaCC instead generates an LL(k) parser, which will handle almost everything you throw at it, except for one restriction that will come to bite us soon enough.

To write our language, we have to make non-terminals or productions. They describe how we construct new types from simpler types, eventually going all the way down to tokens. Let us start with two simple productions:

String Variable() :
{
  Token t;
}
{
  t = < ID >
  {
    return t.image;
  }
}

int Constant() :
{
  Token t;
}
{
  t = < NUMBER >
  {
    return Integer.parseInt(t.image);
  }
}

Here, we defined two productions. Lines 1 and 12 define the individual productions: the production Variable has type String and the production Constant has type int. Each production comprises two blocks: the first declares variables used during parsing, and the second describes the parsing itself. Both productions declare a single variable to hold a Token. The Variable production matches a single ID token and stores it in t (l. 6). When that token is matched, the name of the ID (the image of the Token) is extracted and returned (ll. 7-9). Constant works similarly: it matches a NUMBER token, but converts it to an int before returning.
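
Each production becomes a public method on the generated parser class (which we will set up at the end of this post), so productions can also be invoked directly. A minimal sketch, assuming the parser is named ExpressionParser as it will be below:

// Productions are just methods on the generated parser.
static int parseConstant(String s) throws ParseException
{
  return new ExpressionParser(new StringReader(s)).Constant();
}
// parseConstant("42") returns 42; parseConstant("abc") throws a ParseException.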

Let us look at the production for expressions:

Expression < Integer > Expression() :
{
  Expression < Integer > a, b;
}
{
  a = Term()
  (
    < PLUS > b = Term()
    {
      a = new Plus < Integer > (a, b);
    }
  )*
  {
    return a;
  }
}

The Expression production returns a result of type Expression<Integer>; the spacing (enforced by the Eclipse JavaCC plugin) combined with the double appearance of Expression as both the type and the name of the production makes this a bit hard to decipher at first. The production defines two variables holding Expressions (essentially the left and right side of an addition). When parsing an expression, we look for a Term a (l. 6) followed by any number (ll. 7-12) of occurrences of the token PLUS followed by a Term b (l. 8). Whenever a PLUS and another Term are encountered, we wrap the existing a and the new b into a Plus expression, which we assign to a. When we no longer match PLUS and Terms, we return a. We can think of a as an accumulator for the result and b as an intermediate variable used to build it. This is a common pattern for matching lists of things in JavaCC.

We might be surprised to see a Term and not another Expression. This is the one limitation of an LL(k) parser I mentioned: no production may start with something that starts with itself, a restriction known as the ban on left recursion. This in particular means that the Expression production cannot start by matching an Expression (and since an Expression can start with a Term, a Term may not start with an Expression nor a Term). For that reason, we need to define a hierarchy of productions explicitly in the parser, reflecting the priority of our operators. Normally, multiplication binds tighter than addition, so they need to be on different levels. LALR parser aficionados instead have to play around with shift-reduce conflicts to solve this problem. To see why left recursion is fatal, here is what the naive left-recursive version of Expression would look like; JavaCC refuses to generate a parser for it, as the method for Expression would have to begin by calling itself without having consumed any input:
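
Expression < Integer > Expression() :
{
  Expression < Integer > a, b;
}
{
  a = Expression() < PLUS > b = Term()
  {
    return new Plus < Integer > (a, b);
  }
| a = Term()
  {
    return a;
  }
}

Since we have a hierarchy of expression blocks, let’s move on to the Terms: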

Expression < Integer > Term() :
{
  Expression < Integer > a, b;
}
{
  a = Factor()
  (
    < TIMES > b = Factor()
    {
      a = new Times < Integer > (a, b);
    }
  )*
  {
    return a;
  }
}

We see that Term is virtually identical to Expression, except that it matches multiplication and uses a Factor production instead. Terms also have type Expression<Integer>: while a production cannot start with itself, different productions are free to return the same type.

In Factor, something new happens:

Expression < Integer > Factor() :
{
  int c;
  String v;
  Expression < Integer > a, b;
}
{
  c = Constant()
  {
    return new Constant < Integer > (c);
  }
| v = Variable() < EQUALS > a = Expression() ";" b = Expression()
  {
    return new Assignment < Integer > (v, a, b);
  }
| v = Variable()
  {
    return new Variable < Integer > (v);
  }
| < LPAREN > a = Expression() < RPAREN >
  {
    return a;
  }
}

Aside from defining more variables, a Factor also has multiple alternatives in the second block. Alternatives are separated by a pipe/bar.

A Factor can be a Constant c (l. 8), which is returned wrapped as a Constant expression (ll. 9-11), or it can be a Variable v (l. 16), returned as a Variable expression (ll. 17-19).

A Factor can also be an Expression inside parentheses (l. 20), which is returned as is, since our language has no notion of parentheses. Parentheses are needed in written expressions to distinguish “(1 + 2) * 3” from “1 + (2 * 3)”: when we write “1 + 2 * 3”, our parser prefers the second interpretation, so we need parentheses to input the first if that is what we mean. In our Expression classes this is unnecessary, as they are always explicitly nested.

Finally, a Factor may be an assignment (l. 12). An assignment is a Variable followed by EQUALS, an Expression, a semicolon, and a final Expression. We note that we can use a string constant ";" directly instead of creating an explicit token if we want. We also see that we can refer to Expression inside Factor without issues, as it is not the first element of the production.

We also add an explicit top level production:

Expression < Integer > Top() :
{
  Expression < Integer > a;
}
{
  a = Expression() < EOF >
  {
    return a;
  }
}

The Top production is just an Expression followed by the predefined token EOF. This is not strictly necessary, but forces the parser to parse everything we give it. Otherwise it would parse “1 +” as just “1” which is the longest legal prefix that can be parsed. By forcing reading EOF as well, “1 +” will yield an error.

And that’s all there is to making our parser. We need a bit of boilerplate at the beginning of our description:

options
{
  LOOKAHEAD = 2;
  JAVA_UNICODE_ESCAPE = true;
  STATIC = false;
}

PARSER_BEGIN(ExpressionParser)
package scratch;
import java.io.StringReader;
import scratch.Continuations.*;

public class ExpressionParser
{
  public static Expression < Integer > parse(String input) throws ParseException
  {
    return new ExpressionParser(new StringReader(input)).Top();
  }
}

PARSER_END(ExpressionParser)

The first block provides options to JavaCC. LOOKAHEAD makes the parser look two tokens ahead during parsing; technically, this means we are generating an LL(2) parser. This is needed when parsing a Factor: both the assignment (l. 12) and the variable (l. 16) alternatives start with a Variable (which is really an ID), making it ambiguous which one to try to produce. By looking at the next token (either EQUALS or something else), the parser can decide between parsing an assignment and finishing a variable expression, so it is beneficial to look two tokens ahead instead of just one, which is the default. If we do not set the LOOKAHEAD option, JavaCC will default to 1 and helpfully tell us where more lookahead is really needed. The other two options make the generated parser easier to re-use and better Java code, so just use them.
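
As an aside, instead of raising the lookahead globally, JavaCC also supports a local LOOKAHEAD(2) placed on just the ambiguous alternative, keeping the rest of the parser LL(1). A sketch of how the two Variable-initiated alternatives in Factor could be written instead:

  LOOKAHEAD(2)
  v = Variable() < EQUALS > a = Expression() ";" b = Expression()
  {
    return new Assignment < Integer > (v, a, b);
  }
| v = Variable()
  {
    return new Variable < Integer > (v);
  }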

Then comes the block defining the parser. We define a parser named ExpressionParser, instructing JavaCC to put the generated parser in a class ExpressionParser. We provide it with the necessary imports (we import the Expression classes from the Continuations.java file you can download near the end of part 4) and add a simple static parse method that wraps a string input and returns the result of parsing the Top production.

Compiling our parser using JavaCC yields 7 Java classes. It is possible that the generated ExpressionParser class doesn’t compile if we have made an error in our description, but when all is ready, we can use the new parser like this:

	void parse() throws ParseException {
		evaluate("1");
		evaluate("1 + 2");
		evaluate("1+2");
		assertThrows(ParseException.class, () -> evaluate("1+"));
		evaluate("1 + 2 * 3");
		evaluate("1+2*3");
		evaluate("1 + (2 * 3)");
		evaluate("(1 + 2) * 3");
		evaluate("1 * 2 + 3");
		evaluate("(1 * 2) + 3");
		evaluate("1 * (2 + 3)");
		evaluate("a := 5 + 2; a * a + 3");
		assertThrows(IllegalArgumentException.class, () -> evaluate("a * a + 3"));
	}

	public void evaluate(String input) throws ParseException {
		System.out.println("Raw: " + input);
		try {
			Expression<Integer> e = ExpressionParser.parse(input);
			System.out.println("Parsed: " + e.visit(new FancyExtendedToString<>()).apply(new StringBuilder()));
			System.out.println("Evaluated: " + e.visit(new FancyIntegerAlgebra()).apply(FunctionMap.empty()));
		} catch (ParseException e) {
			System.out.println("Error parsing: " + e.getMessage());
			throw e;
		} catch (IllegalArgumentException e) {
			System.out.println("Failed evaluating");
			throw e;
		}
	}

We set up a number of tests (ll. 2-14), all using a single helper method (ll. 17-30). The helper method prints the given expression (l. 18) and tries parsing it using our generated parser (l. 20). If successful, it prints the expression using the FancyExtendedToString developed in part 1 (l. 21), and finally it computes the value of the expression using the FancyIntegerAlgebra from the same part (l. 22). Parsing can fail with a ParseException if we provide invalid input (e.g., l. 5), and evaluation can fail with an IllegalArgumentException if we use an undefined variable (e.g., l. 14).

We test different kinds of expressions with different spacing and use of parentheses and see they all yield the expected results.

Syntactic Sugar

With our parser good to go, we could in principle call it quits. But we can also implement a bit of syntactic sugar that makes our language more useful at the parser level, without having to define new expression classes or visitors. Let’s start out simple by defining fancy numbers. Java made a big deal of Java 7 supporting underscores in number literals to make them easier to read (e.g., 1_000_000). Java also supports hexadecimal numbers prefixed with 0x. Let’s implement that:

TOKEN :
{
  < ID : ([ "a"-"z", "A"-"Z" ])+ >
| < NUMBER :
    (
      ("-")? ([ "0"-"9" ])+
    ) >
| < FANCY_NUMBER :
    (
      ("-")? ([ "_", "0"-"9" ])+
    ) >
| < HEX_NUMBER :
    (
      "0x" ([ "_", "0"-"9", "a"-"f", "A"-"F" ])+
    ) >
}

int Constant() :
{
  Token t;
}
{
  t = < NUMBER >
  {
    return Integer.parseInt(t.image);
  }
| t = < FANCY_NUMBER >
  {
    return Integer.parseInt(t.image.replaceAll("_", ""));
  }
| t = < HEX_NUMBER >
  {
    return Integer.parseInt(t.image.substring(2).replaceAll("_", ""), 16);
  }
}

We add two new tokens, FANCY_NUMBER and HEX_NUMBER, to represent the new number forms, and two more alternatives to the Constant production. The first new alternative matches a FANCY_NUMBER and removes all underscores before converting it to a number. The second matches a HEX_NUMBER, cuts off the 0x prefix, removes all underscores, and converts it to a number using base 16. Now we can use the new numbers like this:

		evaluate("1_000_000");
		evaluate("0x1234");
		evaluate("0xBABE + 1_000");

We see that the parsed data now starts looking quite different from the input. 1_000_000 is converted to 1000000 as we’d expect, and 0x1234 is converted to the decimal number 4660. These fancy numbers can be used anywhere a normal number can, so 0xBABE + 1_000 is translated to 47806 + 1000, which correctly evaluates to 48806. The original NUMBER is now in principle superfluous and could be removed.
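
The conversions in Constant are plain JDK calls, so the arithmetic is easy to check by hand:

// The same conversions the new Constant alternatives perform:
System.out.println(Integer.parseInt("1_000_000".replaceAll("_", "")));  // 1000000
System.out.println(Integer.parseInt("0x1234".substring(2), 16));        // 4660
System.out.println(Integer.parseInt("0xBABE".substring(2), 16));        // 47806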

We can also make more fundamental changes than just messing with tokens. For example, in part 3 we defined a square function in various ways. We can do something similar here, but let’s first look at implementing a difference operation. Our expression language has no support for that, but it is easy to see that “a - b” is the same as “a + (-1 * b)”, and this is easily implemented by slightly changing the Expression production:

Expression < Integer > Expression() :
{
  Expression < Integer > a, b;
}
{
  a = Term()
  (
    < PLUS > b = Term()
    {
      a = new Plus < Integer > (a, b);
    }
  | "-" b = Term()
    {
      a = new Plus < Integer > (a, new Times < Integer > (new Constant < Integer > (- 1), b));
    }
  )*
  {
    return a;
  }
}

Only lines 12-15 are new; they define an alternative matching a minus sign (l. 12). We should have used a token here, but for brevity of presentation in this post it is a string constant instead. The magic happens in line 14, where we produce a more complex expression than what was literally written. Since minus and plus have the same precedence, we do not need to introduce another level of productions.

We can use this new feature as:

		evaluate("7 - 5");
		evaluate("7 - (5 + 3 * 2)");
		evaluate("a := 5; 7 - a");

While it seems to the end-user as if minus is a native operation, our evaluator visitor only sees the operations it is accustomed to, plus and times, and evaluates everything as we would expect without further changes.
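
To make that concrete, “7 - 5” parses to exactly the tree we would otherwise have built by hand (a sketch using the domain classes from part 1):

// There is no Minus class anywhere; the parser builds this directly:
Expression<Integer> e =
  new Plus<Integer>(new Constant<Integer>(7),
    new Times<Integer>(new Constant<Integer>(-1), new Constant<Integer>(5)));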

Finally, let us implement the square function from part 3 in the parser as well. This is a bit more complex for two reasons: we want the syntax to be “<expression>^2” without supporting, e.g., “<expression>^3”, and exponentiation lives at a different level of precedence. We therefore introduce a new TopLevel production taking the place of Expression everywhere it is used (including, but not shown, in Factor and Top):

Expression < Integer > TopLevel() :
{
  int c;
  Expression < Integer > a;
}
{
  a = Expression()
  (
    "^" c = Constant()
    {
      if (c != 2)
      {
        throw new ParseException("Cannot exponentiate to powers other than 2 (got " + c + " at line " + token.beginLine + ", column " + token.beginColumn + ")");
      }
      a = new Assignment < Integer > ("a", a, new Times < Integer > (new Variable < Integer > ("a"), new Variable < Integer > ("a")));
    }
  )*
  {
    return a;
  }
}

Most of this should be relatively standard by now; we define a TopLevel as an Expression followed by any number of exponentiations. Instead of allowing any expression as the exponent, we only allow a Constant c (l. 9). We manually check that the constant is indeed 2 and throw an exception otherwise (ll. 11-14). Like before, we wrap the Expression a in a multiplication with itself using an assignment to avoid evaluating it more than once.

This can now be used as:

		evaluate("5^2");
		evaluate("(5^2)+2");
		assertThrows(ParseException.class, () -> evaluate("5^2+2"));
		assertThrows(ParseException.class, () -> evaluate("5^3"));
		assertThrows(ParseException.class, () -> evaluate("5^22"));
		evaluate("5^0x2");

It might be surprising that line 3 fails, but this is due to where we have put exponentiation in the hierarchy. We also see that cubing is disallowed, as is exponentiation to anything other than 2. Exponentiation does work with our fancy numbers, though.

Exercise: consider moving exponentiation around so “5^2+2” becomes valid. Consider where it really makes sense for it to live.

Exercise: extend the parser to also handle “++<expression>” and “<expression>++” as a shorthand for “<expression> + 1”. You cannot use the existing PLUS two times, but will need a new token PLUSPLUS for “++”. The first one is much easier to get working than the second (why?).

Exercise: extend the parser to handle function definition and application as well. Make up your own meaningful (and easily parsable) syntax.

That does it for this short appendix. I strongly recommend using a proper parser generator instead of String.split or regular expressions. Users will be happy that they don’t have to write XML/JSON/YAML files for simple tasks, and you can even make simple embedded DSLs aiding users in many ways. For a client, we are making a simple DSL that allows them to specify validation rules in almost-Dutch, which is much more natural than the XML file that was the alternative.

You can download the full parser specification and test file here. They also depend on the Continuation code developed in the first four parts (the file was slightly updated between part 4 and this appendix going live, so if you have already downloaded it, download it again; otherwise it’s the same file).
