Generating Test Data using Regular Expressions with Java

Regular expressions allow us to describe families of strings in a user-friendly and intuitive way, if you’re that kind of person. For example, a regular expression can describe valid post codes, phone numbers, names, addresses, etc.

A regular expression can be translated into a finite automaton. This automaton can then be used to check whether a given string matches the expression. This functionality is implemented in most programming languages. Here we’ll look at Java. In Java, we can check whether a string maches a regular expression using the below code snippet:

Pattern p = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*");
Matcher m = p.matcher("abc_123");
if (m.matches()) { }

There are simpler ways to do it, but we’ll need the full complexity above later. First, we translate the pattern into a finite automaton (using Pattern.compile) and then we execute the finite automaton on a given input. A pattern is stateless and can be shared among threads, while a matcher contains information about the concrete match.

What if we want to go the other way, though? Instead of testing whether a string matches a pattern, we want to generate a set of strings matching a pattern. This makes sense for generating test data; maybe we have a service which requires data matching a given pattern, and we want to generate guaranteed valid random test data, or maybe we need to generate 1000 plausible but random phone numbers/addresses/UUIDs.

From the computer science perspective, this is simple. Any property is decidable about finite automata, also generating random strings. I’ll not go into the details, but it is a couple of relatively simple graph coloring algorithms (mark all states from which an accepting state is reachable – this is basically the prefix automaton – including an interval of the minimum and maxiumum number of characters needed/allowed to get there, then start in the initial state and do a random walk ensuring to only stay on states from which an accepting state is reachable and ensuring that you stay within the length interval resulting in an output of the desired length). It may sound complicated, but this is first or at best second year basic computer science curriculum.

In Java, this is not so simple. The Pattern class exhibits nothing of the internal automaton structure (it has an entire internal hierarchy essentially doing this). There are libraries that claim to be able to do (probably) the above (Xeger and Generex), but neither has been updated in years and both depend on a third-party automaton library, which means that the regexp syntax they support is different from the one supported by the standard classes. Not really viable options.

I have also written my own regular expression library, and could easily implement the above algorithm, but this also suffers from the problem of having another syntax. I could expand the syntax to be compatible, but to be honest I can’t really be bothered.

Then, how do we make this work using the standard classes only? The simplest option would be to generate random strings, check if they match the regular expression and if so return them. This will work in the realms of theoretical computer science, but in the real world, this is bound to take far too long: if we want to generate random alphanumeric identifiers as in the regular expression above, generating random strings will be inefficient, as we only allow approximately 25% of the possible characters (63 out of 255). That means that to generate a 10 character string, we would have less than one in one million chance of succeeding, so we would expect to generate and test over one million random 10 character strings to get one we could use. This is bound to be extremely slow and will only get worse with longer strings.

Luckily, we can do a bit better. In the theoretical algorithm using finite automata, we were looking at valid prefixes of matches, maybe it would be possible to do something similar using the Java libraries? The good news is that the answer is yes.

The Matcher class above has more methods than just matches. Advanced users of the class will know of the groups commands that make up the bulk of the methods, but it also have two very obscure methods that will come in handy for us: hitEnd and requireEnd.

The first, hitEnd, indicates whether we hit the end of the string during the last match. For example, for a regular expression “a.*” tested on the string “abc“, we would get a valid match, but would also get that we did not hit the end; the regular expression is satisfied as long as we start with an a and cannot change afterwards. For the regular expression “[a-zA-Z][a-zA-Z0-9_]*” above, which matches valid identifiers, we would always match to the end: the entire input matters, and for an input “abd” we would hit the end of the input, because it matters wherther we continue with a “1” or a “%”. In other, words hitEnd indicates whether a positive match could change to a negative by adding more characters or a negative match could change to a positive.

The second method, requireEnd, indicates whether hitting the end was crucial for a negative match. This method makes no sense for positive matches. For example, the regular expression “[a-zA-Z][a-zA-Z0-9_]*” applied to “abc%def” would not match and return false on requireEnd as soon as we used the “%” we prevented matching further. The regular expression “abc” applied for “abcde” would also not match and yield false on requireEnd: as soon as we went past abc, we prevented a match in the future. Applied to “ab” however, we would get not match, but true on requireEnd: there is a possible expansion of the string that would make it a possible match (adding a “c” to get “abc“).

This allows up to partition prefixes of potential matches into 4 categories:

matcheshitEndrequireEnd
truetruePotential prefix match, some expansions are valid
truefalseDefinite prefix match, any expansion is valid
falsetrueNot a prefix match, some expansions are valid
falsefalseNot a prefix match, no expansions are valid

By exploiting this information, we can detect which prefixes have a greater chance of leading to valid matches. Results from the (true, false, -) category are guaranteed to lead to matches, results from (true, true, -) and (false, -, true) have potential to lead to matches, and matches from the (false, -, false) category can never lead to matches.

Now, we can change our algorithm to generate results one character at a time, only expanding ones that have the potential to lead to matches. I Java, this looks like:

protected String generate(Pattern p, String prefix, int min, int max) {
	if (min <= 0 && max >= 0 && p.matcher(prefix).matches())
		return prefix;
	if (max <= 0)
		return null;

	List<Character> candidates = new ArrayList<>();
	generateCandidates(candidates, p, prefix);
	Collections.shuffle(candidates);
	return testCandidates(p, prefix, min, max, candidates);
}

protected void generateCandidates(List<Character> candidates, Pattern p, String prefix) {
	for (char c = 0; c <= 255; c++) {
		Matcher m = p.matcher(prefix + c);
		if (m.matches() || m.hitEnd()) {
			candidates.add(c);
		}
	}
}

protected String testCandidates(Pattern p, String prefix, int min, int max, List<Character> candidates) {
	for (char candidate : candidates) {
		String match = generate(p, prefix + candidate, min - 1, max - 1);
		if (match != null) {
			return match;
		}
	}
	return null;
}

We first test if we already have a match (ll. 2-3). We supply a minimum and maximum length of the match, and if the supplies prefix already matches and the minimum requires remaining length is 0 or less, we return it. We similarly check if we have already lost the chance of a match (ll. 4-5): if the maximum length is 0 or less, we cannot possible expand the current match to something valid, and just bail out.

We then start generating candidatesz for expansion (ll. 7-8, 13-20). We run thru all legal characters (l. 14) and generate a matcher for the expansion of the prefix with the generated character (l. 15). If we are not in the (false, -, false) category where no expansion to a valid match is possible, we add the character to the list of candidates (l. 17).

After generating matches, we randomise them (l. 9). We do this after the generation instead of branching out immediately to ensure that it is possible to generate all possible matches (were we to instead branch out immediately we found a match, we would generate valid matches, but not random ones).

After this, we actually test the individual candidates (ll. 10, 22-30). We iterate over the (now randomised) candidates, and try to recursively generate a match using the prefix expanded with the new character with a remaining length of one shorter (l. 24). As soon as a match is found it is returned (ll. 25-27).

Running this code, will generate proper examples and relatively efficiently even. We can also imagine various alternatives that might be more or less efficient at the cost of making the results less random.

For example, in the code we just use the three categories that allow expansion to a valid match indiscriminately. Perhaps it would be beneficial to instead prefer matches that are guaranteed to yield matches? I tested this, and this was not the case for my examples.

We also notice that we check all 256 characters even though most of them are just garbage; for example all characters in the interval 128-255 are just drawing symbols and accented characters, which are unlikely to be used in most regular expressions. Similarly, characters 0-31 are mostly junk characters unlikely to show up in most real life examples. We could cut time significantly if we didn’t have to generate all 256 candidates for each character. I tested this and for the pattern to generate identifiers, I get a performance of approximately 0.381 ms by preferring alphanumeric characters compared to 1.439 ms by testing all candidates randomly when generating results of length at least 25. This even comes without a cost to the quality of the generated matches since they are all alphanumeric anyway (the underscore is part of the set of characters my implementation considers alphanumeric).

For some regular expressions, the performance increase does come at the cost of completely avoiding some legal matches. In an example, I use the regular expression “[ \t\r\n]*([+-]?(0|[1-9][0-9]*))[ \t\r\n]*(/[ \t\r\n]*([+-]?(0|[1-9][0-9]*)))?(([,;<][ \t\r\n]*([+-]?(0|[1-9][0-9]*))[ \t\r\n]*(/[ \t\r\n]*([+-]?(0|[1-9][0-9]*)))?)*)”. It looks complicated, but it just matches numbers (matched by “([+-]?(0|[1-9][0-9]*))”), either alone os as fractions, separated by one either , or ; or < and a bit of whitespace. By generating matches preferring alphanumeric characters, I get matches like “9739670008928383957619515” and “2637170476003567477336636”. These are certainly legal, but are probably not typical as they are just very large numbers.

I can improve on this by relaxing what I consider first, to either just visible characters (32-127) or all 7 bit characters (0-127). This changes performance for this particular regular expression from 1.830 ms (compared to random time of 6.662 ms) to 2.580 ms or 3.261 ms. Now I get matches like “650 ;5<41/0,5278,3 ;7<8,9” which are much more realistic. This, of course, just postpones the problem: this regular expression relies on non-alphanumeric printable 7-bit characters, but not on accented characters, and it would be possible to make a similar expansion relying on accents that would never generate them.

Another observation is that real strings are not random. They contain mostly alphanumeric characters and then a bit of junk to spice things up. We can correct for this by adding alphanumeric characters to the set multiple times, increasing the probability they are chosen. This comes at a slight performance cost (0.875 ms vs 0.832 ms for pattern “sourcePort=(.+)”) but generates matches like “sourcePort=i9uæxyù¯NÁ¥co” instead of “sourcePort=ˆíí 4Åç %À¥÷”.

A final idea is to branch early instead of generating all candidates. This improves speed significantly, for example, the identifier expression completes in just 0.115 ms (compared to 0.381 ms for preferring alphanumeric and 1.439 ms for pure random). This method generates patterns purely random, and extending it to prioritize alphanumeric characters efficiently is less obvious than for the original algorithm, but can be done.

The final algorithm for generating strings according to a regular expression using the built-in classes from Java thus becomes:

protected String generate(Pattern p, String prefix, int min, int max) {
	if (min <= 0 && max >= 0 && p.matcher(prefix).matches())
		return prefix;
	if (max <= 0)
		return null;

	List<Character> candidates = new ArrayList<>(allCharacters);
	Collections.shuffle(candidates);
	for (char candidate : candidates) {
		Matcher m = p.matcher(prefix + candidate);
		if (m.matches() || m.hitEnd()) {
			String match = generate(p, prefix + candidate, min - 1, max - 1);
			if (match != null) {
				return match;
			}
		}
	}
	return null;
}

The code seems simpler, but is a little less customisable (that’s the reason the original example used 3 mutually recursive functions). We make sure to randomise the search order initially (l. 8) to make sure we do not build in any bias.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.