This post has 2154 words. Reading it will take approximately 11 minutes.

Regular expressions allow us to describe families of strings in a user-friendly and intuitive way, if you’re that kind of person. For example, a regular expression can describe valid post codes, phone numbers, names, addresses, etc.

A regular expression can be translated into a finite automaton. This automaton can then be used to check whether a given string matches the expression. This functionality is implemented in most programming languages. Here we’ll look at Java. In Java, we can check whether a string

Pattern p = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*"); Matcher m = p.matcher("abc_123"); if (m.matches()) { }

There are simpler ways to do it, but we’ll need the full complexity above later. First, we translate the pattern into a finite automaton (using Pattern.compile) and then we execute the finite automaton on a given input. A pattern is stateless and can be shared among threads, while a matcher contains information about the concrete match.

What if we want to go the other way, though? Instead of testing whether a string matches a pattern, we want to generate a set of strings matching a pattern. This makes sense for generating test data; maybe we have a service which requires data matching a given pattern, and we want to generate guaranteed valid random test data, or maybe we need to generate 1000 plausible but random phone numbers/addresses/UUIDs.

From the computer science perspective, this is simple. Any property is decidable about finite automata, also generating random strings. I’ll not go into the details, but it is a couple of relatively simple graph coloring algorithms (mark all states from which an accepting state is reachable – this is basically the prefix automaton – including an interval of the minimum and maxiumum number of characters needed/allowed to get there, then start in the initial state and do a random walk ensuring to only stay on states from which an accepting state is reachable and ensuring that you stay within the length interval resulting in an output of the desired length). It may sound complicated, but this is first or at best second year basic computer science curriculum.

In Java, this is not so simple. The Pattern class exhibits nothing of the internal automaton structure (it has an entire internal hierarchy essentially doing this). There are libraries that claim to be able to do (probably) the above (Xeger and Generex), but neither has been updated in years and both depend on a third-party automaton library, which means that the regexp syntax they support is different from the one supported by the standard classes. Not really viable options.

I have also written my own regular expression library, and could easily implement the above algorithm, but this also suffers from the problem of having another syntax. I could expand the syntax to be compatible, but to be

Then, how do we make this work using the standard classes only? The simplest option would be to generate random strings, check if they match the regular expression and if so return them. This will work in the realms of theoretical computer science, but in the real world, this is bound to take far too long: if we want to generate random alphanumeric identifiers as in the regular expression above, generating random strings will be inefficient, as we only allow approximately 25% of the possible characters (63 out of 255). That means that to generate a 10 character string, we would have less than one in one million chance of succeeding, so we would expect to generate and test over one million random 10 character strings to get one we could use. This is bound to be extremely slow and will only get worse with longer strings.

Luckily, we can do a bit better. In the theoretical algorithm using finite automata, we were looking at valid prefixes of matches, maybe it would be possible to do something similar using the Java libraries? The good news is that the answer is yes.

The Matcher class above has more methods than just matches. Advanced users of the class will know of the

The first, hitEnd, indicates whether we hit the end of the string during the last match. For example, for a regular expression “a.*” tested on the string “

The second method,

This allows up to partition prefixes of potential matches into 4 categories:

matches | hitEnd | requireEnd | |

true | true | – | Potential prefix match, some expansions are valid |

true | false | – | Definite prefix match, any expansion is valid |

false | – | true | Not a prefix match, some expansions are valid |

false | – | false | Not a prefix match, no expansions are valid |

By exploiting this information, we can detect which prefixes have a greater chance of leading to valid matches. Results from the (true, false, -) category are guaranteed to lead to matches, results from (true, true, -) and (false, -, true) have potential to lead to matches, and matches from the (false, -, false) category can never lead to matches.

Now, we can change our algorithm to generate results one character at a time, only expanding ones that have the potential to lead to matches. I Java, this looks like:

protected String generate(Pattern p, String prefix, int min, int max) { if (min <= 0 && max >= 0 && p.matcher(prefix).matches()) return prefix; if (max <= 0) return null; List<Character> candidates = new ArrayList<>(); generateCandidates(candidates, p, prefix); Collections.shuffle(candidates); return testCandidates(p, prefix, min, max, candidates); } protected void generateCandidates(List<Character> candidates, Pattern p, String prefix) { for (char c = 0; c <= 255; c++) { Matcher m = p.matcher(prefix + c); if (m.matches() || m.hitEnd()) { candidates.add(c); } } } protected String testCandidates(Pattern p, String prefix, int min, int max, List<Character> candidates) { for (char candidate : candidates) { String match = generate(p, prefix + candidate, min - 1, max - 1); if (match != null) { return match; } } return null; }

We first test if we already have a match (ll. 2-3). We supply a minimum and maximum length of the match, and if the supplies prefix already matches and the minimum requires remaining length is 0 or less, we return it. We similarly check if we have already lost the chance of a match (ll. 4-5): if the maximum length is 0 or less, we cannot possible expand the current match to something valid, and just bail out.

We then start generating candidatesz for expansion (ll. 7-8, 13-20). We run thru all legal characters (l. 14) and generate a matcher for the expansion of the prefix with the generated character (l. 15). If we are not in the (false, -, false) category where no expansion to a valid match is possible, we add the character to the list of candidates (l. 17).

After generating matches, we

After this, we actually test the individual candidates (ll. 10, 22-30). We iterate over the (now

Running this code, will generate proper examples and relatively efficiently even. We can also imagine various alternatives that might be more or less efficient at the cost of making the results less random.

For example, in the code we just use the three categories that allow expansion to a valid match indiscriminately. Perhaps it would be beneficial to instead prefer matches that are guaranteed to yield matches? I tested this, and this was not the case for my examples.

We also notice that we check all 256 characters even though most of them are just garbage; for

For some regular expressions, the performance increase does come at the cost of completely avoiding some legal matches. In an example, I use the regular expression “[ \t\r\n]*([+-]?(0|[1-9][0-9]*))[ \t\r\n]*(/[ \t\r\n]*([+-]?(0|[1-9][0-9]*)))?(([,;<][ \t\r\n]*([+-]?(0|[1-9][0-9]*))[ \t\r\n]*(/[ \t\r\n]*([+-]?(0|[1-9][0-9]*)))?)*)”. It looks complicated, but it just matches numbers (matched by “([+-]?(0|[1-9][0-9]*))”), either alone os as fractions, separated by one either , or ; or < and a bit of whitespace. By generating matches preferring alphanumeric characters, I get matches like “9739670008928383957619515” and “2637170476003567477336636”. These are certainly legal, but are probably not typical as they are just very large numbers.

I can improve on this by relaxing what I consider first, to either just visible characters (32-127) or all 7 bit characters (0-127). This changes performance for this particular regular expression from 1.830 ms (compared to random time of 6.662 ms) to 2.580 ms or 3.261 ms. Now I get matches like “650 ;5<41/0,5278,3 ;7<8,9” which are much more realistic. This, of course, just postpones the problem: this regular expression relies on non-alphanumeric printable 7-bit characters, but not on accented characters, and it would be possible to make a similar expansion relying on accents that would never generate them.

Another observation is that real strings are not random. They contain mostly alphanumeric characters and then a bit of junk to spice things up. We can correct for this by adding alphanumeric characters to the set multiple times, increasing the probability they are chosen. This comes at a slight performance cost (0.875 ms vs 0.832 ms for pattern “sourcePort=(.+)”) but generates matches like “sourcePort=i9uæxyù¯NÁ¥co” instead of “sourcePort=íí 4ÅÃ§ %À¥÷”.

A final idea is to branch early instead of generating all candidates. This improves speed significantly, for example, the identifier expression completes in just 0.115 ms (compared to 0.381 ms for preferring alphanumeric and 1.439 ms for pure random). This method generates patterns purely random, and extending it to prioritize alphanumeric characters efficiently is less obvious than for the original algorithm, but can be done.

The final algorithm for generating strings

protected String generate(Pattern p, String prefix, int min, int max) { if (min <= 0 && max >= 0 && p.matcher(prefix).matches()) return prefix; if (max <= 0) return null; List<Character> candidates = new ArrayList<>(allCharacters); Collections.shuffle(candidates); for (char candidate : candidates) { Matcher m = p.matcher(prefix + candidate); if (m.matches() || m.hitEnd()) { String match = generate(p, prefix + candidate, min - 1, max - 1); if (match != null) { return match; } } } return null; }

The code seems

Time person of the year 2006, Nobel Peace Prize winner 2012.