A Scala regular expression implementation with derivative-based evaluation described in the paper Regular-expression derivatives re-examined by Scott Owens, John Reppy, and Aaron Turon.
Normalizing constructor approach inspired from this Scheme implementation
The main purpose is to provide a regex library that supports regular operations that are usually absent (e.g., intersection)
and avoids backtracking and extra-regular features (e.g., back matching) to avoid pathological cases.
While the Thompson caching NFA construction is a well-known way to achieve this for basic regex operators, the state space can get large and
complementation and intersection are expensive. It is not
necessary to construct an automaton to perform matching using derivatives, and deriving on complements and intersections
is trivial. Moreover, it can be seen that the DFA-construction
algorithm uses the RegexAST.derive
method to essentially discover all the unique derivation paths from the original expression,
which is much more work than deriving on a single input string in typical scenarios. For this reason, RE2DFA.apply
is
provided to produce a DFA from a RegexAST
but is not part of the default compilation chain in RegExpr.apply
. However,
string generation is more convenient to perform on the DFA, so it implements getStrings
which emits a Stream
of the strings
that the DFA accepts via a breadth-first search.
An overview of the library and more detailed writings about its design can be found here
- Typical *+?| operators, () for precedence, implicit concatenation
- Character classes:
- [a-zA-Z] syntax for ranges
- [^abc] for negation
- \d\D\w\W\s\Sv\V\h\H escape sequences for common classes
- Bounded repetition:
- r{n}
- r{i,}
- r{i,j}
- Intersection (&) and complementation (~) supported in regular expressions
- Compilation of regex string to an AST
- Direct matching of strings against ASTs
- derive(c: Char) can be used to rewrite the AST one input character at a time
- DFA construction from regular expression
- Matching of input strings against DFA
- GraphViz Dot representation of DFA which can be rendered with appropriate tooling
- String generation from DFA