CircleCICircleCI Sonatype Nexus (Releases)

Regex-Deriv

A Scala regular expression implementation with derivative-based evaluation described in the paper Regular-expression derivatives re-examined by Scott Owens, John Reppy, and Aaron Turon.

Normalizing constructor approach inspired from this Scheme implementation

Purpose

The main purpose is to provide a regex library that supports regular operations that are usually absent (e.g., intersection) and avoids backtracking and extra-regular features (e.g., back matching) to avoid pathological cases. While the Thompson caching NFA construction is a well-known way to achieve this for basic regex operators, the state space can get large and complementation and intersection are expensive. It is not necessary to construct an automaton to perform matching using derivatives, and deriving on complements and intersections is trivial. Moreover, it can be seen that the DFA-construction algorithm uses the RegexAST.derive method to essentially discover all the unique derivation paths from the original expression, which is much more work than deriving on a single input string in typical scenarios. For this reason, RE2DFA.apply is provided to produce a DFA from a RegexAST but is not part of the default compilation chain in RegExpr.apply. However, string generation is more convenient to perform on the DFA, so it implements getStrings which emits a Stream of the strings that the DFA accepts via a breadth-first search.

Documentation

An overview of the library and more detailed writings about its design can be found here

Features

  • Typical *+?| operators, () for precedence, implicit concatenation
  • Character classes:
    • [a-zA-Z] syntax for ranges
    • [^abc] for negation
    • \d\D\w\W\s\Sv\V\h\H escape sequences for common classes
  • Bounded repetition:
    • r{n}
    • r{i,}
    • r{i,j}
  • Intersection (&) and complementation (~) supported in regular expressions
  • Compilation of regex string to an AST
  • Direct matching of strings against ASTs
    • derive(c: Char) can be used to rewrite the AST one input character at a time
  • DFA construction from regular expression
    • Matching of input strings against DFA
    • GraphViz Dot representation of DFA which can be rendered with appropriate tooling
  • String generation from DFA