dylemma / xml-spac

Handle streaming XML data with declarative, composable parsers

API Docs: spac-core | xml-spac | json-spac | xml-spac-javax | xml-spac-fs2-data | json-spac-jackson | json-spac-fs2-data

Streaming Parser Combinators is a Scala library for building stream consumers in a declarative style, specialized for tree-like data types like XML and JSON.

It delegates to a backend of your choice (javax.xml.stream or fs2-data-xml for XML; Jackson or fs2-data-json for JSON) to obtain a stream of events, and gives you the ability to create event consumers that are:

  • Declarative - You write what you want to get, not how to get it
  • Immutable - Parsers may be shared and reused without worry
  • Composable - Combine and transform parsers to handle complex data structures
  • Fast - With minimal abstraction to get in the way, speed rivals any hand-written handler
  • Streaming - Parse huge XML/JSON documents from events, not a DOM

You can jump into a full tutorial, or check out the examples, but here's a taste of how you'd write a parser for a relatively complex blog post XML structure:

val PostParser = (
  XmlParser.attr("date").map(LocalDate.parse(_, commentDateFormat)),
  Splitter.xml(* \ "author").as[Author].parseFirst,
  Splitter.xml(* \ "stats").as[Stats].parseFirst,
  Splitter.xml(* \ "body").text.parseFirst,
  Splitter.xml(* \ "comments" \ "comment").as[Comment].parseToList
).mapN(Post)

Get it!

Add (your choice of) the following to your build.sbt file:

libraryDependencies ++= Seq(
  "io.dylemma" %% "spac-core" % "0.9.1",         // core classes like Parser and Transformer
   
  "io.dylemma" %% "xml-spac" % "0.9.1",          // classes for XML-specific parsers
  "io.dylemma" %% "xml-spac-javax" % "0.9.1",    // XML parser backend using javax.xml.stream
  "io.dylemma" %% "xml-spac-fs2-data" % "0.9.1", // XML parser backend using fs2-data-xml

  "io.dylemma" %% "json-spac" % "0.9.1",         // classes for JSON-specific parsers
  "io.dylemma" %% "json-spac-jackson" % "0.9.1", // JSON parser backend using the Jackson library
  "io.dylemma" %% "json-spac-fs2-data" % "0.9.1" // JSON parser backend using fs2-data-json
)

Main Concepts

SPaC is about handling streams of events, possibly transforming that stream, and eventually consuming it.

  • Parser[In, Out] consumes a stream of In values, eventually producing an Out. Parsers are Applicative with respect to the Out type.
  • Transformer[In, Out] transforms a stream of In values to a stream of Out values.
  • Splitter[In, Context] splits a stream of In events by selecting "substreams", e.g. only the events associated with some child element in the XML, or for a specific JSON field. Each substream is identified by a Context value. By attaching a Context => Parser[In, Out] function to each substream, you can create a Transformer[In, Out].

Instances of Transformer, Parser, and Splitter are immutable, meaning they can safely be reused and shared at any time, even between multiple threads. It's common to define an implicit val fooParser: XmlParser[Foo] = /* ... */ so that methods which resolve parsers implicitly (like Splitter's .as[Foo]) can pick it up.
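
As a minimal sketch of both conventions (the Author class and attribute names here are made up for illustration, and cats tuple syntax is assumed to be in scope):

// illustrative only: combines two attribute parsers applicatively into an implicit parser
import cats.syntax.apply._
import io.dylemma.spac.xml._

case class Author(id: String, name: String)

implicit val authorParser: XmlParser[Author] = (
  XmlParser.attr("id"),
  XmlParser.attr("name")
).mapN(Author.apply)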

Example

XmlParser.attr("foo") is a parser which will find the "foo" attribute of the first element it sees.

<!-- file: elem.xml -->
<elem foo="bar" />

// assumes the spac XML imports plus a backend support import are in scope
// (see "Running Parsers" below for the available backends)
import java.io.File

val xml = new File("elem.xml")
val elemFooParser: XmlParser[String] = XmlParser.attr("foo")
val result: String = elemFooParser.parse(xml)
assert(result == "bar")

Suppose you have some XML with a bunch of <elem foo="..."/> elements and you want the "foo" attribute from each of them. This is a job for a Splitter. You write an XmlSplitter somewhat like an XPath, to describe how to get to each element that you want to parse.

With the XML below, we want to parse the <root> element, since it represents the entire file. We'll write our splitter by using the * matcher (representing the current element), then selecting <elem> elements that are its direct children, using * \ "elem".

<!-- file: root.xml -->
<root>
  <elem foo="bar" />
  <elem foo="baz" />
</root>

val xml = new File("root.xml")
val splitter: XmlSplitter[Unit] = Splitter.xml(* \ "elem")
val transformer: XmlTransformer[String] = splitter.joinBy(elemFooParser)

val rootParser: XmlParser[List[String]] = transformer.parseToList
val root: List[String] = rootParser.parse(xml)
assert(root == List("bar", "baz"))

Note that a Splitter has a handful of "attach a parser" methods. The one you use will depend on whether the parser is available implicitly, and whether you care about the Context value for each substream.

  • splitter.map(context => getParser(context))
  • splitter.joinBy(parser)
  • splitter.as[Out]
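
For instance, using the splitter and elemFooParser values from the example above (an illustrative sketch, with types annotated for clarity):

// attach an explicit parser (as used in the example above)
val viaJoin: XmlTransformer[String] = splitter.joinBy(elemFooParser)

// attach a parser chosen per-substream from its Context value (Unit here, so it's ignored)
val viaMap: XmlTransformer[String] = splitter.map(_ => elemFooParser)

// attach an implicitly-resolved parser; this requires an implicit XmlParser[Author] in scope,
// like the authorParser defined in "Main Concepts" above
val viaAs: XmlTransformer[Author] = Splitter.xml(* \ "author").as[Author]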

Running Parsers

To run a Parser[In, Out], use either its parse or parseF method.

These methods accept any source that belongs to the Parsable typeclass: parse uses blocking operations (better suited to sources like a String or a java.io.Reader/InputStream), while parseF suspends the evaluation in an F[_] effect type.

The following source types are supported by default:

  • Iterable[In]
  • cats.data.Chain[In]
  • fs2.Stream[F, In] (as long as F belongs to the cats.effect.Sync typeclass, or F = fs2.Fallible)
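
For example, an fs2.Stream of XML events can be consumed effectfully via parseF; a sketch, where the event stream is just a placeholder:

import cats.effect.IO
import fs2.Stream
import io.dylemma.spac.xml._

def xmlEvents: Stream[IO, XmlEvent] = ???   // placeholder event source

val fooAttr: XmlParser[String] = XmlParser.attr("foo")
val result: IO[String] = fooAttr.parseF(xmlEvents)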

Additional source types are supported via imports from a specific "support module", where you choose a "parser backend" to wrap/convert the underlying source to a stream that the SPaC parser can understand:

  • String (provided by the various support modules) i.e. some raw XML or JSON
  • java.io.File (provided by the Javax/Jackson support modules for XML/JSON respectively)
  • cats.effect.Resource[F, java.io.InputStream] (provided by the Javax/Jackson support modules for XML/JSON respectively)
  • cats.effect.Resource[F, java.io.Reader] (provided by the Javax/Jackson support modules for XML/JSON respectively)
  • fs2.Stream[F, Char] (provided by the fs2-data support modules)
  • fs2.Stream[F, Byte] (provided by the fs2-data support modules, as long as you import an appropriate implicit from fs2.data.text)
  • fs2.Stream[F, fs2.data.xml.XmlEvent] (provided by the fs2-data-xml support module)
  • fs2.Stream[F, fs2.data.json.Token] (provided by the fs2-data-json support module)

Note that the "support modules" are expressed in code as a pair of objects with the naming convention FooSupport and FooSource, where FooSupport defines implicits that contribute to the Parsable typeclass, and FooSource is a utility for constructing fs2.Stream[F, io.dylemma.spac.xml.XmlEvent] or fs2.Stream[F, io.dylemma.spac.json.JsonEvent] from some underlying source.
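
For example, parsing a java.io.File or a raw String with the javax backend would look something like the sketch below; the object name JavaxSupport is an assumption based on the naming convention above, so check the xml-spac-javax API docs for the actual name:

// assumed import path, following the FooSupport convention described above
import io.dylemma.spac.xml._
import io.dylemma.spac.xml.JavaxSupport._

val fromFile: String = XmlParser.attr("foo").parse(new java.io.File("elem.xml"))
val fromString: String = XmlParser.attr("foo").parse("""<elem foo="bar" />""")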

Applying Transformers

A Transformer[In, Out] is applied by:

  • .transform to convert an Iterator[In] to an Iterator[Out]
  • .toPipe[F] to convert an fs2.Stream[F, In] to an fs2.Stream[F, Out]
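
For instance, reusing the transformer from the splitter example above (the event sources here are placeholders):

import cats.effect.IO
import fs2.Stream

def eventIterator: Iterator[XmlEvent] = ???   // placeholder event source
def eventStream: Stream[IO, XmlEvent] = ???   // placeholder event source

val strings: Iterator[String] = transformer.transform(eventIterator)
val stringStream: Stream[IO, String] = eventStream.through(transformer.toPipe[IO])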

Under the Hood

Parser and Transformer both act as factories for their respective Handler traits. Whenever you consume a stream with a Parser's parse method, or transform a stream with a Transformer, a new Handler instance will be created and used to run the stream processing logic.

Handlers are internally-mutable, whereas the Parsers/Transformers that create them are not.

A Parser's Handler must respond to each input by either returning a result or a new handler representing the continuation of the parser logic. Once the stream of inputs ends, the handler must produce a result. Since Handlers are internally-mutable, it's acceptable (and even preferable) for the Handler to simply update its internal state and return a reference to itself instead of constructing a new handler.

object Parser {
  trait Handler[-In, +Out] {
    def step(in: In): Either[Out, Handler[In, Out]]
    def finish(): Out
  }
}
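
For illustration only (this is not part of the library), a hand-written handler that follows this contract might look like:

// a sketch: a Handler that sums Ints and returns early once the total reaches a limit
import io.dylemma.spac.Parser

class SumUntil(limit: Int) extends Parser.Handler[Int, Int] {
  private var total = 0
  def step(in: Int): Either[Int, Parser.Handler[Int, Int]] = {
    total += in
    if (total >= limit) Left(total) // produce the result, ending this parser early
    else Right(this)                // mutate in place and continue with the same handler
  }
  def finish(): Int = total         // the input stream ended before reaching the limit
}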

A Transformer's Handler may respond to each input with any combination of: updating its internal state, emitting outputs to a downstream handler, or signaling upstream that it no longer wants to receive inputs. It may also emit additional outputs in response to the end of the input stream, and it may transform exceptions thrown by a downstream handler.

object Transformer {
  trait Handler[-In, +Out] {
    def push(in: In, out: HandlerWrite[Out]): Signal
    def finish(out: HandlerWrite[Out]): Unit
    def bubbleUp(err: Throwable): Nothing = throw err
  }
}
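
A rough sketch of such a handler, assuming that HandlerWrite's push method returns a Signal and that Signal.Continue is the value that requests more input (both are assumptions; check the API docs):

// illustrative only: a filtering transformer handler
import io.dylemma.spac.{HandlerWrite, Signal, Transformer}

class FilterHandler[A](keep: A => Boolean) extends Transformer.Handler[A, A] {
  def push(in: A, out: HandlerWrite[A]): Signal =
    if (keep(in)) out.push(in)  // forward the value; propagate the downstream's signal
    else Signal.Continue        // drop the value and keep consuming
  def finish(out: HandlerWrite[A]): Unit = ()  // nothing extra to emit at end-of-stream
}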