ReactiveSax is a tiny, non-validating, reactive SAX parser for Scala. It does not implement the
org.xml.sax.Parser interface (as do SAX1 parsers), nor the
org.xml.sax.XMLReader interface (as do SAX2 parsers). Instead, it extends the
java.io.Writer class, and provides a
write method. When you write character data to the parser, it will be parsed character-by-character by the parser's state machine, and the provided
org.xml.sax.ContentHandler will receive SAX events as appropriate. After you write the data, nothing else happens until you write more data, at which point parsing will pick up where it left off.
ReactiveSax arose from a fundamental flaw with SAX that makes it incompatible with the reactive programming model. The idea of SAX is great for reactive programming in that it is a push model - events are pushed to the next thing in the pipeline, which does some processing and pushes stuff to the next thing, etc. Unfortunately, the
XMLReader interface defines a
parse method, which calls
parse up the chain until the actual parser is reached; that method takes an
InputSource parameter, which must provide an
InputStream. And therein lies the flaw - an InputStream is a blocking I/O, which is fundamentally a blocking pull. This goes against the push model of SAX and makes it unsuitable for reactive programming using
Task. If a bunch of parsers block all the threads waiting for input, and their ability to pull input relies on other bits of your program pushing data (into, let's say, a
java.io.PipedWriter) then you'll have a deadlock on your hands.
Frustrated with this small piece of bad decision making (which arose from a failure to predict non-blocking I/O as a thing), I came up with ReactiveSax.
This package was written to address my current needs only. There's a lot it doesn't do, and so to say it complies with any particular standard would be completely incorrect.
- It does no validation of any kind
- It will happily continue trying to parse a malformed document (except in cases of certain catastrophic parse errors)
- It does nothing with DTDs or other declarations like
- It doesn't process any entities besides
", and valid numeric entities (except to provide a
skippedEntity(name)event to the
- It produces no errors for an
ErrorHandler. In the catastrophic cases mentioned above, it throws an exception.
- It does no resolution of any external anything. It will never make a network request for any reason, nor attempt to resolve anything.
Pros / Features
- It is based on a deterministic finite state automaton
- So it's very fast
- And it needs very little memory
- It supports namespaces
- It only parses while it's being pushed input, so parsing can be paused at any time
- In other words, parsing will never block or wait for anything
The companion object
SAXPushParser has two
apply methods. Don't call the constructor of
To create an instance, if you already have the
ContentHandler you'll be using:
val handler = new MyContentHandler val sax = ReactiveSax(handler) sax.write("<xml></xml>")
This makes the
To create an instance, if you're planning to set the
ContentHandler later (this is discouraged):
val sax = ReactiveSax() //... sax.setContentHandler(handler) sax.write("<xml></xml>")
Internally, this creates an identity
XMLFilter and that it passes SAX events to, and
setContentHandler calls the same on that
XMLFilter. So it gives you some flexibility to set up your
ContentHandler later on, but you must remember to do so, and you have no guarantee that you won't be switching handlers around willy-nilly causing problems.
reactivesax is licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
These questions aren't frequently asked. In fact, nobody has ever asked me a question about this project. But these are some questions that you might be having if you're reading this README.
Why don't you implement the
XMLReaderinterface? How am I supposed to use this?
A: As mentioned above, you should call
writein order to parse some data. Implementing
XMLReaderwould require implementing the
parse(InputSource)method, which is fundamentally blocking/pull.
A: No - whenever data is written, it will try to process it. You shouldn't write data from multiple threads simultaneously. If you find this happening, you should be sequencing your
Futures more carefully. Typically, a thread that's getting character data from somewhere (a computation, or an event-driven input source, etc) will be writing that data out to the parser in order to pipe character input into a SAX pipeline.