A Pluggable DMN wrapper API for Spark at row level
Although Quality offers very high performance DQ and rule engines, its focus is more on auditability and speed than generality. DMN's focus is generality, sacrificing auditability (the user must provide it) and performance relative to the more narrowly focused Quality.
The aim of this project is to enable a Spark-first runtime for DMN processing, allowing customisation of the runtime whilst abstracting from the underlying engine.
An API providing versioned serialization of a configurable DMN file set and pluggable expression handling.
- You bring the .dmn files to the engine directly
- dmn.ContextPath interface allows creation of dmn.Contexts with which to run your engine of choice
- ContextProvider Expressions are configured as pairs of input column to context path
- The implementing DMN Runtime then processes this dmn.Context with either a DecisionService or all DecisionServices (implementation dependent)
- Finally, a ResultProcessor Expression converts a notional dmn.Result to the correct output type
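The flow above can be sketched as a set of interfaces. This is a hedged illustration of how the pieces relate, not the actual dmn-4-spark signatures: only the type names mentioned in the list come from the source, everything else is an assumption.

```java
// Illustrative signatures only; the real dmn-4-spark API will differ.
interface DmnContext {}                        // the engine's input for one row
interface DmnResult {}                         // the engine's notional output

interface ContextPath {                        // creates dmn.Contexts for your engine
    void set(DmnContext context, Object value);
}

// A ContextProvider expression pairs an input column (evaluated by Spark)
// with a context path; the pair shown here is a hypothetical holder.
record ContextProviderPair(String columnName, ContextPath path) {}

interface DmnRuntime {                         // the implementing DMN runtime
    // processes the context with either one DecisionService or all of them,
    // depending on the implementation
    DmnResult evaluate(DmnContext context);
}

interface ResultProcessor<T> {                 // converts dmn.Result to the output type
    T process(DmnResult result);
}
```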
The interfaces, and implementations, are necessarily built using Spark internal APIs, both to squeeze out serialization overhead AND to manage rule load time.
An actual DMN engine: this must be provided via an API implementation; currently kogito-4-spark is planned.
It also has no opinion on DMN version support: the user of the library is abstracted from the engine choice, but that engine choice remains the determining factor for DMN version support.
Depend upon an implementation, e.g. kogito-4-spark. Using the same approach as shim and Quality, each runtime must be specified.
If you typically build on OSS but deploy to other runtimes the approach is therefore (for maven):
```xml
<properties>
    <kogito4SparkVersion>0.1.3</kogito4SparkVersion>
    <kogito4SparkTestPrefix>3.4.1.oss_</kogito4SparkTestPrefix>
    <kogito4SparkRuntimePrefix>13.1.dbr_</kogito4SparkRuntimePrefix>
    <sparkShortVersion>3.4</sparkShortVersion>
    <scalaCompatVersion>2.12</scalaCompatVersion>
</properties>

<dependencies>
    <dependency>
        <groupId>com.sparkutils.</groupId>
        <artifactId>kogito-4-spark_${kogito4SparkTestPrefix}${sparkShortVersion}_${scalaCompatVersion}</artifactId>
        <version>${kogito4SparkVersion}</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>com.sparkutils</groupId>
        <artifactId>kogito-4-spark_${kogito4SparkRuntimePrefix}${sparkShortVersion}_${scalaCompatVersion}</artifactId>
        <version>${kogito4SparkVersion}</version>
        <scope>compile</scope>
    </dependency>
</dependencies>
```
The "." at the end of the group id on the test dependency (using kogito4SparkTestPrefix) is not a mistake; it allows two versions of the same library to be used for different scopes. It is not advised to develop on a different version of Spark/Scala than you deploy to.
Please refer to the implementation documentation for supported runtimes.
The performance of DMN is dependent on its engine, but there are certain general limitations that need to be called out:
- The DMN Engines do not, unlike Quality, operate as part of Spark
- Compilation, even if provided, will not be inlined with WholeStageCodeGen
- Expressions used by FEEL cannot be optimised by Spark
- Serialisation overhead to the implementation's input types may be unavoidable (dmn-4-spark mitigates this as far as possible, but Strings, Structs etc. will require conversion)
- The startup cost of the DMN Engine is inescapable (it may not matter for your usecase)
- Startup time is required for each executor core (this is only likely significant for small datasets / streaming with tight SLAs)
- dmn-4-spark attempts to mitigate this by caching implementing Engine runtimes (provided they are thread safe)
- dynamic clusters will recreate this runtime for each core used (Spark partition)
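The runtime-caching mitigation described above follows a common JVM pattern: hold the heavyweight engine in a process-wide cache so each executor JVM builds it at most once per model. The sketch below illustrates that pattern only, under the stated thread-safety assumption; the class and method names are hypothetical, not dmn-4-spark internals.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of per-JVM engine caching, assuming the cached engine is thread safe.
// The Object values stand in for whatever the implementing runtime
// (e.g. kogito-4-spark) actually constructs at startup.
final class EngineCache {
    private static final ConcurrentHashMap<String, Object> CACHE =
            new ConcurrentHashMap<>();

    // computeIfAbsent guarantees at most one build per model key per JVM,
    // so Spark tasks on every executor core share the same instance rather
    // than paying the engine startup cost per partition.
    static Object engineFor(String modelKey, Supplier<Object> build) {
        return CACHE.computeIfAbsent(modelKey, k -> build.get());
    }
}
```

Note that on dynamic clusters newly provisioned executors are fresh JVMs, so each still pays the build cost once, which matches the per-core/per-partition caveat above.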
The below information is for illustrative purposes only. The test case, whilst representative of Quality rules and based on existing DMN usage, should be viewed as directional and was created only as a way to gauge base differences in simple rule processing with Kogito. The more complicated a DMN set becomes, the more likely its performance will benefit from Rete-style optimisations, and conversely the more pleasant editing rules in DMN may be over Quality rules (even after sparkutils/quality#74 or similar would be supported in Quality). It is not advised to make planning decisions or tool choices based on this test alone; rather, evaluate based on the actual rules themselves. Put simply: measure, and YMMV.
That said, the simple test case (15 rules, each with 2 boolean tests and a total of 10 common subexpressions) used in quality_performance_tests found kogito-4-spark to be around 23x slower:
This report view is against "json baseline in codegen" (plain Spark creating an array of boolean output), "json no forceEval in codegen compile evals false - extra config" (Quality's typical audited output) and "json dmn codegen" for the kogito-4-spark run saving to an array of booleans.
NB 1: Although the average here is 0.65ms per row with Kogito, other far more complex tests have shown 20ms per row; again, YMMV and measurement is essential. NB 2: The overhead of saving a Quality-style audit trail is significant; this would further slow down DMN-related times, e.g. in this report:
The bottom orange line is the default Spark returning an array of booleans, the middle green line is Quality's results and the top blue line is Spark with the simulated audit trail.