A library you can include in your Spark job to validate the counters and perform operations on success.
This software should be considered pre-alpha.
Why you should validate counters
Maybe you are really lucky and you never have intermitent outages or bugs in your code.
If you have accumulators for things like records processed or number of errors, its really easy to write bounds for these. Even if you don't have custom counters you can use Spark's built in metrics (bytes read, time, etc.) and by looking at historic values we can establish reasonable bounds. This can help catch jobs which fail to process some of your records. This is not a replacement for unit or integration testing.
How spark validation works
We store all of the metrics from each run along with all of the accumulators you pass in.
If a run is successful it will run your on success handler. If you just want to mark the run as success you can specify a file for spark validator to touch.
How to write your validation rules
How to build
sbt - Remember when it was called the simple build tool?
How to use
At the start of your Spark program once you have constructed your spark context call
import com.holdenkarau.spark_validator ... val rules = List( new AbsoluteValueRule(counter = "recordsRead", min=Some(1000), max=None). ...) val vc = new ValidationConf(counterPath, jobName, firstTime, rules) val vl = new Validation(vc) ... validator.validate()