Spark Records is a data processing pattern with an associated lightweight, dependency-free framework for Apache Spark v2+ that enables:
Bulletproof data processing with Spark
Your jobs will never unpredictably fail midway due to data transformation bugs. Spark Records gives you predictable failure control through instant data quality checks, performed on metrics collected automatically during job execution, without any additional querying.
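The failure-control idea can be sketched in plain Scala, independent of the library's actual API. All names below (`JobMetrics`, `QualityCheck`, the threshold parameter) are illustrative assumptions, not Spark Records identifiers: metrics are tallied while records are processed (in Spark, typically via accumulators), and a cheap threshold check at the end decides whether the job commits or aborts.

```scala
// Illustrative sketch only: these names are assumptions, not the
// Spark Records API. In a real job, counts like these would come from
// accumulators populated during execution, with no extra querying.
final case class JobMetrics(inputs: Long, errors: Long, warnings: Long)

object QualityCheck {
  // Abort the job if more than maxErrorRatio of the inputs failed.
  def check(m: JobMetrics, maxErrorRatio: Double = 0.01): Unit = {
    val ratio = if (m.inputs == 0) 0.0 else m.errors.toDouble / m.inputs
    require(
      ratio <= maxErrorRatio,
      f"error ratio $ratio%.4f exceeds allowed maximum $maxErrorRatio%.4f")
  }
}
```

With a 1% threshold, `QualityCheck.check(JobMetrics(1000000, 50, 200))` passes, while a 5% error rate throws before any bad output is committed.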
Automatic row-level structured logging
Exceptions generated during job execution are automatically associated with the data that caused the exception, down to nested exception causes and full stack traces. If you need to reprocess data, you can trivially and efficiently choose to only process the failed inputs.
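The row-level logging pattern amounts to wrapping each output row in an envelope that carries its source and any issues raised while producing it. A minimal sketch, with all names assumed rather than taken from the library's schema:

```scala
// Hypothetical envelope, not the actual Spark Records record schema.
final case class Issue(message: String, stackTrace: String, causes: Seq[String])

final case class Record[A, Src](data: Option[A], source: Src, issues: Seq[Issue]) {
  // A record with no data and at least one issue is a failed input.
  def isError: Boolean = data.isEmpty && issues.nonEmpty
}

object Reprocess {
  // Selecting only the failed inputs for reprocessing is then a plain filter,
  // which Spark can also push down when records are stored columnar.
  def failedSources[A, Src](records: Seq[Record[A, Src]]): Seq[Src] =
    records.filter(_.isError).map(_.source)
}
```

Because the envelope travels with the data, reprocessing only the failures needs no join back to the original input.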
Lightning-fast root cause analysis
Get answers to any questions related to exceptions or warnings generated during job execution directly using SparkSQL or your favorite Spark DSL. Would you like to see the top 5 issues encountered during job execution with example source data and the line in your code that caused the problem? You can.
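Since issues are stored as ordinary columns, "top 5 issues with example source data" is just an aggregation. Here is a sketch over an in-memory collection, with the equivalent SparkSQL in a comment; the field and view names are assumptions, not the library's actual column names:

```scala
// Hypothetical issue row; field names are assumptions.
final case class LoggedIssue(message: String, sampleSource: String)

object TopIssues {
  // Roughly equivalent SparkSQL, assuming an `issues` view with these columns:
  //   SELECT message, count(*) AS n, first(sampleSource) AS example
  //   FROM issues GROUP BY message ORDER BY n DESC LIMIT 5
  def top(issues: Seq[LoggedIssue], n: Int = 5): Seq[(String, Int, String)] =
    issues.groupBy(_.message).toSeq
      .map { case (msg, group) => (msg, group.size, group.head.sampleSource) }
      .sortBy { case (_, count, _) => -count }
      .take(n)
}
```

The same query runs unchanged over millions of records when expressed against the issue columns with SparkSQL or the DataFrame DSL.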
Spark Records has been tested with petabyte-scale data at Swoop. The library was extracted out of Swoop's production systems to share with the Spark community.
Just add the following to your `libraryDependencies` in SBT:

```scala
resolvers += Resolver.bintrayRepo("swoop-inc", "maven")
libraryDependencies += "com.swoop" %% "spark-records" % "<version>"
```
You can find all released versions here.
Contributions and feedback of any kind are welcome.
Build the docs microsite:

```
sbt "project docs" makeMicrosite
```

Run the docs microsite locally (run under

```
jekyll serve -b /spark-records
```