Spark Records

Spark Records is a data processing pattern with an associated lightweight, dependency-free framework for Apache Spark v2+ that enables:

Bulletproof data processing with Spark
Your jobs will never unpredictably fail midway due to data transformation bugs. Spark Records gives you predictable failure control through instant data quality checks performed on metrics that are collected automatically during job execution, without any additional querying.
Automatic row-level structured logging
Exceptions generated during job execution are automatically associated with the data that caused the exception, down to nested exception causes and full stack traces. If you need to reprocess data, you can trivially and efficiently choose to only process the failed inputs.
Lightning-fast root cause analysis
Get answers to any questions related to exceptions or warnings generated during job execution directly using SparkSQL or your favorite Spark DSL. Would you like to see the top 5 issues encountered during job execution with example source data and the line in your code that caused the problem? You can.
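To make the pattern behind these three features more concrete, here is a minimal sketch in plain Scala collections, with no Spark and no Spark Records dependency. Every name in it (`Envelope`, `Issue`, `build`, `errorRate`, `topIssues`) is hypothetical and chosen for illustration only; it is not the library's API. It shows each input being wrapped together with either its result or the exception it caused, an error-rate metric that a job could check before committing output, and a grouping of issues by exception type with one example source row each.

```scala
import scala.util.{Failure, Success, Try}

// Illustrative sketch of the record-envelope pattern.
// All names are hypothetical; this is NOT the Spark Records API.
object RecordPatternSketch {

  // The issue captured for one input: exception type, message, cause chain.
  final case class Issue(kind: String, message: String, causes: List[String])

  // Pairs the source row with either the transformed data or the issue.
  final case class Envelope[In, Out](source: In, data: Option[Out], issue: Option[Issue]) {
    def isError: Boolean = issue.isDefined
  }

  // Collects the nested causes of an exception, outermost first.
  private def causeChain(t: Throwable): List[String] =
    Iterator.iterate(t.getCause)(_.getCause).takeWhile(_ != null).map(_.toString).toList

  // Runs the transformation and records a non-fatal failure instead of throwing.
  def build[In, Out](source: In)(transform: In => Out): Envelope[In, Out] =
    Try(transform(source)) match {
      case Success(out) => Envelope(source, Some(out), None)
      case Failure(e) =>
        val issue = Issue(e.getClass.getName, String.valueOf(e.getMessage), causeChain(e))
        Envelope(source, None, Some(issue))
    }

  // Fraction of records that carry an issue; a job can compare it to a threshold.
  def errorRate[In, Out](records: Seq[Envelope[In, Out]]): Double =
    if (records.isEmpty) 0.0 else records.count(_.isError).toDouble / records.size

  // The n most frequent issue kinds, each with its count and one example source row.
  def topIssues[In, Out](records: Seq[Envelope[In, Out]], n: Int): Seq[(String, Int, In)] =
    records
      .collect { case Envelope(src, _, Some(issue)) => (issue.kind, src) }
      .groupBy { case (kind, _) => kind }
      .map { case (kind, rows) => (kind, rows.size, rows.head._2) }
      .toSeq
      .sortBy { case (_, count, _) => -count }
      .take(n)

  // Example data: two of the five inputs cannot be parsed as integers.
  val inputs: Seq[String] = Seq("1", "2", "x", "4", "y")
  val records: Seq[Envelope[String, Int]] = inputs.map(s => build(s)(_.toInt))

  // Inputs to feed into a reprocessing run: only the rows that failed.
  val failedInputs: Seq[String] = records.filter(_.isError).map(_.source)

  def main(args: Array[String]): Unit = {
    println(s"error rate: ${errorRate(records)}")
    topIssues(records, 5).foreach(println)
    println(s"to reprocess: $failedInputs")
  }
}
```

In a real job the same idea would apply to rows of a Dataset rather than a local `Seq`, and the issue summary would be a SparkSQL query over the saved records. The sketch keeps everything in memory so that it can be run and inspected on its own.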

Spark Records has been tested with petabyte-scale data at Swoop. The library was extracted out of Swoop's production systems to share with the Spark community.

See the documentation for more information or watch the Spark Summit talk (slides).

Installation

Add the following resolver and dependency to your SBT build:

resolvers += Resolver.bintrayRepo("swoop-inc", "maven")

libraryDependencies += "com.swoop" %% "spark-records" % "<version>"

You can find all released versions here.

Community

Contributions and feedback of any kind are welcome.

Spark Records is maintained by Sim Simeonov and the team at Swoop.

Special thanks to Reynold Xin and Michael Armbrust for many interesting conversations about better ways to use Spark.

Development

Build the docs microsite

sbt "project docs" makeMicrosite

Run the docs microsite locally (from the target/site folder)

jekyll serve -b /spark-records
