hammerlab / spark-tests

Utilities for writing tests that use Apache Spark.

spark-tests

SparkSuite: a SparkContext for each test suite

Add configuration options in subclasses using sparkConf(…), cf. KryoSparkSuite:

sparkConf(
  // Register this class as its own KryoRegistrator
  "spark.kryo.registrator" -> getClass.getCanonicalName,
  "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
  "spark.kryo.referenceTracking" -> referenceTracking.toString,
  "spark.kryo.registrationRequired" -> registrationRequired.toString
)
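For readers less familiar with these keys, the sketch below sets the same four properties directly on a plain SparkConf, outside of any test suite. It assumes spark-core is on the classpath. The object name KryoConfSketch, the registrator class name, and the two boolean values are placeholders chosen for illustration, not values taken from this library.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Placeholder values; in KryoSparkSuite these come from the suite itself.
  val referenceTracking = false
  val registrationRequired = true

  // loadDefaults = false keeps JVM system properties out of the picture.
  val conf: SparkConf =
    new SparkConf(false)
      .setMaster("local[2]")
      .setAppName("kryo-conf-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Hypothetical registrator class name, for illustration only.
      .set("spark.kryo.registrator", "com.example.MyRegistrator")
      .set("spark.kryo.referenceTracking", referenceTracking.toString)
      .set("spark.kryo.registrationRequired", registrationRequired.toString)
}
```

With spark.kryo.registrationRequired set to true, Kryo rejects any class that has not been registered, which is what makes the setting useful in tests.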

PerCaseSuite: a SparkContext for each test case

KryoSparkSuite

SparkSuite implementation that provides hooks for kryo-registration:

register(
  classOf[Foo],
  "org.foo.Bar",
  classOf[Bar] -> new BarSerializer
)
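As a point of comparison, registering classes by hand in plain Spark means implementing KryoRegistrator yourself. The sketch below is not this library's implementation; it only shows the kind of boilerplate that a register(…) hook can save. Foo and FooRegistrator are hypothetical names, scala.Tuple2 stands in for a class registered by name, and the custom-serializer form is left out.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical class that a test would send through Kryo.
case class Foo(n: Int)

class FooRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register by class reference.
    kryo.register(classOf[Foo])
    // Register by fully-qualified name, e.g. when the class is not accessible at compile time.
    kryo.register(Class.forName("scala.Tuple2"))
  }
}
```

Spark picks up such a class when its fully-qualified name is given as the value of spark.kryo.registrator.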

It is also useful to subclass KryoSparkSuite once per project, fill in that project's default Kryo registrar there, and have concrete tests extend that subclass; see hammerlab/guacamole and hammerlab/pageant for examples.

Miscellaneous RDD / Job / Stage utilities

  • rdd.Util: make an RDD with specific elements in specific partitions.
  • NumJobsUtil: verify the number of Spark jobs that have been run.
  • RDDSerialization: interface for verifying that an RDD is unchanged after a serialization and deserialization round trip.
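
The round-trip property that RDDSerialization checks can be illustrated without Spark. The sketch below is an in-memory analogue using plain Java serialization on ordinary values. It is not the library's implementation, and RoundTripSketch and its methods are hypothetical names introduced here.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object RoundTripSketch {
  // Serialize a value to bytes with plain Java serialization.
  def serialize(value: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    bytes.toByteArray
  }

  // Read a value back from bytes produced by `serialize`.
  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // True if the value equals itself after a write-then-read round trip.
  def survivesRoundTrip[T <: AnyRef](value: T): Boolean =
    deserialize[T](serialize(value)) == value
}
```

A test would then assert, for example, that RoundTripSketch.survivesRoundTrip(Vector(1, 2, 3)) holds. An RDD-level check applies the same idea to the elements written out and read back.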