hammerlab / spark-tests

Utilities for writing tests that use Apache Spark.




SparkSuite: a SparkContext for each test suite

Add configuration options in subclasses using sparkConf(…), cf. KryoSparkSuite:

  sparkConf(
    // Register this class as its own KryoRegistrator
    "spark.kryo.registrator" -> getClass.getCanonicalName,
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.referenceTracking" -> referenceTracking.toString,
    "spark.kryo.registrationRequired" -> registrationRequired.toString
  )
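To illustrate the general pattern (one SparkContext per suite, with a hook for extra configuration), here is a minimal sketch built only on plain Spark and ScalaTest. It is not this library's implementation: `SharedContextSuite`, `extraConf`, and `ExampleSuite` are hypothetical names, and the sketch assumes ScalaTest 3.1+ (for `AnyFunSuite`) and a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical trait (not the library's SparkSuite):
// creates one SparkContext before the suite's tests and stops it afterwards.
trait SharedContextSuite extends BeforeAndAfterAll { self: Suite =>

  // Hypothetical hook: subclasses override this to add configuration options.
  def extraConf: Seq[(String, String)] = Nil

  private var _sc: SparkContext = null
  def sc: SparkContext = _sc

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName(getClass.getSimpleName)
    extraConf.foreach { case (k, v) => conf.set(k, v) }
    _sc = new SparkContext(conf)
  }

  override def afterAll(): Unit = {
    try {
      if (_sc != null) _sc.stop()
    } finally {
      super.afterAll()
    }
  }
}

// Example suite: every test case shares the same context.
class ExampleSuite extends AnyFunSuite with SharedContextSuite {
  override def extraConf: Seq[(String, String)] =
    Seq("spark.ui.enabled" -> "false")

  test("sum of a small RDD") {
    assert(sc.parallelize(1 to 4).reduce(_ + _) == 10)
  }
}
```

Sharing one context across a suite avoids paying the context start-up cost for every test case, at the price of test cases being able to see each other's cached state.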

PerCaseSuite: SparkContext for each test case


KryoSparkSuite

A SparkSuite implementation that provides hooks for Kryo registration:

  classOf[Bar] -> new BarSerializer
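For context, pairing a class with a custom serializer looks roughly like the sketch below when written directly against Spark's `KryoRegistrator` and the Kryo 4 API bundled with Spark 2.x and 3.x. The names `Bar` and `BarSerializer` come from the snippet above, but their definitions are not shown in this README, so the single-field case class and the `ExampleRegistrator` class here are assumptions made for illustration.

```scala
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.serializer.KryoRegistrator

// Assumed shape of Bar; the real class is not shown in this README.
case class Bar(n: Int)

// Custom serializer that writes only Bar's single Int field.
class BarSerializer extends Serializer[Bar] {
  override def write(kryo: Kryo, output: Output, bar: Bar): Unit =
    output.writeInt(bar.n)

  override def read(kryo: Kryo, input: Input, cls: Class[Bar]): Bar =
    Bar(input.readInt())
}

// Hypothetical registrator; Spark instantiates the class named by
// the "spark.kryo.registrator" setting and calls registerClasses.
class ExampleRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Bar], new BarSerializer)
  }
}
```

With `spark.kryo.registrationRequired` set to `true`, serializing a class that has not been registered raises an error, which is how a test suite can catch missing registrations early.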

It is also useful to subclass this once per project, fill in that project's default Kryo registrar, and have concrete test suites extend that subclass; see hammerlab/guacamole and hammerlab/pageant for examples.

Miscellaneous RDD / Job / Stage utilities

  • rdd.Util: make an RDD with specific elements in specific partitions.
  • NumJobsUtil: verify the number of Spark jobs that have been run.
  • RDDSerialization: interface for verifying that a serialization and deserialization round trip on an RDD yields the same RDD.
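The first two utilities above can be approximated with plain Spark APIs, as in the rough sketch below. This is not the library's code: `RddTestSketch`, `makeRDD`, and `JobCounter` are hypothetical names chosen for illustration.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

object RddTestSketch {

  // Hypothetical helper: the i-th inner Seq becomes the i-th partition.
  // Parallelizing n items into n slices puts exactly one item in each slice;
  // flatMap then expands each item in place, without a shuffle.
  def makeRDD[T: ClassTag](sc: SparkContext, partitions: Seq[Seq[T]]): RDD[T] = {
    require(partitions.nonEmpty, "need at least one partition")
    sc.parallelize(partitions, partitions.size).flatMap(xs => xs)
  }

  // Hypothetical listener that counts job-start events.
  class JobCounter extends SparkListener {
    val count = new AtomicInteger(0)

    override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
      count.incrementAndGet()
    }
  }
}
```

Calling `rdd.glom().collect()` on the result of `makeRDD` returns one array per partition, which makes the layout easy to assert on. A `JobCounter` is attached with `sc.addSparkListener(counter)`; because Spark delivers listener events asynchronously, a count read immediately after an action can briefly lag behind the true number of jobs.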