BigLibrary

A WORA (Write Once, Run Anywhere) framework

BigLibrary is designed as a wrapper for big data programs (currently implemented for Spark). The library enables programmers to 1) customize execution for local and cluster modes, 2) reuse functions for boilerplate code, and 3) write code that is guaranteed to be deployable. The project realizes the idea of executable pipelines that are unaware of the data. BigLibrary models a big data program as a pair: 1) the actual job and 2) a test job. The actual job can be executed on a cluster using the ScriptDB. A few examples implemented with BigLibrary are given below.
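
The examples below rely on a handful of BigLibrary types that are not shown in this README (InputAndOutput, ElectricSession, SequenceFileJob). The following is a rough, hypothetical sketch of how they might fit together, included only to make the examples easier to read; the real definitions in the library may differ.

// Hypothetical sketch only; the actual BigLibrary definitions may differ.
import org.apache.spark.sql.{Dataset, SparkSession}

// Job arguments are plain case classes, so the same job can be pointed at
// local test files or at cluster paths without any code change.
case class InputAndOutput(input: String, output: String)

// The session wrapper hides whether Spark runs in local or cluster mode.
trait ElectricSession {
  def getSparkSession: SparkSession
  def text(path: String): Dataset[String]
}

// An actual job only declares how to transform its arguments; it never
// creates or configures the SparkSession itself.
trait SequenceFileJob[A] {
  def execute(argument: A)(implicit ec: ElectricSession): Unit
}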

WordCount

object WordCount extends SequenceFileJob[InputAndOutput] {

  override def execute(argument: InputAndOutput)(implicit ec: ElectricSession) = {
    val session = ec.getSparkSession
    import session.implicits._

    // Read the input file as a dataset of lines.
    val file = ec.text(argument.input)

    // Normalize to lowercase, drop punctuation, and split into words.
    val words = file
      .flatMap(_.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+"))
      .filter(_.nonEmpty) // leading whitespace would otherwise yield empty tokens

    // Count occurrences per word and write the result as tab-separated values.
    words
      .groupByKey(f => f)
      .count()
      .write
      .option("delimiter", "\t")
      .csv(argument.output)
  }
}

WordCountTest

class WordCountTest extends ElectricJobTest {

  test("wordcount test with spark") {
    // Write a small input file with three lines of text.
    val input = createFile {
      """|hello world
         |Zero world
         |Some world
         |""".stripMargin
    }
    val output = createTempPath()

    // Launch the actual job locally against the test input and a temp output path.
    launch(WordCount, InputAndOutput(input, output))

    // The job writes tab-separated "word<TAB>count" lines.
    val lines = readFilesInDirectory(output, "part")
    lines should contain("hello\t1")
    lines should contain("world\t3")
  }
}

Developers

An open-source (OSS) project, built with sbt and continuously tested on Travis CI.
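
As a sketch of the build setup, a minimal sbt definition for a project like this might look as follows; the dependency versions are illustrative assumptions, not the ones actually used here.

// Hypothetical minimal build.sbt; versions are illustrative only.
name := "biglibrary"
scalaVersion := "2.12.18"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.3.2" % Provided,
  "org.scalatest"    %% "scalatest" % "3.2.17" % Test
)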