gm-spacagna / sparkz

sparkz

A proof-of-concept extension to the amazing Spark framework for better functional programming. The project aims to extend, and in a few cases re-implement, some of the functionalities and classes in the Apache Spark framework.

The main motivation is to make the APIs of some Machine Learning components statically typed, to provide the functional structures that some classes are missing (Broadcast variables, data validation pipelines, utility classes...) and to work around the unnecessary limitations imposed by private fields and methods. The project also introduces a set of utility functions, implicits and tutorials that show the power, conciseness and elegance of the Spark framework when combined with a fully functional design.

Sonatype dependency

Maven:

<dependency>
  <groupId>com.github.gm-spacagna</groupId>
  <artifactId>sparkz_2.10</artifactId>
  <version>0.1.0</version>
</dependency>

sbt:

libraryDependencies += "com.github.gm-spacagna" % "sparkz_2.10" % "0.1.0"

Current features

WIP

  • Functor for Spark Broadcast
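
To give an idea of what a functor over a broadcast value could look like, here is a minimal sketch in plain Scala. It is not the sparkz implementation and does not depend on Spark: `ReadOnlyHandle`, `MappedHandle` and `HandleOps` are hypothetical names introduced only for illustration. `ReadOnlyHandle` stands in for `org.apache.spark.broadcast.Broadcast[T]`, which exposes its payload through `value`. The sketch assumes that `map` should be lazy, so the function is applied on first read rather than triggering a new broadcast.

```scala
// Stand-in for a read-only broadcast handle (assumption: in Spark this role is
// played by org.apache.spark.broadcast.Broadcast[T] and its `value` method).
trait ReadOnlyHandle[A] extends Serializable {
  def value: A
}

// Hypothetical wrapper, not part of Spark or sparkz: pairs a handle with a
// function that is applied only when `value` is first read.
final class MappedHandle[A, B](underlying: ReadOnlyHandle[A], f: A => B)
    extends ReadOnlyHandle[B] {
  // Evaluated lazily and cached; marked transient so the cached result is not
  // serialized together with the wrapper.
  @transient lazy val value: B = f(underlying.value)
}

object ReadOnlyHandle {
  // Wraps a plain value, mimicking what a broadcast call would return.
  def apply[A](a: A): ReadOnlyHandle[A] =
    new ReadOnlyHandle[A] { val value: A = a }

  // Functor-style `map`, added as an extension method.
  implicit class HandleOps[A](val handle: ReadOnlyHandle[A]) extends AnyVal {
    def map[B](f: A => B): ReadOnlyHandle[B] = new MappedHandle(handle, f)
  }
}

object FunctorSketch extends App {
  val stopWords = ReadOnlyHandle(Set("a", "the"))

  // Derive a new handle without touching the original payload.
  val stopWordCount = stopWords.map(_.size)
  assert(stopWordCount.value == 2)

  // The two functor laws, checked on this one example.
  val f: Set[String] => Int = _.size
  val g: Int => String = n => "n=" + n
  assert(stopWords.map(identity).value == stopWords.value)
  assert(stopWords.map(f).map(g).value == stopWords.map(f andThen g).value)
}
```

With a real Spark broadcast, the same idea would wrap the `Broadcast[T]` handle instead of the stand-in trait, so that a derived value can be used inside closures without re-broadcasting the transformed data.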

Limitations

The original Spark implementations are intentionally not fully functional, in order to avoid overloading the garbage collector and to allow more efficient, mutable data structures. This project is only a proof of concept whose goal is to inspire developers, data scientists and engineers to think about their designs in purely functional terms; it does not guarantee better performance. You are strongly encouraged to tailor and tune each component to your specific needs.

Related projects