alonsodomin / sbt-spark   0.6.0

Simple SBT plugin to configure Spark applications

Scala versions: 2.12 2.10
sbt plugins: 1.x 0.13

SBT Spark

This is a very simple plugin that adds all the boilerplate needed to configure a Spark application in SBT, so you do not have to.

Getting started

Add the following line to your project/plugins.sbt file:

addSbtPlugin("com.github.alonsodomin" % "sbt-spark" % "x.y.z")

Then enable the plugin in your build.sbt file:

enablePlugins(SparkPlugin)

Write your Spark app:

import org.apache.spark._

object SimpleSparkApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("Simple Spark Application")
      .set("spark.logConf", "false")

    val sc = new SparkContext(conf)
    val count = sc.parallelize(Seq("Hello", "from", "Spark"), 1).count()
    println(s"Count result: $count")

    sc.stop()
  }

}

And run it!

sbt run

If you want to package it so you can run it on your Spark cluster, use the assembly command:

sbt assembly

Your application package should now be under the target folder of your local copy. You're all set and ready to start writing your awesome Spark application!

Migrating to 0.4.0

In versions prior to 0.4.0, the plugin was enabled by default just by adding the plugin dependency to the project. This did not play well with multi-module setups, since you had to disable it explicitly in every module that did not require it, including the root project.

Starting with 0.4.0, the plugin needs to be enabled explicitly. For single-module projects this means adding one line (as stated in the Getting started section). For multi-module setups it lets you choose which modules require the Spark features and which don't.
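
As an illustration of the multi-module case, here is a minimal sketch of a build.sbt in which only one module enables the plugin. The module names (common, sparkJob) and directory names are hypothetical; only enablePlugins(SparkPlugin) comes from this README.

```scala
// build.sbt (sketch; module and directory names are hypothetical)

// Root project only aggregates the modules; the plugin is not enabled here.
lazy val root = (project in file("."))
  .aggregate(common, sparkJob)

// Plain library module: no Spark settings are applied to it.
lazy val common = (project in file("common"))

// Only this module opts in to the Spark dependencies and packaging setup.
lazy val sparkJob = (project in file("spark-job"))
  .dependsOn(common)
  .enablePlugins(SparkPlugin)
```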

Usage

Choosing the Spark version:

By default the plugin will use Spark 2.4.3. If you want to use a different version just put the following in your build.sbt:

sparkVersion := "1.6.3"
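
Since the Spark modules are resolved with %% (see the equivalent libraryDependencies further down), the project's Scala version presumably has to be one that the chosen Spark release was published for. A small sketch, assuming you want Spark 1.6.3, which was published for Scala 2.10 and 2.11:

```scala
// build.sbt (sketch)

// Spark 1.6.x artifacts exist for Scala 2.10 and 2.11, so pick one of those.
scalaVersion := "2.10.7"

// Overrides the plugin's default Spark version.
sparkVersion := "1.6.3"
```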

Adding Spark components (modules) to my build

By default the plugin will only put spark-core in your classpath. If you want to use any other additional Spark module just use the following syntax in your build.sbt file:

sparkComponents += "sql"

or

sparkComponents ++= Seq("sql", "mllib")

This is equivalent to having the following in your build.sbt file:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion.value % Provided,
  "org.apache.spark" %% "spark-sql"   % sparkVersion.value % Provided,
  "org.apache.spark" %% "spark-mllib" % sparkVersion.value % Compile
)

As you can see, the plugin will also handle the dependency scope properly, meaning that the sql component will be put in the Provided scope whilst the mllib one will be packaged with your app.

Exclude Spark transitive dependencies

Transitive dependencies brought by any of the Spark modules added to the project via the sparkComponents setting can be excluded using the sparkExclusionRules setting:

sparkExclusionRules += ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12")

That setting applies to every Spark component you have configured, which removes the boilerplate of adding similar rules to each of your Spark dependencies.
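
To show the boilerplate this avoids, here is a rough sketch of what the same exclusion would look like if you declared the Spark dependencies by hand. It uses sbt's standard exclude method on each module; whether the plugin produces exactly this internally is an assumption.

```scala
// build.sbt (sketch of the manual equivalent, not the plugin's actual implementation)
libraryDependencies ++= Seq(
  // The same exclusion has to be repeated on every Spark module.
  ("org.apache.spark" %% "spark-core" % sparkVersion.value % Provided)
    .exclude("org.slf4j", "slf4j-log4j12"),
  ("org.apache.spark" %% "spark-sql" % sparkVersion.value % Provided)
    .exclude("org.slf4j", "slf4j-log4j12")
)
```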

Override scope for individual components

Using the sparkComponentScope key you can configure the actual dependency scope of each of the Spark modules:

sparkComponentScope += ("sql" -> Compile)

That puts the spark-sql module in the Compile scope, which makes it part of the final assembly jar.
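
Putting the two keys together, a sketch of the settings and the dependency declaration they would presumably be equivalent to. The commented-out block is inferred from the description above, not taken from the plugin's source.

```scala
// build.sbt (sketch)
sparkComponents += "sql"
sparkComponentScope += ("sql" -> Compile)

// Assumed equivalent without the plugin keys:
// libraryDependencies ++= Seq(
//   "org.apache.spark" %% "spark-core" % sparkVersion.value % Provided,
//   "org.apache.spark" %% "spark-sql"  % sparkVersion.value % Compile
// )
```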

Publishing artifacts

Publishing assembly jars to repositories like Nexus or Artifactory is not usually recommended. However, in several places it is considered acceptable for Spark applications, since they are usually not the type of JAR file that others will depend upon.

This plugin allows you to do so using either the publishLocal or publish SBT tasks, without any additional configuration. You should see that an artifact with a spark classifier has been published to your repository. If you are interested in customizing this, you can use the following:

sparkClassifier := "myclassifier"

That will generate an assembly JAR file with the myclassifier suffix.
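
For publish (as opposed to publishLocal), sbt itself still needs to know where to publish. A minimal sketch using standard sbt keys; the repository name, URL and credentials file location are hypothetical placeholders.

```scala
// build.sbt (sketch; repository name and URL are hypothetical)

// Standard sbt setting: the target repository for the publish task.
publishTo := Some("Example Releases" at "https://repo.example.com/releases")

// Standard sbt setting: read repository credentials from a local file.
credentials += Credentials(Path.userHome / ".sbt" / ".credentials")

// sbt-spark setting: classifier used for the published assembly JAR.
sparkClassifier := "myclassifier"
```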

Customizing your application package

sbt-spark uses sbt-assembly with some sensible defaults. To get a package that you can deploy in your Spark cluster, just run sbt assembly from the command line (as stated before).

If you need to customize your package, refer to sbt-assembly's website. All of its configuration keys are available as if you were using that plugin yourself.
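
As an example of such a customization, the sketch below adjusts the merge strategy and skips tests during assembly. It uses the key syntax documented for sbt-assembly releases before 1.0 (the `in assembly` style); the file patterns are illustrative only, and newer sbt-assembly versions may prefer the slash syntax instead.

```scala
// build.sbt (sketch; the matched file names are illustrative)

// Do not run the test suite as part of the assembly task.
test in assembly := {}

assemblyMergeStrategy in assembly := {
  // Concatenate configuration files instead of failing on duplicates.
  case "application.conf" => MergeStrategy.concat
  // Keep the first copy of any duplicated HTML file.
  case PathList(ps @ _*) if ps.last.endsWith(".html") => MergeStrategy.first
  // Everything else falls back to the previously configured strategy.
  case other =>
    val previousStrategy = (assemblyMergeStrategy in assembly).value
    previousStrategy(other)
}
```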

FAQ

What does this plugin do?

Very little, in fact. Spark applications all share the same setup boilerplate:

  • Add Spark dependencies and place them in the provided scope.
  • Configure sbt-assembly to package your application in an uberjar, which in itself means setting up the proper merge strategy and removing Scala's stdlib from the final artifact.
  • Re-configure SBT's run task so you still can run your app locally with the correct classpath.

It's a PITA to repeat this all over again every time you want to start a brand new Spark application, so sbt-spark does it for you. Simple as that.
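
For comparison, here is a sketch of roughly what that boilerplate looks like when written by hand. It follows the setup commonly documented for sbt-assembly releases before 1.0 and assumes sbt-assembly is already added in project/plugins.sbt; it is not a copy of what sbt-spark does internally.

```scala
// build.sbt (sketch of the manual setup, assuming sbt-assembly < 1.0 is on the build)

// 1. Spark dependencies in the provided scope.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3" % Provided

// 2. Leave Scala's standard library out of the uberjar.
assemblyOption in assembly :=
  (assemblyOption in assembly).value.copy(includeScala = false)

// 3. Make `sbt run` use the compile classpath, which still contains
//    the provided dependencies.
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated
```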

How is this different from sbt-spark-packages?

Well, it's not really that different. This could even be considered a slimmed-down version of the same "utility", just catering to a different audience.

sbt-spark-packages is meant for developers who want to write extensions on top of Spark, that is, packages that other Spark applications can use. It is therefore very focused on giving you a good starting point plus a platform to distribute your packages to other users.

sbt-spark is meant to be a boilerplate-free starting point for writing Spark applications. Its main audience is Spark developers who write end-of-the-world Spark applications. sbt-spark could also be useful for tooling around writing Spark applications (e.g. test harness libraries), which still requires the same starting point but does not fit into the Spark package concept.

I'm a library author targeting different versions of Spark, can this plugin support "cross Spark compiling"?

Not yet, but it might be doable. It's also questionable whether this is the right project for it.
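
In the meantime, one possible workaround (not a plugin feature) is to read the Spark version from a JVM system property, so the same build can be run once per Spark version. The property name spark.version is a hypothetical choice.

```scala
// build.sbt (workaround sketch; the property name is hypothetical)

// Falls back to the plugin's documented default when the property is not set.
// Example invocation: sbt -Dspark.version=1.6.3 test
sparkVersion := sys.props.getOrElse("spark.version", "2.4.3")
```

Keep in mind that the project's Scala version must also be one that the selected Spark release was published for.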

MIT License

Copyright 2017 Antonio Alonso Dominguez

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.