This is a very simple plugin focused on adding all the boilerplate that you need to configure a Spark application in SBT so you do not have to.
Add the following line to your project/plugins.sbt file:
addSbtPlugin("com.github.alonsodomin" % "sbt-spark" % "x.y.z")
Then enable the plugin in your build.sbt file:
enablePlugins(SparkPlugin)
Write your Spark app:
import org.apache.spark._

object SimpleSparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("Simple Spark Application")
      .set("spark.logConf", "false")
    val sc = new SparkContext(conf)
    val count = sc.parallelize(Seq("Hello", "from", "Spark"), 1).count()
    println(s"Count result: $count")
    sc.stop()
  }
}
And run it!
sbt run
If you want to package it so that you can run it on your Spark cluster, use the assembly command:
sbt assembly
Your application package should now be under the target folder of your local copy. You're all set and ready to start writing your awesome Spark application!
In versions prior to 0.4.0, the plugin was enabled by default just by adding the plugin dependency to the project. This did not play well with multi-module setups, since the plugin had to be disabled explicitly in every module that did not require it, including the root project.
Starting with 0.4.0, the plugin needs to be enabled explicitly. This means adding a single line to single-module projects (as stated in the Getting Started section) and lets users of multi-module setups choose which modules require the Spark features and which don't.
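As an illustration of the multi-module case, the build below enables the plugin on a single module only. This is a minimal sketch: the module names and directories (common, spark-job) are hypothetical, and only enablePlugins(SparkPlugin) and sparkComponents come from this README.

```scala
// build.sbt: hypothetical multi-module layout

// The root project only aggregates the modules; it does not need Spark.
lazy val root = (project in file("."))
  .aggregate(common, sparkJob)

// A plain Scala module without any Spark dependencies.
lazy val common = (project in file("common"))

// The only module that opts in to the Spark features.
lazy val sparkJob = (project in file("spark-job"))
  .enablePlugins(SparkPlugin)
  .dependsOn(common)
  .settings(
    sparkComponents += "sql"
  )
```

With this layout nothing has to be disabled on root or common, since the plugin is never enabled there.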
By default the plugin will use Spark 2.4.3. If you want to use a different version, just put the following in your build.sbt:
sparkVersion := "1.6.3"
By default the plugin will only put spark-core in your classpath. If you want to use any additional Spark module, just use the following syntax in your build.sbt file:
sparkComponents += "sql"
or
sparkComponents ++= Seq("sql", "mllib")
This is equivalent to having the following in your build.sbt file:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion.value % Provided,
  "org.apache.spark" %% "spark-sql"   % sparkVersion.value % Provided,
  "org.apache.spark" %% "spark-mllib" % sparkVersion.value % Compile
)
As you can see, the plugin will also handle the dependency scope properly, meaning that the sql component will be put in the Provided scope whilst the mllib one will be packaged with your app.
Transitive dependencies brought in by any of the Spark modules added to the project via the sparkComponents setting can be excluded using the sparkExclusionRules setting:
sparkExclusionRules += ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12")
That setting applies to all the Spark components you have configured, which removes the boilerplate of having to add similar rules to each of your Spark dependencies.
Using the sparkComponentScope key you can configure the actual dependency scope of each of the Spark modules:
sparkComponentScope += ("sql" -> Compile)
That will put the spark-sql module in the Compile scope (and therefore make it part of the final assembly JAR).
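Putting the settings from this section together, a single-module build.sbt might look like the sketch below. The project name and Scala version are assumptions made for illustration; the Spark-related keys are the ones described above.

```scala
// build.sbt: a sketch combining the settings described in this section
enablePlugins(SparkPlugin)

name         := "my-spark-app" // hypothetical project name
scalaVersion := "2.12.8"       // assumption: a Scala version that Spark 2.4.3 is published for

// Spark version and the extra modules to pull in besides spark-core
sparkVersion := "2.4.3"
sparkComponents ++= Seq("sql", "mllib")

// Exclude a transitive dependency from every configured Spark component
sparkExclusionRules += ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12")

// Override the scope of one component so it ends up in the assembly JAR
sparkComponentScope += ("sql" -> Compile)
```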
Although it is not usually recommended to publish assembly JARs to repositories like Nexus or Artifactory, in several places it is considered acceptable for Spark applications, since they are usually not the type of JAR file that others will depend upon.
This plugin will allow you to do so using either the publishLocal or publish SBT tasks, without any additional configuration. You should see that an artifact with a spark classifier has been published to your repository. If you are interested in customizing this, you can use the following:
sparkClassifier := "myclassifier"
That will generate an assembly JAR file with the myclassifier suffix.
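As in any sbt build, the publish task still needs to know which repository to publish to; that is regular sbt configuration rather than something specific to this plugin. A minimal sketch, in which the repository name and URL are hypothetical placeholders:

```scala
// build.sbt: hypothetical target repository for `sbt publish`
publishTo := Some("Example Releases" at "https://repo.example.com/releases")

// Optional: change the classifier of the published assembly JAR
sparkClassifier := "myclassifier"
```

publishLocal needs no publishTo setting, since it writes to your local Ivy repository.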
sbt-spark uses sbt-assembly with some sensible defaults. To get a package that you can deploy in your Spark cluster, just run sbt assembly from the command line (as stated before).
If you need to customize your package, refer to sbt-assembly's website. All of its configuration keys are available as if you were using the plugin yourself.
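For instance, a custom merge strategy can be layered on top of whatever is already configured. This is only a sketch: the conflicting file chosen here is an arbitrary example, and the slash syntax assumes sbt 1.1 or later (older builds would write `assemblyMergeStrategy in assembly` instead).

```scala
// build.sbt: add one custom rule and defer to the existing strategy otherwise
assembly / assemblyMergeStrategy := {
  // Example conflict: keep the first copy of this duplicated file
  case PathList("META-INF", "io.netty.versions.properties") => MergeStrategy.first
  // Anything else falls back to the previously configured strategy
  case other =>
    val previous = (assembly / assemblyMergeStrategy).value
    previous(other)
}
```

Falling back to the previous value, rather than replacing the whole function, keeps the defaults that are already in place.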
Very little, in fact. Spark applications all have the same setup boilerplate:
- Add Spark dependencies and place them in the provided scope.
- Configure sbt-assembly to package your application in an uberjar, which in itself means setting up the proper merge strategy and removing Scala's stdlib from the final artifact.
- Re-configure SBT's run task so you can still run your app locally with the correct classpath.
It's a PITA to repeat this all over again every time you want to start a brand new Spark application, so sbt-spark does it for you. Simple as that.
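For comparison, the manual setup listed above might look roughly like the following in a build that does not use sbt-spark. This is a sketch of the usual hand-written boilerplate, not a description of what the plugin does internally. It assumes sbt 1.1+ slash syntax and that sbt-assembly has been added to project/plugins.sbt by hand; the sparkV name and the deliberately crude merge strategy are illustrative choices.

```scala
// build.sbt WITHOUT sbt-spark: a rough sketch of the manual boilerplate

val sparkV = "2.4.3" // hypothetical helper value

// 1. Spark dependencies in the Provided scope
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkV % Provided,
  "org.apache.spark" %% "spark-sql"  % sparkV % Provided
)

// 2. Uberjar packaging: leave the Scala standard library out of the artifact
assemblyPackageScala / assembleArtifact := false

//    ...and resolve duplicate files (crude on purpose; real builds need finer rules)
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
  case _                                   => MergeStrategy.first
}

// 3. Put the Provided dependencies back on the classpath for `sbt run`
Compile / run := Defaults
  .runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner)
  .evaluated
```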
Well, it's not really that different. This could even be considered a slimmed-down version of the same "utility", just catering to a different audience.
sbt-spark-packages is meant to be used by developers who want to write extensions on top of Spark: packages that other Spark applications can use. It is therefore very focused on giving you a good starting point plus a platform to distribute your packages to other users.
sbt-spark is meant to be a boilerplate-free starting point for writing Spark applications. Its main audience is Spark developers who write end-of-the-world Spark applications. sbt-spark could also be useful to support tooling around writing Spark applications (e.g. test harness libraries), which still requires the same starting point but does not fit into the Spark package concept.
I'm a library author targeting different versions of Spark, can this plugin support "cross Spark compiling"?
Not yet, but it might be doable. It's also questionable whether this is the right project for it.
Copyright 2017 Antonio Alonso Dominguez
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.