The current version is available for Scala 2.11. Support for Scala 2.10 could be added back, and 2.12 should be supported soon (via ammonium / Ammonite).
Table of contents
- Quick start
- Extra launcher options
- Comparison to alternatives
- Status / disclaimer
- Big data frameworks
- Scio / Beam
- Special commands / API
- Jupyter installation
- Compiling it
Quick start
Simply run the jupyter-scala script of this repository to install the kernel. Launch it with --help to list the available (non-mandatory) options.
Once installed, the kernel should be listed by jupyter kernelspec list.
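To check from a script that the install worked, the output of jupyter kernelspec list can be scanned for the kernel ID. The following is a minimal sketch, assuming the default ID scala and the usual two-column output (kernel name, then path); has_kernel is a hypothetical helper name, not something shipped with jupyter-scala.

```shell
# has_kernel ID: reads `jupyter kernelspec list` output on stdin and
# succeeds if a kernel named ID appears in the first column of a line.
has_kernel() {
  awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Only query Jupyter if it is actually on the PATH.
if command -v jupyter >/dev/null 2>&1; then
  if jupyter kernelspec list | has_kernel scala; then
    echo "scala kernel is installed"
  else
    echo "scala kernel not found; run the jupyter-scala script first"
  fi
fi
```

If the kernel was installed under a custom ID, pass that ID to has_kernel instead of scala.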
Extra launcher options
Some options can be passed to the jupyter-scala script / launcher.
- The kernel ID (scala) can be changed with --id custom (this allows installing the kernel alongside already installed Scala kernels).
- The kernel name, which appears in the Jupyter Notebook UI, can be changed with --name "Custom name".
- If a kernel with the same ID is already installed and should be erased, the --force option should be specified.
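For instance, the three options can be combined to install a second copy of the kernel next to an existing one. This is only a sketch: the ID and name below are placeholder values, and it assumes the jupyter-scala script sits in the current directory.

```shell
# Placeholder values; pick any ID / name not already in use.
KERNEL_ID="scala-custom"
KERNEL_NAME="Scala (custom)"

# Run the launcher only if it is present and executable here.
if [ -x ./jupyter-scala ]; then
  # --force overwrites a kernel previously installed under the same ID.
  ./jupyter-scala --id "$KERNEL_ID" --name "$KERNEL_NAME" --force
else
  echo "jupyter-scala script not found in $(pwd)" >&2
fi
```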
Comparison to alternatives
There are already a few notebook UIs or Jupyter kernels for Scala out there:
- the ones originating from IScala,
- the ones originating from scala-notebook,
- the ones affiliated with Apache.
Compared to them, jupyter-scala aims at being versatile, allowing support for big data frameworks to be added on-the-fly. It aims at building on the nice features of both Jupyter (alternative UIs, ...) and Ammonite: it is now based on an only slightly modified version of Ammonite (ammonium), so most of what can be done via notebooks can also be done in the console via ammonium. jupyter-scala is not tied to specific versions of Spark: one can add support for a given version in one notebook, and support for another version in another notebook.
Status / disclaimer
jupyter-scala tries to build on top of both Jupyter and Ammonite. Both are widely used and well tested / reliable. The specific features of jupyter-scala (support for big data frameworks in particular) should be relied on with caution: some are just proofs of concept for now (support for Flink, Scio), others see more use, but in specific contexts (support for Spark, used quite a lot on YARN at my current company, but whose status is unknown with other cluster managers).
Big data frameworks
Status: some specific uses (Spark on YARN) are well tested in particular contexts (especially with the previous version, less so with the current one for now); the status of others (Mesos, standalone clusters) is unknown with the current code base.
    import $exclude.`org.slf4j:slf4j-log4j12`, $ivy.`org.slf4j:slf4j-nop:1.7.21` // for cleaner logs
    import $profile.`hadoop-2.6`
    import $ivy.`org.apache.spark::spark-sql:2.1.0` // adjust spark version - spark >= 2.0
    import $ivy.`org.apache.hadoop:hadoop-aws:2.6.4`
    import $ivy.`org.jupyter-scala::spark:0.4.2` // for JupyterSparkSession (SparkSession aware of the jupyter-scala kernel)

    import org.apache.spark._
    import org.apache.spark.sql._
    import jupyter.spark.session._

    val sparkSession = JupyterSparkSession.builder() // important - call this rather than SparkSession.builder()
      .jupyter() // this method must be called straightaway after builder()
      // .yarn("/etc/hadoop/conf") // optional, for Spark on YARN - argument is the Hadoop conf directory
      // .emr("2.6.4") // on AWS ElasticMapReduce, this adds aws-related to the spark jar list
      // .master("local") // change to "yarn-client" on YARN
      // .config("spark.executor.instances", "10")
      // .config("spark.executor.memory", "3g")
      // .config("spark.hadoop.fs.s3a.access.key", awsCredentials._1)
      // .config("spark.hadoop.fs.s3a.secret.key", awsCredentials._2)
      .appName("notebook")
      .getOrCreate()
SparkSessions should not be created manually. Only the ones from the org.jupyter-scala::spark library are aware of the kernel, and set up the SparkSession accordingly (passing it the loaded dependencies, the kernel build products, etc.).
Note that no Spark distribution is required for the kernel to work. In particular, on YARN, the call to .yarn(...) above itself generates the so-called Spark assembly (or list of JARs with Spark 2), which is shipped to the driver and executors.
    import $exclude.`org.slf4j:slf4j-log4j12`, $ivy.`org.slf4j:slf4j-nop:1.7.21`, $ivy.`org.slf4j:log4j-over-slf4j:1.7.21` // for cleaner logs
    import $ivy.`org.jupyter-scala::flink-yarn:0.4.2`

    import jupyter.flink._

    addFlinkImports()

    sys.props("FLINK_CONF_DIR") = "/path/to/flink-conf-dir" // directory, should contain flink-conf.yaml

    interp.load.cp("/etc/hadoop/conf")

    val cluster = FlinkYarn(
      taskManagerCount = 2,
      jobManagerMemory = 2048,
      taskManagerMemory = 2048,
      name = "flink",
      extraDistDependencies = Seq(
        s"org.apache.hadoop:hadoop-aws:2.7.3" // required on AWS ElasticMapReduce
      )
    )

    val env = JupyterFlinkRemoteEnvironment(cluster.getJobManagerAddress)
Scio / Beam
    import $ivy.`org.jupyter-scala::scio:0.4.2`

    import jupyter.scio._

    import com.spotify.scio._
    import com.spotify.scio.accumulators._
    import com.spotify.scio.bigquery._
    import com.spotify.scio.experimental._

    val sc = JupyterScioContext(
      "runner" -> "DataflowPipelineRunner",
      "project" -> "jupyter-scala",
      "stagingLocation" -> "gs://bucket/staging"
    ).withGcpCredential("/path-to/credentials.json") // alternatively, set the env var GOOGLE_APPLICATION_CREDENTIALS to that path
Status: TODO! (nothing for now)
Special commands / API
Being based on a slightly modified version of Ammonite, jupyter-scala allows one to
- add dependencies / repositories,
- manage pretty-printing,
- load external scripts, etc.
the same way Ammonite does, with the same API, described in its documentation.
It has some additions compared to it though:
One can exclude dependencies with, e.g., import $exclude.`org.slf4j:slf4j-log4j12`, which excludes org.slf4j:slf4j-log4j12 from subsequent dependency loading.
    publish.html(
      """
        <b>Foo</b>
        <div id="bar"></div>
      """
    )

    publish.png(png) // png: Array[Byte]

    publish.js(
      """
        console.log("hey");
      """
    )
Like for big data frameworks, support for plotting libraries can be added on-the-fly during a notebook session.
    import $ivy.`org.vegas-viz::vegas:0.3.8`

    import vegas._

    Vegas("Country Pop").
      withData(
        Seq(
          Map("country" -> "USA", "population" -> 314),
          Map("country" -> "UK", "population" -> 64),
          Map("country" -> "DK", "population" -> 80)
        )
      ).
      encodeX("country", Nom).
      encodeY("population", Quant).
      mark(Bar).
      show
Additional Vegas samples with jupyter-scala notebook are here.
    import $ivy.`org.plotly-scala::plotly-jupyter-scala:0.3.0`

    import plotly._
    import plotly.element._
    import plotly.layout._
    import plotly.JupyterScala._

    plotly.JupyterScala.init()

    val (x, y) = Seq(
      "Banana" -> 10,
      "Apple" -> 8,
      "Grapefruit" -> 5
    ).unzip

    Bar(x, y).plot()
Jupyter installation
Check that you have Jupyter installed by running jupyter --version. It should print a value >= 4.0. If that is not the case, a quick way of setting it up is to install the Anaconda Python distribution (or its lightweight counterpart, Miniconda), and then run
$ pip install jupyter
or, if Jupyter is already installed but outdated,
$ pip install --upgrade jupyter
jupyter --version should then print a value >= 4.0.
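The version check can also be scripted. The sketch below assumes that jupyter --version prints a single version number such as 4.1.0 on its first line; some Jupyter releases print a table of component versions instead, in which case the parsing would need adjusting. version_ok is a hypothetical helper name.

```shell
# version_ok VERSION: succeeds if the major component of VERSION is >= 4.
version_ok() {
  major=${1%%.*}                 # text before the first dot
  case $major in
    ''|*[!0-9]*) return 1 ;;     # not a plain number: treat as failure
  esac
  [ "$major" -ge 4 ]
}

if command -v jupyter >/dev/null 2>&1; then
  if version_ok "$(jupyter --version 2>/dev/null | head -n 1)"; then
    echo "Jupyter is recent enough"
  else
    echo "Jupyter looks older than 4.0 (or its version could not be parsed)"
  fi
else
  echo "jupyter not found on PATH"
fi
```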
jupyter-scala uses the Scala interpreter of ammonium, a slightly modified Ammonite. The interaction with Jupyter (the Jupyter protocol, ZMQ concerns, etc.) is handled in a separate project, jupyter-kernel. In a way, jupyter-scala is just a bridge between these two projects.
The API as seen from a jupyter-scala session is defined in the scala-api module, which itself depends on the api module of jupyter-kernel. The core of the kernel is in the scala module, in particular with an implementation of an Interpreter for jupyter-kernel, and implementations of the interfaces / traits defined in scala-api. jupyter-scala also has a third module, scala-cli, which deals with command-line argument parsing, and launches the kernel itself. The launcher script just runs this third module.
Compiling it
Clone the sources:
$ git clone https://github.com/alexarchambault/jupyter-scala.git
$ cd jupyter-scala
Compile and publish them:
$ sbt publishLocal
Then edit the jupyter-scala script, and set its version to 0.4.3-SNAPSHOT (the version being built / published locally). Install it:
$ ./jupyter-scala --id scala-develop --name "Scala (develop)" --force
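If the install step cannot resolve the snapshot version, it can help to confirm that sbt publishLocal actually wrote it. The following is a sketch under two assumptions: that sbt publishes to its default local Ivy repository at ~/.ivy2/local, laid out as organization/module/version, and that the modules are published under the org.jupyter-scala organization (the one used in the $ivy imports of this document). list_local_artifacts is a hypothetical helper.

```shell
# list_local_artifacts ORG VERSION [IVY_LOCAL]: prints the names of the
# modules of ORG that have VERSION in the local Ivy repository.
list_local_artifacts() {
  org=$1
  version=$2
  ivy_local=${3:-"$HOME/.ivy2/local"}   # assumed sbt default location
  for dir in "$ivy_local/$org"/*/"$version"; do
    # The glob stays unexpanded when nothing matches, hence the -d test.
    [ -d "$dir" ] && basename "$(dirname "$dir")"
  done
  return 0
}

list_local_artifacts org.jupyter-scala 0.4.3-SNAPSHOT
```

An empty output means no module with that version was found, in which case sbt publishLocal should be run again before retrying the install.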
If one wants to make changes to jupyter-kernel or ammonium, and test them via jupyter-scala, just clone their sources,
$ git clone https://github.com/alexarchambault/jupyter-kernel
$ git clone https://github.com/alexarchambault/ammonium
build them and publish them locally,
$ cd jupyter-kernel
$ sbt publishLocal

$ cd ammonium
$ sbt published/publishLocal
Then adjust jupyterKernelVersion in the build.sbt of jupyter-scala (set it to 0.8.1-SNAPSHOT), reload the SBT session used to compile / publish jupyter-scala (type reload, or exit and relaunch it), and build / publish jupyter-scala locally again (sbt publishLocal). That will make the locally published artifacts of jupyter-scala depend on the locally published ones of ammonium or jupyter-kernel.
Released under the Apache 2.0 license, see LICENSE for more details.