This is an extension of the Spark ML library (version 2.2.0) providing:
- Integrated service with a configurable classification and regression execution, cross-validation, and pre-processing.
- Several handy transformers and evaluators.
- Extension of classification and regression for the temporal domain, mainly by two kernels (which can be combined): a sliding window (delay line) and a reservoir computing network with various topologies and activation functions.
- Convenient customizable pipeline execution.
- Summary evaluation metrics.
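To illustrate what the sliding window (delay line) kernel mentioned above does conceptually, here is a minimal sketch in plain Scala. It is not the library's implementation: the object name `DelayLineSketch`, the function `delayLine`, and the parameter `windowSize` are hypothetical names chosen for this example. Each output vector concatenates `windowSize` consecutive input vectors, so a standard (non-temporal) classifier or regressor can see recent history.

```scala
object DelayLineSketch {

  // Hypothetical helper (not part of incal-spark_ml):
  // output i concatenates inputs i, i+1, ..., i+windowSize-1.
  def delayLine(series: Seq[Seq[Double]], windowSize: Int): Seq[Seq[Double]] = {
    require(windowSize > 0, "windowSize must be positive")
    // sliding would return one partial window for a too-short series, so guard against it
    if (series.size < windowSize) Seq.empty
    else series.sliding(windowSize).map(_.flatten).toVector
  }

  def main(args: Array[String]): Unit = {
    // a univariate series with four time steps
    val series = Seq(Seq(1.0), Seq(2.0), Seq(3.0), Seq(4.0))
    // yields two windows: (1.0, 2.0, 3.0) and (2.0, 3.0, 4.0)
    delayLine(series, 3).foreach(window => println(window.mkString(", ")))
  }
}
```

In a supervised setting, each window would typically be paired with the label or target value that follows (or coincides with) its last time step; that alignment is an assumption of this sketch, not a statement about the library's defaults.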
All you need is Scala 2.11. To pull the library, add the following dependency to build.sbt:
"org.in-cal" %% "incal-spark_ml" % "0.1.0"
or to pom.xml (if you use Maven):
<dependency>
    <groupId>org.in-cal</groupId>
    <artifactId>incal-spark_ml_2.11</artifactId>
    <version>0.1.0</version>
</dependency>
Once you have the incal-spark_ml library on your classpath, you are ready to go. To conveniently launch Spark ML-based command-line apps, you can use the SparkMLApp class, which automatically creates and injects two resources: SparkSession and SparkMLService. You can explore and run the following examples demonstrating the basic functionality (all data is public):
- Simple classification - for Iris data set
- Classification with a custom Spark config - for Iris data set
- Classification with cross-validation - for Iris data set
- Simple regression - for Abalone data set
as well as example classifications and regressions for temporal problems:
- Temporal classification with sliding window (delay line) - for EEG eye movement time series
- Temporal classification with a reservoir kernel - for EEG eye movement time series
- Temporal regression with a sliding window (delay line) - for S&P time series
- Temporal regression with a reservoir kernel - for S&P time series
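For readers unfamiliar with reservoir kernels, the following plain-Scala sketch shows the general idea behind a reservoir computing (echo state) network. It is an illustration under stated assumptions and not the library's implementation: all names (`ReservoirSketch`, `Reservoir`, `randomReservoir`, `states`, `weightScale`) are hypothetical, the topology is a fully connected random one, and the activation function is fixed to tanh. The reservoir weights are random and never trained; only a readout model fitted on the resulting states would be learned.

```scala
import scala.util.Random

object ReservoirSketch {

  // Fixed random weights (hypothetical structure, not the library's API)
  final case class Reservoir(
    inputWeights: Array[Array[Double]],     // reservoirSize x inputDim
    recurrentWeights: Array[Array[Double]]  // reservoirSize x reservoirSize
  )

  // Draws all weights uniformly from [-weightScale, weightScale).
  // Simplification: real setups often rescale the recurrent matrix by its spectral radius.
  def randomReservoir(inputDim: Int, reservoirSize: Int, weightScale: Double, seed: Long): Reservoir = {
    val rnd = new Random(seed)
    def matrix(rows: Int, cols: Int): Array[Array[Double]] =
      Array.fill(rows, cols)((rnd.nextDouble() * 2 - 1) * weightScale)
    Reservoir(matrix(reservoirSize, inputDim), matrix(reservoirSize, reservoirSize))
  }

  // State update: x(t) = tanh(W_in * u(t) + W * x(t-1)), starting from x = 0.
  // Returns one reservoir state per input time step.
  def states(reservoir: Reservoir, inputs: Seq[Array[Double]]): Seq[Array[Double]] = {
    val size = reservoir.recurrentWeights.length
    def dot(a: Array[Double], b: Array[Double]): Double =
      a.indices.map(i => a(i) * b(i)).sum
    inputs.scanLeft(Array.fill(size)(0.0)) { (state, input) =>
      Array.tabulate(size) { i =>
        math.tanh(dot(reservoir.inputWeights(i), input) + dot(reservoir.recurrentWeights(i), state))
      }
    }.tail // drop the initial all-zero state
  }

  def main(args: Array[String]): Unit = {
    val reservoir = randomReservoir(inputDim = 1, reservoirSize = 5, weightScale = 0.5, seed = 42L)
    val series = Seq(0.1, 0.4, 0.9, 0.3).map(v => Array(v))
    states(reservoir, series).foreach(s => println(s.map(v => f"$v%.3f").mkString(" ")))
  }
}
```

The reservoir states produced this way would then serve as feature vectors for an ordinary classifier or regressor, which is the role a reservoir kernel plays in the temporal examples listed above.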
Note that time-series classification (and prediction) using convolutional neural networks and LSTMs is provided by the InCal DL4J library.
Development of this library has been significantly supported by a one-year MJFF Grant (2018-2019): Scalable Machine Learning And Reservoir Computing Platform for Analyzing Temporal Data Sets in the Context of Parkinson’s Disease and Biomedicine