This is an extension of the Spark ML library (version 2.2.0) providing:
- An integrated service with configurable classification and regression execution, cross-validation, and pre-processing.
- Several handy transformers and evaluators.
- Extensions of classification and regression to the temporal domain, mainly through two kernels that can be combined: a sliding window (delay line) and a reservoir computing network with various topologies and activation functions.
- Convenient customizable pipeline execution.
- Summary evaluation metrics.
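To illustrate the sliding window (delay line) idea from the list above, here is a minimal sketch in plain Scala. It is not the library's implementation: the `DelayLine` object and the `slidingWindow` function are hypothetical names used for illustration only. Each output row concatenates the feature vectors of `windowSize` consecutive time steps, so a standard (non-temporal) classifier or regressor can see recent history.

```scala
// Hypothetical helper for illustration; not part of incal-spark_ml.
object DelayLine {

  // Turns a time-ordered series of feature vectors into windowed vectors.
  // Each output row is the concatenation of `windowSize` consecutive inputs,
  // so the result has (series.size - windowSize + 1) rows.
  def slidingWindow(series: Seq[Seq[Double]], windowSize: Int): Seq[Seq[Double]] = {
    require(windowSize > 0, "windowSize must be positive")
    // Guard: `sliding` would otherwise emit one partial window for short series.
    if (series.size < windowSize) Vector.empty
    else series.sliding(windowSize).map(_.flatten).toVector
  }
}
```

For example, a univariate series with values 1.0, 2.0, 3.0, 4.0 and a window size of 2 produces three rows, each holding two consecutive values. A larger window gives the model more context at the cost of wider feature vectors and fewer training rows.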
Once you have the incal-spark_ml lib on your classpath, you are ready to go. To conveniently launch Spark ML-based command-line apps, you can use the SparkMLApp class, which comes with two automatically created and injected resources: SparkSession and SparkMLService. You can explore and run the following examples demonstrating the basic functionality (all data is public):
- Simple classification - for Iris data set
- Classification with a custom Spark config - for Iris data set
- Classification with cross-validation - for Iris data set
- Simple regression - for Abalone data set
as well as example classifications and regressions for temporal problems:
- Temporal classification with sliding window (delay line) - for EEG eye movement time series
- Temporal classification with a reservoir kernel - for EEG eye movement time series
- Temporal regression with a sliding window (delay line) - for S&P time series
- Temporal regression with a reservoir kernel - for S&P time series
and clustering:
- Simple clustering - for Iris data set
Note that time-series classifications (and predictions) using convolutional neural networks and LSTMs are provided by the InCal DL4J library.
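To give a feel for what a reservoir kernel computes in the temporal examples above, the following is a minimal echo-state-style sketch in plain Scala. It is an assumption-laden illustration, not the library's implementation: the `ReservoirSketch` object, its function names, the fully connected random topology, the weight scales, and the `tanh` activation are all hypothetical choices. The reservoir state is updated as x(t+1) = tanh(W_in u(t) + W x(t)), and the resulting states would then serve as features for a downstream classifier or regressor.

```scala
import scala.util.Random

// Hypothetical sketch for illustration; not part of incal-spark_ml.
object ReservoirSketch {

  // Random weights drawn uniformly from [-scale, scale].
  def randomMatrix(rows: Int, cols: Int, scale: Double, rnd: Random): Array[Array[Double]] =
    Array.fill(rows, cols)((rnd.nextDouble() * 2 - 1) * scale)

  // One state update: x(t+1) = tanh(W_in * u(t) + W * x(t)).
  def step(
    wIn: Array[Array[Double]],
    w: Array[Array[Double]],
    state: Array[Double],
    input: Array[Double]
  ): Array[Double] =
    wIn.indices.map { i =>
      val fromInput = wIn(i).zip(input).map { case (a, b) => a * b }.sum
      val fromState = w(i).zip(state).map { case (a, b) => a * b }.sum
      math.tanh(fromInput + fromState)
    }.toArray

  // Feeds the whole series through the reservoir, starting from a zero state,
  // and returns the reservoir state after each input.
  def run(
    wIn: Array[Array[Double]],
    w: Array[Array[Double]],
    series: Seq[Array[Double]]
  ): Seq[Array[Double]] = {
    val initial = Array.fill(w.length)(0.0)
    series.scanLeft(initial)((state, u) => step(wIn, w, state, u)).tail
  }
}
```

In this sketch only the readout trained on top of the states would be learned; the input and recurrent weights stay fixed after random initialization, which is the defining trait of reservoir computing.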
All you need is Scala 2.11. To pull the library, add the following dependency to build.sbt:
"org.in-cal" %% "incal-spark_ml" % "0.3.0"
or to pom.xml (if you use Maven):
<dependency>
<groupId>org.in-cal</groupId>
<artifactId>incal-spark_ml_2.11</artifactId>
<version>0.3.0</version>
</dependency>
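For sbt users, the dependency shown above would typically sit in a build.sbt along the following lines. This is a minimal sketch: the project name and the exact `scalaVersion` patch release are assumptions (any 2.11.x release should satisfy the Scala 2.11 requirement), while the dependency coordinates are the ones given above.

```scala
// build.sbt (sketch; project name and patch version are assumptions)
name := "my-spark-ml-app"

// The library is published for Scala 2.11, so %% resolves to incal-spark_ml_2.11.
scalaVersion := "2.11.12"

libraryDependencies += "org.in-cal" %% "incal-spark_ml" % "0.3.0"
```

The `%%` operator appends the Scala binary version to the artifact name, which is why the Maven snippet spells out `incal-spark_ml_2.11` explicitly.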
Development of this library has been significantly supported by a one-year MJFF Grant (2018-2019): Scalable Machine Learning And Reservoir Computing Platform for Analyzing Temporal Data Sets in the Context of Parkinson’s Disease and Biomedicine.