# Spark ML SOM (Self-Organizing Map)

**NEW**: cost function history can be retrieved in SOMTrainingSummary, in order to check convergence! (see Quickstart)

SparkML-SOM is the only available distributed implementation of Kohonen's Self-Organizing-Map algorithm built on top of Spark ML (the Dataset-based API of Spark MLlib) and fully compatible with Spark versions 2.2.0 and newer. It extends Spark's `Estimator`

and `Model`

classes.

- SparkML-SOM can be used as any other MLlib algorithm with a simple
`fit`

+`transform`

syntax - It is compatible with Datasets/DataFrames
- It can be integrated in a Spark ML Pipeline
- It leverages fast native linear algebra with BLAS

The implemented algorithm is the Kohonen batch algorithm, which is very close to the `org.apache.spark.ml.clustering.KMeans`

and `org.apache.spark.mllib.clustering.KMeans`

).

The same algorithm was implemented by one of my colleagues: https://github.com/TugdualSarazin/spark-clustering (project now maintained by C4E). This version is meant to be simpler to use and more concise, performant and compatible with Spark ML Pipelines and Datasets/DataFrames.

**This code will soon be integrated into the C4E clustering project**, so be sure to check out this project if you want to explore more clustering algorithms. In case you only need SOM, keep using this code which will remain independent and up-to-date.

## Quickstart

```
import xyz.florentforest.spark.ml.som.SOM
val data: DataFrame = ???
val som = new SOM()
.setHeight(20)
.setWidth(20)
val model = som.fit(data)
val summary = model.summary // training summary
val res: DataFrame = summary.predictions
// or predict on another dataset
val res: DataFrame = model.transform(otherData)
```

Retrieve the cost function history to check convergence:

```
val cost: Array[Double] = summary.objectiveHistory
println(cost.mkString("[", ",", "]"))
```

...now plot it easily in your favorite visualization tool!

## Installation

The sparkml-som artifact is available on Maven Central, so the quickest way to use this package in your projects is by simply adding this dependency line to sbt:

`"xyz.florentforest" %% "sparkml-som" % "0.2.1"`

An alternative way, if for example your want to modify this project, is to fork/clone the repository, compile it using sbt and publish it locally:

```
$ git clone git@github.com:FlorentF9/sparkml-som.git
$ cd sparkml-som
$ sbt publishLocal
```

Then, add the same dependency line to your sbt.

## Parameters

Self-organizing maps essentially depend on their topology, the neighborhood function and the neighborhood radius decay. The algorithm uses a temperature parameter that decays after each iteration and controls the neighborhood radius. It starts at a value

**Map grid topology**(`topology`

)- rectangular
*(default)*

- rectangular
**Height and width**:`height`

*(default=10)*,`width`

*(default=10)***Neighborhood kernel**(`neighborhoodKernel`

)- gaussian
*(default)* - rectangular window

- gaussian
**Temperature (or radius) decay**(`temperatureDecay`

)- exponential
*(default)* - linear

- exponential
**Initial and final temperatures**:`tMax`

*(default=10.0)*,`tMin`

*(default=1.0)***Maximum number of iterations**:`maxIter`

*(default=20)***Tolerance (for convergence)**:`tol`

*(default=1e-4)*

## Implementation details

The package depends only on spark (core, sql and mllib) and netlib for native linear algebra. It will use native BLAS libraries if possible. Because of classes and methods marked as private in spark, some utility and linear algebra code from spark had to be included into the project: *util.SchemaUtils*, *util.MLUtils* and *linalg.BLAS*. I kept the original license and tried to keep the code minimal with only the parts needed by SOM.

## To-dos

- Add hexagonal grid topology
- Add visualization capabilities
- I did not extend MLWritable/MLReadable yet, so the model cannot be saved or loaded. However, as all the parameteres are stored in the
`SOMModel.prototypes`

variable of type`Array[Vector]`

, it is straightforward to save the parameters into a file.