autodeployai / pmml4s-spark

PMML scoring library for Spark as SparkML Transformer

PMML4S-Spark

PMML4S-Spark is a PMML (Predictive Model Markup Language) scoring library for Spark, implemented as a SparkML Transformer.

Features

PMML4S-Spark is the Spark wrapper for PMML4S; see the PMML4S project for details.
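Because the scoring model is described as a SparkML Transformer, it should be usable as a stage of a regular SparkML Pipeline. The following is a minimal sketch under that assumption (that `ScoreModel` extends `org.apache.spark.ml.Transformer`); the object name `PipelineSketch`, the helper `scoringPipeline`, the application name, and the local file names are hypothetical.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.pmml4s.spark.ScoreModel

object PipelineSketch {
  // Wraps a loaded PMML model as the only stage of a SparkML pipeline.
  // Assumption: ScoreModel is a Transformer, so fit() has nothing to train
  // and only validates the schema of the sample DataFrame.
  def scoringPipeline(model: ScoreModel, sample: DataFrame): PipelineModel =
    new Pipeline().setStages(Array(model)).fit(sample)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pmml4s-spark-pipeline") // hypothetical application name
      .master("local[*]")               // assumption: local run
      .getOrCreate()
    try {
      // Assumption: both files have been downloaded to the working directory.
      val model = ScoreModel.fromFile("single_iris_dectree.xml")
      val df = spark.read
        .options(Map("header" -> "true", "inferSchema" -> "true"))
        .csv("Iris.csv")

      val scored = scoringPipeline(model, df).transform(df)
      scored.printSchema()
    } finally {
      spark.stop()
    }
  }
}
```

Further stages, such as feature transformers that prepare the model's input columns, could be placed before the scoring stage in the same array.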

Prerequisites

  • Spark >= 2.0.0

Installation

PMML4S-Spark is available from Maven Central.

SBT users

libraryDependencies += "org.pmml4s" %% "pmml4s-spark" % "0.9.3"

Maven users (the artifact suffix is the Scala binary version, for example 2.12)

<dependency>
  <groupId>org.pmml4s</groupId>
  <artifactId>pmml4s-spark_${scala.version}</artifactId>
  <version>0.9.3</version>
</dependency>
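For context, a complete `build.sbt` might look roughly like the sketch below. Only the `pmml4s-spark` coordinates come from this README; the project name, the Scala version, and the Spark version are assumptions chosen for illustration and should be replaced with the versions of your own cluster.

```scala
// build.sbt (sketch; everything except the pmml4s-spark line is an assumption)
name := "pmml4s-spark-example"   // hypothetical project name
scalaVersion := "2.12.18"        // assumption: a Scala version your Spark build supports

val sparkVersion = "2.4.8"       // assumption: any Spark >= 2.0.0 built for the Scala version above

libraryDependencies ++= Seq(
  // Spark is normally supplied by the cluster at runtime, hence the Provided scope.
  "org.apache.spark" %% "spark-sql"    % sparkVersion % Provided,
  "org.apache.spark" %% "spark-mllib"  % sparkVersion % Provided,
  "org.pmml4s"       %% "pmml4s-spark" % "0.9.3"
)
```

With the `Provided` scope the Spark jars are not on the classpath of `sbt run`; drop the scope if you want to run the application locally from sbt.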

Usage

  1. Load model.

    import scala.io.Source
    import org.pmml4s.model.Model
    import org.pmml4s.spark.ScoreModel
    
    // The main constructor accepts an object of org.pmml4s.model.Model
    val model = ScoreModel(Model(Source.fromURL(new java.net.URL("http://dmg.org/pmml/pmml_examples/KNIME_PMML_4.1_Examples/single_iris_dectree.xml"))))

    or

    import org.pmml4s.spark.ScoreModel
    
    // Load the model with one of the helper methods, which accept a pathname, a file object, a string, an array of bytes, or an input stream.
    val model = ScoreModel.fromFile("single_iris_dectree.xml")

  2. Call transform(dataset) to run batch scoring against an input dataset.

    // The data is from http://dmg.org/pmml/pmml_examples/Iris.csv
    val df = spark.read.
      format("csv").
      options(Map("header" -> "true", "inferSchema" -> "true")).
      load("Iris.csv")
    
    val scoreDf = model.transform(df)

    scala> scoreDf.show(5)
    +------------+-----------+------------+-----------+-----------+---------------+-----------+-----------------------+---------------------------+--------------------------+-------+
    |sepal_length|sepal_width|petal_length|petal_width|      class|predicted_class|probability|probability_Iris-setosa|probability_Iris-versicolor|probability_Iris-virginica|node_id|
    +------------+-----------+------------+-----------+-----------+---------------+-----------+-----------------------+---------------------------+--------------------------+-------+
    |         5.1|        3.5|         1.4|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    |         4.9|        3.0|         1.4|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    |         4.7|        3.2|         1.3|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    |         4.6|        3.1|         1.5|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    |         5.0|        3.6|         1.4|        0.2|Iris-setosa|    Iris-setosa|        1.0|                    1.0|                        0.0|                       0.0|      1|
    +------------+-----------+------------+-----------+-----------+---------------+-----------+-----------------------+---------------------------+--------------------------+-------+
    only showing top 5 rows
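The two steps above can be combined into a small standalone application. The sketch below uses only `ScoreModel.fromFile` and `transform` from this README plus standard Spark SQL calls; the object name `IrisScoring`, the helpers `score` and `accuracy`, the application name, and the local file paths are hypothetical. The column names `class`, `predicted_class`, and `probability` are taken from the sample output above and will differ for other models.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.pmml4s.spark.ScoreModel

object IrisScoring {
  // Scores the input and keeps the label next to the prediction.
  def score(model: ScoreModel, input: DataFrame): DataFrame =
    model.transform(input).select("class", "predicted_class", "probability")

  // Fraction of rows whose predicted class equals the labelled class.
  def accuracy(scored: DataFrame): Double = {
    val total = scored.count()
    if (total == 0) 0.0
    else scored.filter(col("class") === col("predicted_class")).count().toDouble / total
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pmml4s-spark-iris") // hypothetical application name
      .master("local[*]")           // assumption: local run
      .getOrCreate()
    try {
      // Assumption: the model and the data were downloaded to the working directory.
      val model = ScoreModel.fromFile("single_iris_dectree.xml")
      val df = spark.read
        .options(Map("header" -> "true", "inferSchema" -> "true"))
        .csv("Iris.csv")

      val scored = score(model, df)
      scored.show(5)
      println(f"accuracy = ${accuracy(scored)}%.3f")
    } finally {
      spark.stop()
    }
  }
}
```

Keeping `score` and `accuracy` as separate functions makes it straightforward to reuse them from a notebook or a test with a different model or dataset.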

Use in PySpark

See the PyPMML-Spark project.

Support

If you have any questions about the PMML4S-Spark library, please open an issue on this repository.

Feedback and contributions to the project, no matter what kind, are always very welcome.

License

PMML4S-Spark is licensed under the Apache License 2.0.