A data source library for loading data from an IBM® Netezza® database into Apache Spark as SQL DataFrames. The library uses the Netezza external table mechanism to transfer data efficiently from the Netezza host system to the Apache Spark system.

You can download the Netezza Connector for Apache Spark assembly jars here:
Apache Spark Version | Release # | Binary Location |
---|---|---|
1.5.2+ | v0.1.1 | [spark-netezza-assembly-0.1.1.jar](https://github.com/SparkTC/spark-netezza/releases/download/v0.1.1/spark-netezza-assembly-0.1.1.jar) |
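The assembly jar can also be supplied directly on the classpath; for example (a sketch, assuming the jar was downloaded to the working directory and the Netezza JDBC driver is available locally):

```
spark-shell --jars spark-netezza-assembly-0.1.1.jar --driver-class-path /path/to/nzjdbc.jar
```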
Alternatively, the spark-netezza connector is a registered package on the Spark Packages site, so it supports the `--packages` command-line option to resolve its Maven dependencies. You can invoke it from spark-shell, spark-sql, pyspark, or spark-submit, depending on your scenario. The Netezza JDBC driver is not bundled and must be provided separately.

For example, for spark-shell:
```
spark-shell --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path /path/to/nzjdbc.jar
```
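The same pattern applies to the other launchers; for example, for pyspark (a sketch, assuming the same package coordinates and driver path):

```
pyspark --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path /path/to/nzjdbc.jar
```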
You can use the spark-netezza connector via the Apache Spark Data Sources API in Scala, Java, Python, or SQL, as follows:
Scala:

```scala
import org.apache.spark.sql._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val opts = Map("url" -> "jdbc:netezza://netezzahost:5480/database",
  "user" -> "username",
  "password" -> "password",
  "dbtable" -> "tablename",
  "numPartitions" -> "4")
val df = sqlContext.read.format("com.ibm.spark.netezza")
  .options(opts).load()
```
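The result is an ordinary Spark DataFrame, so the usual transformations and actions apply. A minimal sketch, assuming the Netezza table has a numeric column named `ID` (hypothetical):

```scala
df.printSchema()                           // column names and types mapped from the Netezza table
df.select("ID").filter("ID > 100").show()  // evaluated by Spark on the transferred data
```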
Java:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;

JavaSparkContext sc = // existing SparkContext
SQLContext sqlContext = new SQLContext(sc);

Map<String, String> opts = new HashMap<>();
opts.put("url", "jdbc:netezza://netezzahost:5480/database");
opts.put("user", "username");
opts.put("password", "password");
opts.put("dbtable", "tablename");
opts.put("numPartitions", "4");

DataFrame df = sqlContext.read()
  .format("com.ibm.spark.netezza")
  .options(opts)
  .load();
```
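As in Scala, the returned DataFrame can be registered as a temporary table with `df.registerTempTable("tablename")` and then queried through `sqlContext.sql(...)`.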
Python:

```python
from pyspark.sql import SQLContext

sc = # existing SparkContext
sqlContext = SQLContext(sc)

df = sqlContext.read \
  .format('com.ibm.spark.netezza') \
  .option('url', 'jdbc:netezza://netezzahost:5480/database') \
  .option('user', 'username') \
  .option('password', 'password') \
  .option('dbtable', 'tablename') \
  .option('numPartitions', '4') \
  .load()
```
SQL:

```sql
CREATE TABLE my_table
USING com.ibm.spark.netezza
OPTIONS (
  url 'jdbc:netezza://netezzahost:5480/database',
  user 'username',
  password 'password',
  dbtable 'tablename'
);
```
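Once defined, the table can be queried like any other Spark SQL table, for example:

```sql
SELECT COUNT(*) FROM my_table;
```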
The following configuration options can be passed to the data source via `option`/`options` (or the SQL `OPTIONS` clause):
Name | Description |
---|---|
url | The JDBC URL to connect to the database. |
user | Username to connect to the database. |
password | Password to connect to the database. |
dbtable | The table in the database to load. |
numPartitions | Number of partitions to use for parallel data transfer. |
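For example, with `numPartitions` set to 4 as above, the load is split into four parallel transfers. This can be checked on the resulting DataFrame (a sketch, reusing `df` from the Scala example; the exact count may vary with how the connector partitions the table):

```scala
// Each partition is transferred from Netezza independently.
println(df.rdd.partitions.length) // expected: 4
```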
The spark-netezza build targets Scala 2.10 by default. If you need a Scala 2.11 artifact, see the Scala 2.11 and version cross-build instructions below.
To generate the regular binary, run the following in the root directory:

```
build/sbt package
```

To generate the assembly jar, run:

```
build/sbt assembly
```

The artifacts are generated under:

```
spark-netezza/target/scala-{binary.version}/
```

To run the tests, run:

```
build/sbt test
```
To build against Scala 2.11, use the `++` option with the desired version number, for example:

```
build/sbt ++2.11.7 assembly
build/sbt ++2.11.7 test
```
To produce artifacts for both Scala 2.10 and 2.11, start sbt:

```
build/sbt
```

then run the following in the sbt shell:

```
+ package
+ test
```
The spark-netezza connector supports the sbt-spark-package plugin. To publish to the local Ivy repository, run:

```
build/sbt spPublishLocal
```
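The locally published artifact can then be resolved by other local builds, or at launch time with the same `--packages` coordinates shown above (assuming the default version in the build).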