tupol / spark-tools

Executable Apache Spark Tools: Format Converter & SQL Processor

Spark Tools

Description

This project contains a few basic runnable tools that help with common tasks in a Spark-based project.

The main tools available:

  • FormatConverter: converts any supported file format into a different file format, with partitioning support.
  • SimpleSqlProcessor: applies a given SQL query to the input files, which are mapped into tables.
  • StreamingFormatConverter: converts any supported data stream format into a different data stream format, with partitioning support.
  • SimpleFileStreamingSqlProcessor: applies a given SQL query to the input file streams, which are mapped into file output streams.
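
To make the first two tools more concrete, here is a minimal sketch of the underlying idea using only the plain Spark 2.4 API. It is not the implementation or the configuration interface of FormatConverter or SimpleSqlProcessor; the object name `ConversionSketch`, the helper names `convert` and `runSql`, the file paths and the `country` column are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical illustration only; not part of spark-tools.
object ConversionSketch {

  /** Reads `inputPath` as `inputFormat` and writes it as `outputFormat`,
    * optionally partitioned by `partitionColumns`. */
  def convert(
      spark: SparkSession,
      inputFormat: String,
      inputPath: String,
      outputFormat: String,
      outputPath: String,
      partitionColumns: Seq[String] = Seq.empty
  ): Unit = {
    // The "header" option only matters for CSV input; other sources ignore it.
    val input: DataFrame =
      spark.read.format(inputFormat).option("header", "true").load(inputPath)
    val writer = input.write.format(outputFormat).mode("overwrite")
    val partitioned =
      if (partitionColumns.isEmpty) writer
      else writer.partitionBy(partitionColumns: _*)
    partitioned.save(outputPath)
  }

  /** Registers each input as a temporary view named by its map key,
    * then runs `sql` against those views. */
  def runSql(spark: SparkSession, tables: Map[String, DataFrame], sql: String): DataFrame = {
    tables.foreach { case (name, df) => df.createOrReplaceTempView(name) }
    spark.sql(sql)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("conversion-sketch")
      .master("local[*]")
      .getOrCreate()

    // Assumed input: a CSV file with a header row that includes a `country` column.
    convert(spark, "csv", "/tmp/in/people.csv", "parquet", "/tmp/out/people", Seq("country"))

    val people = spark.read.parquet("/tmp/out/people")
    runSql(spark, Map("people" -> people),
      "SELECT country, count(*) AS n FROM people GROUP BY country").show()

    spark.stop()
  }
}
```

The tools in this project wrap this kind of read, transform and write pipeline in runnable applications, so the formats, paths and SQL are supplied as configuration rather than hard-coded as they are in this sketch.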

This project also aims to create and encourage a friendly yet professional environment where developers help each other, so please do not be shy and join through Gitter, Twitter, issue reports or pull requests.

Prerequisites

  • Java 8 or higher
  • Scala 2.11 or 2.12
  • Apache Spark 2.4.X

Getting Spark Tools

Spark Tools is published to Maven Central and Spark Packages, where the latest artifacts can be found.

  • Group id / organization: org.tupol
  • Artifact id / name: spark-tools
  • Latest version is 0.4.0

To use Spark Tools with SBT, add a dependency on the latest version to your sbt build definition file:

libraryDependencies += "org.tupol" %% "spark-tools" % "0.4.0"
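
For context, a slightly fuller build definition might look like the sketch below. The project name, the exact Scala and Spark patch versions, and the choice to mark Spark as `Provided` (on the assumption that the cluster supplies it at runtime) are illustrative assumptions, not settings prescribed by this project.

```scala
// build.sbt (hypothetical project; versions other than spark-tools are assumptions)
name := "my-spark-app"

// Matches the `_2.11` artifact suffix used with `--packages`
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Spark itself is assumed to be supplied by the runtime environment
  "org.apache.spark" %% "spark-sql" % "2.4.6" % Provided,
  "org.tupol" %% "spark-tools" % "0.4.0"
)
```

The `%%` operator appends the Scala binary version to the artifact name, so with the settings above sbt resolves `spark-tools_2.11`.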

To include this package in your Spark applications, pass it to spark-shell or spark-submit through the --packages option:

$SPARK_HOME/bin/spark-shell --packages org.tupol:spark-tools_2.11:0.4.0

What's new?

0.4.1-SNAPSHOT

  • Added StreamingFormatConverter
  • Added FileStreamingSqlProcessor, SimpleFileStreamingSqlProcessor
  • Bumped spark-utils dependency to 0.4.2
  • The project compiles with both Scala 2.11.12 and 2.12.12
  • Updated Apache Spark to 2.4.6
  • Updated delta.io to 0.6.1
  • Updated the spark-xml library to 0.10.0
  • Removed the com.databricks:spark-avro dependency, as avro support is now built into Apache Spark
  • Updated the spark-utils dependency to the latest available snapshot

For previous versions please consult the release notes.

License

This code is open source software licensed under the MIT License.