hunters-ai / spark-adaptive-file-connector   1.0.0

License: Apache License 2.0

Adaptive File Source Connector for Spark, optimised for reading from object stores

Scala versions: 2.12

Spark Adaptive File Connector

The library provides a Spark source that efficiently handles streaming from object stores for files saved under dynamically changing paths, e.g. s3://my_bucket/my-data/2022/06/09/, which can be described by the pattern s3://my_bucket/my-data/{YYYY}/{MM}/{DD}/.

Usage

Dependency

The library is available on Maven Central. If you use sbt, you can add it to your build.sbt:

libraryDependencies += "ai.hunters" %% "spark-adaptive-file-connector" % "1.0.0"

For Maven:

<dependency>
    <groupId>ai.hunters</groupId>
    <artifactId>spark-adaptive-file-connector_2.12</artifactId>
    <version>1.0.0</version>
</dependency>

or use the equivalent coordinates for any other dependency manager.
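For example, with Gradle the same Maven coordinates should resolve (a sketch; the README only documents sbt and Maven, so this line is an assumption based on the standard Gradle dependency syntax):

implementation 'ai.hunters:spark-adaptive-file-connector_2.12:1.0.0'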

Spark Application

This is the simplest example of using the connector in a Scala application:

val readFileJsonStream = spark.readStream
  .format("dynamic-paths-file")
  .option("fileFormat", "json")
  .schema(yourSchema)
  .load("s3://my-bucket/prefix/{YYYY}/{MM}/{DD}")

Build

sbt "clean;compile"

Testing

sbt "clean;test"

Tutorial - quick run

The easiest way to start playing with this library is the integration test DynamicPathsFileStreamSourceProviderITest. It shows how to:

  • pass parameters to Spark to be able to use this source connector
  • check what paths have been read and how fast
  • recover from checkpoint
  • see output
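To run just that test, sbt's standard testOnly syntax should work (assuming the test runs under the default test configuration, as the sbt "clean;test" command above suggests):

sbt "testOnly *DynamicPathsFileStreamSourceProviderITest"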

When you use this library in your app, remember to register the DynamicPathsFileStreamSourceProvider in your DataSourceRegister service file. You can see an example of what is needed for the tests in src/test/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
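As a sketch, the registration is a plain-text java.util.ServiceLoader file on your application's classpath whose content is the provider's fully-qualified class name. The package prefix below is an assumption for illustration only; copy the exact name from the test resource linked above.

# File: src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
# NOTE: the package below is a guess; use the fully-qualified name from the
# library's own service file.
ai.hunters.spark.DynamicPathsFileStreamSourceProvider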

Video

An Advanced S3 Connector for Spark to Hunt for Cyber Attacks (Data+AI Summit, San Francisco 2022)