Spark Pipeline Driver
The goal of this project is to make writing Spark applications easier by abstracting the logic into reusable components that can be compiled into a jar, stored on the cluster and then configured dynamically using external configurations such as JSON.
This section will attempt to provide a high level idea of how the framework achieves the project goals. Visit spark-pipeline-engine for more information.
There are several concepts that help achieve the project goal:
The step is the smallest unit of work in the application. A step is a single reusable code function that can be executed by a pipeline. There are two parts to a step, the actual function and the PipelineStep metadata. The function should define the parameters that are required to execute properly and the metadata is used by the pipeline to define how those parameters are populated. The return type may be anything including Unit, but it is recommended that a PipelineStepResponse be returned.
A pipeline is a collection of steps that should be executed in a predefined order. An application may execute one or more pipelines as part of an application and are useful when there may be a need to restart processing in an application without needing to run all of the same logic again.
An execution plan allows control over how pipelines are executed. An application may choose to only have a single execution that runs one or more pipelines or several executions that run pipelines in parallel or based on a dependency structure.
Drivers are the entry point into the application. The driver is responsible for processing the input parameters and initializing the DriverSetup for the application.
Each provided driver relies on the DriverSetup trait. Each project is required to implement this trait and pass the class name as a command line parameter. The driver will then call the different functions to begin processing data. Visit spark-pipeline-engine for more information.
The Application framework is a configuration based method of describing the Spark application. This includes defining the execution plan, pipelines, pipeline context overrides (pipeline listener, security manager, step mapper) and global values.
There are several sub-projects:
This project contains the core library and is the minimum requirement for any application.
This component contains steps that are considered generic enough to be used in any project.
This component contains drivers classes that connect to various streaming technologies like Kafka and Kinesis. Each class provides a basic implementation that gathers data and then initiates the Spark Pipeline Engine Component for processing of the incoming data.
This project provides several examples to help demonstrate how to use the library.
This project provides utilities that help work with the project.
Examples of building pipelines can be found in the pipeline-drivers-examples project.
The project is built using Apache Maven. To build the project using Scala 2.11 and Spark 2.3 run:
To build the project using Scala 2.11 and Spark 2.4 run:
mvn -Dspark.compat.version=2.4 -Djson4s.version=3.5.3 -Dspark.version=2.4.3
To build the project using Scala 2.12 and Spark 2.4 run:
mvn -Dspark.compat.version=2.4 -Djson4s.version=3.5.3 -Dspark.version=2.4.3 -Dscala.compat.version=2.12 -Dscala.version=2.12.8
(This will clean, build, test and package the jars and generate documentation)
Tests are part of the main build.
- Start by forking the main GutHub repository.
- Commit all changes to the develop branch.
- Create proper scaladoc comments for any new or changed functions.
- Provide a thorough unit test for the change.
- Provide any additional documentation required by individual projects.