WASP - Wide Analytics Streaming Platform

Official documentation website : Wasp documentation

Install

Wasp final releases are published to maven central through sonatype, snapshot releases are published to sonatype maven-snapshots.

libraryDependencies += "it.agilelab" %% "wasp-consumers-spark" % "3.0.1"

or (for snapshots)

ThisBuild / resolvers += "Sonatype Central Snapshots" at "https://central.sonatype.com/repository/maven-snapshots"
libraryDependencies += "it.agilelab" %% "wasp-consumers-spark" % "3.1.0-SNAPSHOT"

To download Wasp snapshot releases you need a personal access token or a project access token with read_api scope and role Developer to GitLab Wasp project.

To get a token, open a github issue.

WASP - Wide Analytics Streaming Platform
- Table of contents
  - General
- Leverage Wasp in your project

General

Overview

WASP is a framework to build complex real time big data applications. It relies on a kind of Kappa/Lambda architecture mainly leveraging Kafka and Spark.

If you need to ingest huge amount of heterogeneous data and analyze them through complex pipelines, this is the framework for you. If you need a point and click product, this is not the tool for you.

WASP is a big data framework that allows you to not waste time with devops architectures and integrating different components. WASP lets you focus on your data, business logic and algorithms, without worrying about tipical big data problems like:

at least once or exactly once delivery
periodically training a machine learning model
publishing your results in real time, to be reactive
applying schemas to unstructured data
feeding different datastores from the same data flow in a safe way
etc.

For more technical documentation, head to the documentation folder.

WASP in the wild

WASP has been added to Cloudera Solution Gallery as an Open Source tool to simplify streaming workflows.

You can see it here!

Background

Handling huge streams of data in near real time is a hard task. So we want to build a reference architecture to speed up fast data application development and to avoid common mistakes about fault tolerance and reliability. Kafka is the central pillar of the architecture and helps to handle streams in the correct way. We have been inspired by the Kappa architecture definition.

Architecture

You can refer to the diagrams (Wasp1 and Wasp2) to gain a general overview of the architecture. The project is divided into sub modules:

wasp-core: provides all basic functionalities, pojo and utilities
wasp-master: it provides the main entry point to control your application, exposing the WASP REST API. In the future, this will also provide a complete web application for monitoring and configuration.
wasp-producers: a thin layer to easily expose endpoints for ingestion purposes. Leveraging Akka-Camel we can provide Http, Tcp, ActiveMQ, JMS, File and many other connectors. This ingestion layer pushes data into Kafka.
wasp-consumers-spark: the consumer layer incapsulates Spark Streaming to dequeue data from Kafka, apply business logic to it and then push the output on a target system.

All the components are coordinated, monitored and owned by an Akka Cluster layer, that provides scalability and fault tolerance for each component. For example you can spawn multiple identical producers to balance the load on your http endpoint, and then fairly distribute the data on Kafka.

Glossary

Pipegraph: a directed acyclic graph of data transformations. Each step is lazy and loosely coupled from previous and the next one. It is basically an ordered list of ETL blocks, with Inputs and Outputs.
ETL: represents a Spark Streaming job. It can consume data from one or more Inputs, elaborate the incoming data and push it to an Output. You can't have more than an Output for an ETL block, in order to avoid misalignment between outputs. If you want to write the same data on different datastores, you must consume the topic data with two different ETL blocks. Both Streaming and Batch ETLs are supported.
Input: a source of data for an ETL block.
Output: a destination for data produced by an ETL block. Can be any of various datastores or messaging systems.
Topic: the representation of a Kafka topic with an associated Avro schema. Can be either an Input or an Output.
Index: the representation of an index in an indexed datastore (either ElasticSearch or Solr) adn its associated schema. Can be either an Input or an Output.
KVStore: an abstraction for a Key-Value store, like Cassandra and HBase, for when you need high performance access by key. Can only be used as an Output. This is not implemented yet.
OLAP: an abstraction for an Online Analytical Processing system. It will help to provide OLAP capabilities to the application. Druid and Kylin will be the available options. This is not implemented yet.
Raw: any of a number of datastores based on files; for example, HDFS or S3. Can be either an Input or an Output.
Producer: Producers are independent from pipegraphs. They ingest data from different sources and write data to a Kafka topic, after formatting it according to a the schema.

Services

Kafka

Kafka is the central element of this architecture blue print. Each topic must have an associated Avro schema. This enforces type consistency and is the first step towards reliable real time data quality, something we will work on in the next future. Avro has been chosen because more typed and descriptive than JSON and because of its compatibility with Spark and the Hadoop world in general. Kafka decouples the ingestion layer from the analysis one. This allows updating algorithms and models without impacting the ingestion layer, and vice versa.

Spark

Spark is the data engine powering WASP, and is used in two components: Streaming ETL and Batch ETL. It can also be used to provide a JDBC interface using Thrift server. WASP supports running Spark in three different ways:

embedded, using Spark's local mode, which is recommended for development only
on YARN, used when running with an existing Hadoop cluster
with Spark's standalone clustering (master + workers)

Akka

Akka is our middleware: each component of WASP is an actor and relies on a clustered Actor System. In this way each component can be a separate process, and even run on different machines, and we can handle fault tolerance in a trasparent way to the whole application. This is a general overview of the ActorSystem

MongoDB or Postgresql

MongoDB and Postgresql are used as the central repository for all configurations, ML models, and entities. It is fault tolerant and it simplifies the deployment in a distributed environment because each node just needs the MongoDB address to be ready to go.

Pluggable Datastore

WASP system is integrated with Elasticsearch, Solr, Kafka, HBase, Mongo, Jdbc Datasources and HDFS. All data stored inside the datastore is indexed and searchable via the specific query language of the datastore.

Using WASP

Setting up the development environment

WASP is written in Scala, and the build is managed with SBT.

The recommended development environment is Linux or MacOs; developing on Windows is certainly possible, but is not supported, sorry about that!

Before starting:

Install JDK between 8 and 17
Install SBT
Install Git

The steps to getting WASP up and running for development are pretty simple:

Clone this repository:

git clone https://github.com/agile-lab-dev/wasp.git
Run the unit tests:

sbt test

Leverage Wasp in your project

If you want to run a WASP based application you will need a multi-module structure similar to the whitelabel one:

whitelabel/
├── consumers-spark 
├── master
├── models
└── producers (*)

The dependencies between your project module and Wasp artifacts should be the following:

graph TD
  whitelabel-models:::white --> wasp-core
  whitelabel-producers:::white --> whitelabel-models
  whitelabel-producers:::white --> wasp-producers
  whitelabel-producers:::white --> wasp-repository
  whitelabel-producers:::white --> wasp-repository-*
  whitelabel-master:::white --> whitelabel-models
  whitelabel-master:::white --> wasp-master
  whitelabel-master:::white --> wasp-repository
  whitelabel-master:::white --> wasp-repository-*
  whitelabel-consumers-spark:::white --> whitelabel-models
  whitelabel-consumers-spark:::white --> wasp-consumers-spark
  wasp-consumers-spark -.-> spark
  wasp-consumers-spark -.-> hadoop
  wasp-core -.-> spark
  wasp-core -.-> hadoop
  classDef white fill:green

In green you see your modules while others are external dependencies. The dotted lines towards Hadoop and Spark mean that they are provided dependencies by Wasp that your project should treat as provided too and should install in the target environment, without packing them with your application.

Each Wasp actor system should be deployed as any Akka actor system, the only caveat is that consumers-spark Actor system needs also a special file named jars.list. Such file must contain a list of local dependencies that will be submitted as --jars in the programmatic Spark submit that is invoked by Wasp itself.

agile-lab-dev / wasp 3.2.0