This library is an ongoing effort to enable data exchange between Java/Scala and Python. PyJava uses Apache Arrow as the data exchange format.
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
XML data source for Spark SQL and DataFrames
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.
Distributed Matrix Library
A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm.
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
Spark library for easy MongoDB access
Custom state store providers for Apache Spark
Spark connector for SFTP
Infinispan Spark Connector
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Comet Data Pipeline is a Spark-based on-premise and cloud ingestion framework for batch and (coming) streaming systems
Scala Library/REPL for Machine Learning Research
Arc is an opinionated framework for defining data pipelines which are predictable, repeatable and manageable.
Apache Spark testing helpers (dependency-free; works with ScalaTest, uTest, and MUnit)
ETL Library for Machine Learning - data pipelines, data munging and wrangling
Connect Spark to HBase for reading and writing data with ease
Easy access to big things. A library that extends and improves the capabilities of Apache Spark
Spark RAPIDS plugin - accelerate Apache Spark with GPUs