A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm.
Boilerplate framework for using Spark and ZIO together.
Infinispan Spark Connector
XML data source for Spark SQL and DataFrames
Spark library for easy MongoDB access
Scala Library/REPL for Machine Learning Research
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark workload metrics data.
Custom state store providers for Apache Spark
Spark connector for SFTP
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
ETL Library for Machine Learning - data pipelines, data munging and wrangling
This library is an ongoing effort to enable data exchange between Java/Scala and Python. PyJava uses Apache Arrow as the data exchange format.
Connect Spark to HBase for reading and writing data with ease
SANSA Query Layer
Easy access to big things. A library for Apache Spark that extends and improves its capabilities
The official Riak Spark Connector for Apache Spark with Riak TS and Riak KV
Comet Data Pipeline is a Spark Ingestion Framework for Batch (Hadoop) & Streaming (Coming) Systems
This project generalizes the Spark MLlib Batch and Streaming K-Means clusterers in every practical way.
Approximate Nearest Neighbors in Spark