Arc is an opinionated framework for defining data pipelines which are predictable, repeatable and manageable.
An implementation of DBSCAN running on top of Apache Spark
Spark-based approximate nearest neighbor search using locality-sensitive hashing
C4E, a Scala/Spark library for local and distributed clustering.
Big Spatial Data Processing using Spark
Big Data Toolkit for the JVM
Scala library for scraping metadata from specified URLs (e.g. OpenGraph)
Use the Scala API to read/write data from different databases (HBase, MySQL, etc.).
dllib is a distributed deep learning library running on Apache Spark
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
General Vectorization Lib for Machine Learning Tools
Google BigQuery support for Spark, SQL, and DataFrames
Spark MLlib wrapper for the Snowball framework
A general Inference API based on two of the most popular Big Data processing engines: Apache Spark and Apache Flink
Spark package to "plug" holes in data using SQL based rules ⚡️ 🔌
Fork of dmlc/xgboost for RAPIDS + XGBoost integration
Run Spark calculations from Ammonite