-
databrickslabs/automl-toolkit 0.7.2
Toolkit for Apache Spark ML for Feature clean-up, feature Importance calculation suite, Information Gain selection, Distributed SMOTE, Model selection and training, Hyperparameter optimization and selection, Model interpretability.
Scala versions: 2.11 -
potix2/spark-google-spreadsheets 0.6.3
Google Spreadsheets datasource for SparkSQL and DataFrames
Scala versions: 2.11 -
uosdmlab/spark-nkp 0.3.3
Natural Korean Processor for Apache Spark
Scala versions: 2.11 -
cerndb/sparkplugins 0.3
Code and examples of how to write and deploy Apache Spark Plugins. Spark plugins allow running custom code on the executors as they are initialized. This also allows extending the Spark metrics system with user-provided monitoring probes.
Scala versions: 2.13 2.12 -
starlake-ai/starlake 1.3.0
Declarative text-based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
Scala versions: 2.13 2.12 -
locationtech-labs/geopyspark 0.3.0
GeoTrellis for PySpark
Scala versions: 2.11 -
coxautomotivedatasolutions/spark-distcp 0.2.5
A re-implementation of Hadoop DistCP in Apache Spark
Scala versions: 2.13 -
absaoss/hyperdrive 4.7.0
Extensible streaming ingestion pipeline on top of Apache Spark
Scala versions: 2.12 2.11 -
tharwaninitin/etlflow 1.7.3
EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for running complex auditable workflows which can interact with Google Cloud Platform, AWS, Kubernetes, databases, SFTP servers, on-prem systems and more.
Scala versions: 3.x 2.13 2.12
Scala.js versions: 1.x -
zuinnote/spark-hadoopoffice-ds 1.7.0
A Spark datasource for the HadoopOffice library
Scala versions: 2.13 2.12 2.11 -
heartsavior/spark-sql-kafka-offset-committer 0.2.0
Kafka offset committer for structured streaming query
Scala versions: 2.12 2.11