Frameless is a Scala library for working with Spark using more expressive types. It consists of the following modules:
- frameless-dataset for a more strongly typed Dataset/DataFrame API
- frameless-ml for a more strongly typed Spark ML API based on frameless-dataset
- frameless-cats for using Spark's RDD API with cats
Note that Frameless is still getting off the ground, so breaking changes are likely for at least the next few versions.
The Frameless project and contributors support the Typelevel Code of Conduct and want all its associated channels (e.g. GitHub, Gitter) to be a safe and friendly environment for contributing and learning.
Versions and dependencies
Versions 0.5.x and 0.6.x have identical features. The first is compatible with Spark 2.2.1 and the second with 2.3.0.
The only dependency of the frameless-dataset module is shapeless 2.3.2. Depending on frameless-dataset therefore adds minimal overhead to your Spark application's jar. Only the frameless-cats module depends on cats and cats-effect, so if you prefer to work just with Datasets and not with RDDs, you may choose not to depend on frameless-cats.
Frameless intentionally does not have a compile dependency on Spark. In principle this lets you use any version of Frameless with any version of Spark. The versions listed above are simply the Spark versions we officially compile and test Frameless with; other versions will probably work as well.
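Because Frameless does not bring in Spark, your build has to declare Spark itself. A minimal build.sbt sketch, assuming Spark 2.3.0 (the version paired with Frameless 0.6.x above) and assuming that spark-core and spark-sql are the only Spark modules you need:

```scala
// build.sbt fragment (sketch): declare Spark explicitly alongside Frameless.
// Assumption: Spark 2.3.0, the version Frameless 0.6.x is tested with.
val sparkVersion = "2.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)
```

If your cluster already provides Spark at runtime, you may prefer to mark these two dependencies as "provided" so they stay out of your application jar.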
Frameless introduces a new Spark API, called TypedDataset. The benefits of using TypedDataset compared to the standard Spark Dataset API are as follows:
- Typesafe column referencing (e.g., no more runtime errors when accessing non-existing columns)
- Customizable, typesafe encoders (e.g., if a type does not have an encoder, it should not compile)
- Enhanced type signature for built-in functions (e.g., if you apply an arithmetic operation on a non-numeric column, you get a compilation error)
- Typesafe casting and projections
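The first and third benefits above can be sketched as follows. This is a minimal illustration rather than an excerpt from the Frameless documentation: the Person case class, the TypedColumnsSketch object, and the local SparkSession settings are assumptions made for the example, and it presumes that Frameless 0.6.x and Spark 2.3.0 are on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import frameless.TypedDataset
import frameless.syntax._ // brings the default Job-based execution into scope for .run()

// Hypothetical record type, used only for illustration.
case class Person(name: String, age: Int)

object TypedColumnsSketch {
  def run(): Seq[Int] = {
    // TypedDataset.create needs an implicit SparkSession in scope.
    implicit val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("frameless-sketch")
      .getOrCreate()

    val people: TypedDataset[Person] =
      TypedDataset.create(Seq(Person("Ada", 36), Person("Alan", 41)))

    // Columns are referenced by symbol and checked against Person's fields
    // at compile time; arithmetic is only available on numeric columns.
    val agesNextYear: TypedDataset[Int] = people.select(people('age) + 1)

    // people('salary)    // does not compile: Person has no field `salary`
    // people('name) * 2  // does not compile: `name` is not a numeric column

    val result = agesNextYear.collect().run()
    spark.stop()
    result
  }
}
```

With the plain Dataset API, the two commented-out lines would typically fail only at runtime, once Spark analyzes the query.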
For a detailed comparison of TypedDataset with Spark's Dataset API, see the documentation listed below.
- TypedDataset: Feature Overview
- Typed Spark ML
- Comparing TypedDatasets with Spark's Datasets
- Typed Encoders in Frameless
- Injection: Creating Custom Encoders
- Using Cats with RDDs
- Proof of Concept: TypedDataFrame
Frameless is compiled against Scala 2.11.x.
To use Frameless in your project add the following in your build.sbt file as needed:

    val framelessVersion = "0.6.1" // for Spark 2.3.0 or use 0.5.2 for Spark 2.2.1

    libraryDependencies ++= List(
      "org.typelevel" %% "frameless-dataset" % framelessVersion,
      "org.typelevel" %% "frameless-ml"      % framelessVersion,
      "org.typelevel" %% "frameless-cats"    % framelessVersion
    )
An easy way to bootstrap a Frameless sbt project:
- if you have Giter8 installed then simply:
    g8 imarios/frameless.g8
- with sbt >= 0.13.13:
    sbt new imarios/frameless.g8
Running sbt console inside your project will bring up a shell with Frameless and all its dependencies loaded (including Spark).
Feel free to message us on our Gitter channel with any issues or questions.
We require at least one sign-off (thumbs-up, +1, or similar) to merge pull requests. The current maintainers (people who can merge pull requests) are:
Frameless contains several property tests. To avoid OutOfMemoryErrors, we tune the default generator sizes. The following environment variables may be set to adjust the size of generated collections in the
Code is provided under the Apache 2.0 license available at http://opensource.org/licenses/Apache-2.0, as well as in the LICENSE file. This is the same license that Spark uses.