Introducing Gallia: a Scala library for data transformation

by Anthony Cros (2021)

Introduction

Gallia is a Scala library for generic data transformation whose main goals are:

Practicality
Readability
Scalability (optionally)

Execution happens in two phases, each traversing a dedicated execution DAG:

An initial meta phase that ignores the data entirely and ensures that transformation steps are consistent (schema-wise)
A subsequent data phase where the data is actually transformed
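The two phases can be pictured with a minimal plain-Scala sketch. It is not Gallia's implementation: `TwoPhaseSketch`, `FieldType`, and `Step` are hypothetical names introduced for illustration, and the plan is a simple list of steps rather than a DAG.

```scala
// Hypothetical sketch, not Gallia's internals: a linear plan (not a DAG)
// whose steps are first checked against the schema, then applied to the data.
object TwoPhaseSketch {
  sealed trait FieldType
  case object Str  extends FieldType
  case object Num  extends FieldType
  case object Bool extends FieldType

  type Schema = Map[String, FieldType]
  type Entity = Map[String, Any]

  // Each step knows how to update the schema (meta) and the entity (data).
  final case class Step(
      name: String,
      meta: Schema => Either[String, Schema],
      data: Entity => Entity)

  def remove(key: String): Step = Step(
    s"remove($key)",
    s => if (s.contains(key)) Right(s - key) else Left(s"NoSuchField: $key"),
    e => e - key)

  def increment(key: String): Step = Step(
    s"increment($key)",
    s => s.get(key) match {
      case Some(Num)   => Right(s)
      case Some(other) => Left(s"TypeMismatch ($other, expected Num): $key")
      case None        => Left(s"NoSuchField: $key")
    },
    e => e.updated(key, e(key).asInstanceOf[Int] + 1))

  // Meta phase: only schemas flow through the steps; stops at the first error.
  def metaPhase(in: Schema, plan: Seq[Step]): Either[String, Schema] =
    plan.foldLeft[Either[String, Schema]](Right(in)) { (acc, step) =>
      acc.flatMap(step.meta)
    }

  // Data phase: reached only if the meta phase succeeded.
  def run(in: Schema, plan: Seq[Step], entity: Entity): Either[String, Entity] =
    metaPhase(in, plan).map(_ => plan.foldLeft(entity)((e, step) => step.data(e)))
}
```

With a schema of `bar: Num, baz: Bool`, a plan containing `increment("baz")` is rejected by `metaPhase` before any entity is touched, which is the behavior the meta phase is meant to provide.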

See introductory articles in Towards Data Science: Introduction and Follow-up. The rest of this README serves as temporary documentation. More thorough discussions of design choices/limitations/direction will come as subsequent article(s).

Preliminary notes:

Some links lead to documentation that is still to be written.
The examples use JSON because of its ubiquity as a notation, and despite its flaws

Dependencies

The library is available for Scala 2.12, 2.13, and 3.3.1

Include the following in your build.sbt file:

libraryDependencies += "io.github.galliaproject" %% "gallia-core" % "0.6.1"

The client code then requires the following import:

import gallia._

One can also optionally add the following import for general utilities:

// our open-source utilities library,
//   see https://github.com/aptusproject/aptus-core
import aptus._

Preliminary examples

While Gallia shines with (and makes most sense for) complex data processing such as this one (dbNSFP), it can also cater to the more trivial cases such as the ones presented below as an introduction. The same paradigm can therefore handle all (most) of your data manipulation needs.

Process individual entity

"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
  .read() // will infer schema if none is provided

    // uppercase string value for field "foo" ("hello" -> "HELLO")
    .toUpperCase('foo)

    // increment integer value for field "bar" (1 -> 2)
    .increment('bar)

    // remove field "qux" (irrespective of field type)
    .remove('qux)

    // nest (boolean) field "baz" under (new) field "parent"
    .nest('baz).under('parent)

    // flip boolean value of field "baz" (now nested under "parent")
    .flip('parent |> 'baz)

  .printJson()
  // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}

It is very important to note that the schema is maintained throughout operations, so you will get an error if you try for example to square a boolean:

"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
  .read()
      .toUpperCase('foo)
      .increment  ('bar)
      .remove     ('qux)
      .nest       ('baz).under('parent)
      .square     ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
    .printJson()
    // ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz

Notes:

This error occurs prior to the actual data run, and no data is therefore processed (potential schema inference aside)
The error mechanism works at any level of nesting/multiplicity
Of course, some errors cannot be caught until the data is actually seen (e.g. IndexOutOfBounds types of checks)

Process collection of entities

// INPUT:
//    {"first": "John", "last": "Johnson", "DOB": "1986-02-04", ...}\n
//    {"first": "Kate", ...
"/data/protopeople.jsonl.gz"
  .stream() // vs .read() for single entity

    .generate('username).from(_.string('first), _.string('last))
      .using { (f, l) => s"${f.head}${l}".toLowerCase } // -> "jjohnson"
    .toUpperCase('last)
    .fuse('first, 'last).as('name).using(_ + " " + _)
    .transformString('DOB ~> 'age).using(
        _.toLocalDateFromIso.getYear.pipe(2021 - _))

  .write(
    "/tmp/people.jsonl.gz")
    // OUTPUT:
    //  {"username": "jjohnson", "name": "John JOHNSON", "age": 35, ...}\n
    //  {"username": ...

Notes:

JSONL = one JSON document per line
This example makes use of:
- .pipe() from scala.util.chaining
- .toLocalDateFromIso() from our import aptus._ above (see docs)
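The two functions passed to `.using` above are ordinary Scala and can be tried on their own. The sketch below substitutes the standard `java.time.LocalDate.parse` for aptus's `.toLocalDateFromIso` (assuming equivalent behavior for ISO dates), and requires Scala 2.13+ for `scala.util.chaining`. `PeopleFunctions`, `username`, and `age` are hypothetical names.

```scala
import java.time.LocalDate
import scala.util.chaining._

object PeopleFunctions {
  // First letter of the first name followed by the last name, lowercased.
  def username(first: String, last: String): String =
    s"${first.head}${last}".toLowerCase

  // Year-only difference, as in the example; it does not account for
  // whether the birthday has already occurred in the current year.
  def age(isoDob: String, currentYear: Int = 2021): Int =
    LocalDate.parse(isoDob).getYear.pipe(currentYear - _)
}
```

For the first input row, `PeopleFunctions.username("John", "Johnson")` gives "jjohnson" and `PeopleFunctions.age("1986-02-04")` gives 35 (2021 - 1986).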

Process CSV/TSV files

"/data/some.tsv.gz"
  .stream()
    .retain('_id, 'age, 'gender)
    .groupBy('age)
  // ...

See more in inputs below.

Basics

Key referencing

Keys can be referenced as Scala's String, Enumeration, and enumeratum.Enum:

"""{"foo": 1}"""
  .read().rename("foo" ~> 'FOO)
  // OUTPUT: {"FOO":1}

"""{"Very Poor Key Choice  ":
    "please_stop_using_spaces_and_unnecessary_uppercasing_in_keys"}"""
  .read()
    .rename("Very Poor Key Choice  " ~> 'much_better)
    .transformString('much_better).using(_ => "isn't it?")
  // OUTPUT: {"much_better": "isn't it?"}

Target selection (keys/paths)

Applicable for both .read() and .stream() (one vs multiple entities)

// INPUT: {"foo": "hello", "bar": 1, "baz": true, "qux": "world"}
data.retain(_.firstKey) // {"foo": "hello"}

data.retain(_.allBut('qux))      //{"foo": "hello", "bar": 1, "baz": true}
data.retain(_.customKeys(_.tail))//{"bar": 1, "baz": true, "qux": "world"}

Generalization of target selection

Likewise applicable for both .read() and .stream()

val obj = """{"foo": "hi", "bar": 1, "baz": true, "qux": "you"}""".read()

// can't use "then" (reserved in scala)
obj.forKey    ('foo)      .thn(_ toUpperCase _) // { "foo": "HI", ...
obj.forEachKey('foo)      .thn(_ toUpperCase _)
obj.forEachKey('foo, 'bar).thn(_ toUpperCase _)

obj.forAllKeys((o, k) => o.rename(k).using(_.toUpperCase)) //{"FOO":"hi",..
// ... likewise with forPath, forEachPath, forAllPaths, forLeafPaths, ...

Nested data selection

Paths can be referenced conveniently via the "pipe+greater-than" (|>) notation:

"""{"parent": {"foo": "bar"}}""".read()
  .toUpperCase('parent |> 'foo)
  // OUTPUT: {"parent":{"foo":"BAR"}}

Notes:
- A key is just a trivial path.
- Gallia can generally apply transformations irrespective of multiplicity, as long as they still make sense:

"""{"parent": {"foo": ["bar", "baz"]}}""".read()
  .toUpperCase('parent |> 'foo)
  // OUTPUT: {"parent":{"foo":["BAR", "BAZ"]}}

Renaming keys

Renaming can be expressed conveniently via the "tilde+greater-than" (~>) notation:

           """{"foo": "bar"}""" .read().rename           ('foo ~> 'FOO)
"""{"parent": {"foo": "bar"}}""".read().rename('parent |> 'foo ~> 'FOO)
// OUTPUT: (respectively)
//             {"FOO":"bar"}
//   {"parent":{"FOO":"bar"}}

A case could be made that rekey would be more appropriate than rename, but it feels rather unnatural.

Renaming keys "while-at-it"

"""{"foo": 1}""".read()
  .increment('foo ~> 'FOO)
  // OUTPUT: {"FOO":2} - value is incremented and key is uppercased

Note that the following is functionally equivalent:

"""{"foo": 1}""".read()
  .increment('foo)
  .rename   ('foo ~> 'FOO)

Single vs Multiple entities

Gallia does not necessarily expect its elements ("entities") to come in multiples; it is capable of processing them individually.

Example of going from one to the other, then back:

"""{"foo": "bar"}""".read()
    .convertToMultiple // now     [{"foo": "bar"}]
    .head              // back to  {"foo": "bar"}

In a nested context:

"""[{"foo": "bar1"}, {"foo": "bar2"}]""".stream()
  .asArray1        //  {"foo":["bar1","bar2"]}
  .flattenBy('foo) // [{"foo": "bar1"}, {"foo": "bar2"}] (original array)

There are other ways to go back and forth between the two (e.g. reducing as shown below).

Internally, all entity-wise operations on "streams" are actually just implicit MAP-pings, so that the following two expressions are equivalent:

"""[{"foo": "bar1"}, {"foo": "bar2"}]""".stream()      .toUpperCase('foo)
"""[{"foo": "bar1"}, {"foo": "bar2"}]""".stream().map(_.toUpperCase('foo))

DAG Heads

The Head type models a leaf in the DAG(s) that underlie the execution plan.

Internally, heads come in three flavors, each offering a different and relevant subset of operations:

HeadO: For single O-bject manipulation
HeadS: For multiple object-S manipulation
HeadV[T]: For "naked" V-alues manipulation (HeadV is rarely encountered explicitly in client code)

Notes:

"Naked" values are more conceptually relevant to nested subgraphs, and are not commonly manipulated by client code. They represent values that are not part of a structured entity, e.g. the string "foo" alone as opposed to the same string "foo" within an entity {"key1": 1, "key2": "foo", ...}.
The DAGs/heads concepts will be discussed in more detail in a future article dedicated to design.

SQL-like querying

people
  // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...

    /* 1. WHERE            */ .filterBy('age).matches(_ < 25)
    /* 2. SELECT           */ .retain('name, 'age)
    /* 3. GROUP BY + COUNT */ .countBy('age)

  // OUTPUT: [{"age": 21, "_count": 10}, {"age": 22, ...

WHERE clause: alternatively written as filterBy(_.int('age)).matches(_ < 25) if you need more than the basic operations (=, <, >, +, ...); see explicit types
SELECT clause: this would actually be redundant since the subsequent GROUP BY step also retains those fields implicitly
GROUP BY + COUNT: if unspecified, uses default _count output field
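For readers more familiar with the Scala standard library, the same query can be sketched on an in-memory collection. This is only a point of comparison, not how Gallia executes it; `SqlLikeSketch`, the `Person` case class, and the explicit sort are assumptions made for illustration.

```scala
object SqlLikeSketch {
  final case class Person(name: String, age: Int, city: String)

  // WHERE age < 25, then GROUP BY age with a count per group, sorted by age.
  // Each pair is (age, count), the counterpart of {"age": ..., "_count": ...}.
  def countByAge(people: Seq[Person]): Seq[(Int, Int)] =
    people
      .filter(_.age < 25)
      .groupBy(_.age)
      .map { case (age, group) => age -> group.size }
      .toSeq
      .sortBy(_._1)
}
```

Unlike the Gallia version, nothing here checks ahead of time that `age` exists and is numeric in the input data; the case class has to be written by hand to get that guarantee.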

Reduction

people.reduceWithMean('age)      // {"age":21.5}
people.reduce('age).wit(_.stdev) // {"age":1.118[...]}

More powerfully:

people
  .reduce(
      'age .aggregates(_.mean, _.stdev),
      'city.count_distinct)
  // OUTPUT: {"age":{"_mean":21.5,"_stdev":1.118[...]},"city":3}
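As a sanity check on the figures above, here is a plain-Scala sketch of the two statistics. It assumes the population standard deviation (dividing by n), which is what reproduces the 1.118 value for the hypothetical ages 20, 21, 22, 23; the actual input ages and the exact formula used by Gallia's `stdev` are not stated in this document.

```scala
object StatsSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population standard deviation (divides by n, not n - 1).
  def stdev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }
}
```

With `Seq(20.0, 21.0, 22.0, 23.0)`, `mean` gives 21.5 and `stdev` gives the square root of 1.25, roughly 1.118.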

Aggregations

people.group('name).by('city)

// "GROUP all keys but the last key BY that last key"
people
  .group(_.initKeys)
    .by(_.lastKey)
      .as('grouped) // would use '_group if unspecified
  //OUTPUT: [
  // [{"gender":"male","grouped":[{"name":"John","age":21,"city":"Toronto"},
  //     ... ]

// other count types available:
//   distinct, present, missing and distinct+present
people.count('name).by('city)

people.sum  ('age).by('city) // likewise mean, stdev, ...
people.stats('age).by('city) // descriptive statistics (minimal for now)
  // OUTPUT: [ {"city":"Toronto","_stats":{"mean":21.0, ...

A more "custom" aggregation (nonsensical):

people
  .groupBy('city)
  .transformGroupEntitiesUsing {
    _.squash(_.string('name), _.int('age))
      // random nonsensical aggregation for demonstration purpose only
      .using(_.map { case (n, a) => n.size + a }.sum) }
  .rename(_group ~> 'awesomeness)
  // OUTPUT:
  //  [{"city":"Toronto"     , "awesomeness":25},
  //   {"city":"Philadelphia", "awesomeness":24},
  //   {"city":"Lyon"        , "awesomeness":53}, ... ]

Pivoting

people
  .pivot(_.int('age)).usingMean
    .rows   ('city)
    .column ('gender)
      // having to provide those is an unfortunate consequence of
      // maintaining a schema (these values are only known at runtime)
      .asNewKeys('male, 'female)
  // OUTPUT:
  //  [ {"city":"Toronto","male":21},
  //    {"city":"Toronto","female":20},
  //    {"city":"Lyon","male":22.5},     ...]

Note that unpivoting isn't available yet, but it is scheduled.

Renesting Tables

Common prefixes can be leveraged for re-nesting, e.g. "contact_" below:

// INPUT: "name<TAB>contact_phone<TAB>contact_address<TAB>..."
//                  ^^^^^^^           ^^^^^^^
table
  .renest(_.allKeys)
    .usingSeparator("_")
    // OUTPUT: "{"name":"John", "contact":{"phone": 1234567, "address":..
    //                           ^^^^^^^

This mechanism is not limited to a single level; it can transform keys such as:

foo_bar_baz1<TAB>foo_bar_baz2<TAB>...

into

{"foo": {"bar": {"baz1": ..., "baz2": ...}}, ...}
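The key transformation itself can be sketched in plain Scala over a `Map[String, Any]`. This is an illustration of the idea only, not Gallia's `renest`: `RenestSketch` is a hypothetical name, field order is not preserved, and the sketch assumes no key is both a leaf and a prefix of another key.

```scala
import java.util.regex.Pattern

object RenestSketch {
  // Turns Map("foo_bar_baz1" -> 1, ...) into nested maps by splitting keys
  // on the separator, one level per recursive call.
  def renest(flat: Map[String, Any], separator: String): Map[String, Any] = {
    val quoted = Pattern.quote(separator) // treat the separator literally
    flat
      .toSeq
      .map { case (key, value) => key.split(quoted).toList -> value }
      .groupBy { case (path, _) => path.head }
      .map { case (head, entries) =>
        entries match {
          // a single entry with no remaining path segments is a leaf
          case Seq((_ :: Nil, value)) => head -> value
          // otherwise strip the common head and recurse on the remainder
          case _ =>
            val children =
              entries.map { case (path, value) => path.tail.mkString(separator) -> value }.toMap
            head -> renest(children, separator)
        }
      }
  }
}
```

For instance, keys `contact_phone` and `contact_address` end up under a single `contact` entry, and `foo_bar_baz1` / `foo_bar_baz2` end up two levels deep under `foo` then `bar`.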

In practice the renesting operation typically involves a lot more work, e.g. if a value is like "foo1,foo2,foo3", it may also need to be split and denormalized on a one-per-row basis. It is also common to encounter values such as "John:32|Kate:33|Jean:34" or combinations of values such as "John|Kate|Jean" + "32|33|34" (the latter two actually sharing the same cardinality of elements pipe-wise). This alone would deserve its own article, but in the meantime the DbNsfp example highlights a number of interesting such cases.

The opposite operation (flattening to table) is scheduled.

IO

Input

.read() (single entity) and .stream() (multiple entities) guess as much about the input format as they can from the input String provided:

JSON markers, e.g. {, [, ...
extensions, e.g. .json, .tsv, .gz, ...
URI schemes, e.g. file://, http://, jdbc://, ...
...

We will see later an example of how to override the default behavior for reading and writing.

Here are some examples of input consumption:

// will infer schema (costly timewise)
"/some/local/file.json" .read  ()
"/some/local/file.jsonl".stream()

// providing schema
"/some/local/file.json" .read  [MyCaseClass]
"/some/local/file.jsonl".stream[MyCaseClass]

// equivalently
"/some/local/file.json" .read  ('foo.string, 'baz.int)
"/some/local/file.jsonl".stream('foo.string, 'baz.int)

       "/some/local/file.jsonl".stream()
"file:///some/local/file.jsonl".stream()

 "http://someserver/test.jsonl".stream()
"https://someserver/test.jsonl".stream()

"ftp://someserver/pub/foo/bar.tsv".stream()

// must make corresponding JDBC driver jar available
"jdbc:myfavdb://localhost:1234/test?user=root&password=root"
  .stream(_.allFrom("TABLE1"))

"jdbc:myfavdb://localhost:1234/test?user=root&password=root"
  .stream(_.query("SELECT * from TABLE1"))

(conn: java.sql.Connection)       .stream(_.sql("SELECT * from TABLE1"))
(ps:   java.sql.PreparedStatement).stream()

// requires gallia-mongodb module and import gallia.mongodb._
//   (see https://github.com/galliaproject/gallia-mongodb)
"mongodb://localhost:27017/test.coll1".stream()
"mongodb://localhost:27017/test"      .stream(_.query("""{"find":"coll1"}"""))

Tables

Considering the following TSV file:

$ cat /data/some.tsv | column -nt
f1  f2  f3   f4     f5     f6  f7     f8
z   1   1.1  true   9,8,7  k   d,e,f  T
y   2   2.2  false  6,5,4

And the following call:

"/data/some.tsv".stream()

// or its explicit equivalent
"/data/some.tsv".stream(_.tsv.inferSchema)

The following schema and data will be inferred and ingested:

val schema =
  cls(
      'f1.string,  'f2.int     , 'f3.double, 'f4.boolean, 'f5.ints,
      'f6.string_, 'f7.strings_, 'f8.boolean_)

val data =
 Seq(
  obj('f1 -> "z", 'f2 -> 1, 'f3 -> 1.1, 'f4 -> true , 'f5 -> Seq(9, 8, 7),
        'f6 -> "k", 'f7 -> Seq("d", "e", "f"), 'f8 -> true),
  obj('f1 -> "y", 'f2 -> 2, 'f3 -> 2.2, 'f4 -> false, 'f5 -> Seq(6, 5, 4)))

Note that _ here stands for ?, meaning optional. For instance 'f7.strings_ would be represented as Option[Seq[String]] in Scala.
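To make the optional markers concrete, the inferred schema above corresponds to the following plain Scala types. The `SomeRow` case class is a hypothetical name, written by hand here for illustration; it is not something Gallia generates.

```scala
object TableSchemaSketch {
  // Required fields map to plain types; fields marked with a trailing
  // underscore in the schema map to Option.
  final case class SomeRow(
      f1: String,
      f2: Int,
      f3: Double,
      f4: Boolean,
      f5: Seq[Int],            // 'f5.ints
      f6: Option[String],      // 'f6.string_
      f7: Option[Seq[String]], // 'f7.strings_
      f8: Option[Boolean])     // 'f8.boolean_

  // The two rows of the TSV file; the second row has no f6, f7, or f8.
  val rows: Seq[SomeRow] = Seq(
    SomeRow("z", 1, 1.1, true , Seq(9, 8, 7), Some("k"), Some(Seq("d", "e", "f")), Some(true)),
    SomeRow("y", 2, 2.2, false, Seq(6, 5, 4), None     , None                    , None))
}
```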

Apache Avro

Avro read/write support was added in 0.4.0, see CHANGELOG.md#avro

Apache Parquet

Likewise, Parquet read/write support was added in 0.4.0, see CHANGELOG.md#parquet

Additional sources/destinations

Additional modules using a similar paradigm will be added in the future, e.g.:

// NEO4J
"neo4j+s://demo.neo4jlabs.com".stream(
    _.query("""(:Person {name: string})
        -[:ACTED_IN {roles: [string]}]
          ->(:Movie {title: string, released: number})"""))

// Sparql
"http://www.disease-ontology.org?query=".stream(
    _.query("""
      SELECT DISTINCT *
      WHERE {?s <http://www.w3.org/2000/01/rdf-schema#label> "common cold"}
      LIMIT 3"""))

// GraphQL
"https://swapi.com/graphql".stream(
    _.query(
        """{user (id: 1) { firstname } }"""))

// Excel (if sheet contains a single table)
"/data/doc.xlsx".stream(_.allFrom("Some Sheet Name"))

// XML
"/data/doc.xml".stream() // Requires costly schema inferring first

Note: There are proofs of concept for the last two (Excel and XML).

Output

Output works in a similar fashion, relying on extensions/URI schemes as much as possible:

modifiedPeople.write("/tmp/output/result.tsv")
modifiedPeople.write("/tmp/output/result.jsonl.bz2")

// these are not actually implemented for mongo yet (only reading is):
modifiedPeople.write("mongodb://localhost:27017/test.coll1")
modifiedPeople.write(
    uri       = "mongodb://localhost:27017/test",
    container = "coll1")

modifiedPeople.write(
  uri       = "jdbc:myfavdb://localhost:1234/test?user=foo&password=bar",
  container = "SOME_RESULT_TABLE")

Scaling

Spark RDDs

See Apache Spark's RDD documentation.

This module requires

libraryDependencies += "org.gallia" %% "gallia-spark" % "0.6.1"

And the following import:

import gallia.spark._

Abstraction:

The main abstraction in Gallia for top-level multiplicity is data.multiple.streamer.Streamer[T], which is then wrapped by the data.single.Obj-aware counterpart data.multiple.Objs (wraps a Streamer[Obj]). It currently comes in three flavors, all also under data.multiple.streamer:

ViewStreamer: default
IteratorStreamer: enabled via .stream(_.iteratorMode)
RddStreamer: enabled via usage of a SparkContext if gallia.spark._ has been imported

Example:

See Spark used in action in this repo

Bypassing abstraction:

You can modify the underlying RDD (think Law of Leaky Abstractions) via .rdd(), e.g.:

data
  // ...
  // can by-pass abstraction when needed,
  //   though schema is not allowed to change
  //   (which cannot be enforced)
  .rdd { _.coalesce(1).cache }
  // ...

Poor man's scaling ("spilling")

This may be useful to the average scientist, who may have access to powerful machines (think qsub) but not to conveniently provisioned clusters. Sadly, this is a very common occurrence in research settings, and the author cares deeply about this problem.

"/data/huge.tsv.bz2"
  // uses a GNU sort-based approach to sorting/grouping/joining
  .stream(_.iteratorMode)
    .rename('gene).to('hugo_symbol)
    .groupBy('mutation_id).as('genes)
    // ...

Notes:

All wide transformations can be written in terms of an external sort such as GNU sort
We can combine such operations and leverage pipes to ensure the execution tree is executed lazily (forking however would benefit from a form of checkpointing)
GNU sort is favored for now because replacing it would constitute a significant endeavour, and even then it would be extremely hard to beat performance-wise
Ideally this would be an alternative run mode for Spark itself
The current implementation can be seen in action in the GeneMania processing sub-project
This feature is only partially implemented. It's basically enabled via the _.stream(_.iteratorMode.[...]) call, and follows this type of invocation path: Streamer.groupByKey -> Iterator's -> utility -> GNU sort wrapper
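The first note can be illustrated with a plain-Scala sketch of sort-based grouping: once rows are sorted by key, each group is contiguous and can be emitted one at a time from an iterator. An in-memory `sortBy` stands in for the external GNU sort in the usage below, and `SortBasedGrouping` and `groupSorted` are hypothetical names, not Gallia's.

```scala
object SortBasedGrouping {
  // Groups consecutive rows sharing a key; assumes the input is sorted by key.
  // Only one group is held in memory at a time.
  def groupSorted[K, V](sorted: Iterator[(K, V)]): Iterator[(K, Seq[V])] =
    new Iterator[(K, Seq[V])] {
      private val rows = sorted.buffered

      def hasNext: Boolean = rows.hasNext

      def next(): (K, Seq[V]) = {
        val key    = rows.head._1
        val values = Vector.newBuilder[V]
        // consume rows until the key changes or the input is exhausted
        while (rows.hasNext && rows.head._1 == key) values += rows.next()._2
        key -> values.result()
      }
    }
}
```

Usage would look like `SortBasedGrouping.groupSorted(rows.sortBy(_._1).iterator)`, where in the spilling scenario the sorted iterator would instead be read from the output of an external sort process.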

Explicit types

Let's revisit the SQL-like example. Note that the Whatever type placeholder is being used (basically an Any wrapper that accepts very basic operations such as +, <, etc.)

// the following two expressions are equivalent:
//
//          omitting type implies the use of Whatever here  and here
//         v                 v                            v          v
z.fuse(         'first ,          'last ).as('name).using(_  + " " + _)
z.fuse(_.string('first), _.string('last)).as('name).using(_  + " " + _)
//                                                        ^          ^
//                                                         vs strings

A more disciplined and powerful approach than relying on Whatever is to be explicit, which gives access to all the corresponding type's operations:

z.fuse(_.string('first), _.string('last)).as('name)
   // .head and .toUpperCase require knowledge of the exact type (String here)
   .using { (f, n) => s"${f.head}${n.toUpperCase}" }

More types than the currently supported BasicTypes will be added in the future

Schema (metadata)

Gallia is "schema-aware", meaning it keeps track of schema changes for every step. This allows the library to detect many errors prior to seeing the actual data.

As we've seen before, there are multiple ways to explicitly provide the data's underlying schema. This saves the library the task of looping over the data first to "infer" said schema.

1. By using a case class

case class Foo(foo: String, bar: Int, baz: Boolean, qux: String)

"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}""".read[Foo]

2. By providing it "manually"

"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
  // underscore means optional (since can't conveniently use '?' in Scala)
  .read('foo.string, 'bar.int, 'baz.boolean, 'qux.string, 'corge.string_)

3. By providing an external resource that contains a JSON-serialized version of the schema

"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
    .read("/meta/myschema.json")

Where "/meta/myschema.json" contains: {"fields":[{"key":"foo","info":...

More interactions with case classes are available (e.g. in transformations); they will be detailed in a future article.

Note: Gallia schemas are mostly meant to be descriptive, but they can be prescriptive in the case of looser formats such as JSON or {T,C}SV files. For instance a field defined as an _Int in a schema describing a numerical JSON entry will be interpreted as an _Int instead of a _Double (as would be expected from the JSON specification).

Macros

See dedicated repo, which contains examples

Full blown example

I am providing a link to one of the full-blown examples I've written using Gallia: turning the big dbNSFP tables into a corresponding nested structure more conducive to querying (mongodb, elasticsearch, ...). See the example input row and example output entity.

It is in no way complete or 100% correct in its current form, as it is primarily designed to showcase Gallia. I only tested it on a small subset of the data, and I expect unfortunate surprises would arise from processing the entire dataset.

It showcases among other things how to turn a long String full of extractable information, e.g:

"Loss of ubiquitination at K551 (P = 0.0092); Loss of methylation [...]"

Into a more parseable object:

[
  {"type":"loss", "change_type":"ubiquitination",
     "location":"K551", "p_value":0.0092 },
  {"type":"loss", "change_type":"methylation",
      ... },
  ...
]

Via an intermediate Scala case class (which contains most of the transformation logic):

// ...
.transformString(top_5_features).using(MutPred.apply)
// ...

Processing this kind of data is exactly why I designed the library in the first place. I believe a lot of useful knowledge can be unlocked by making this kind of resource more parseable (DbNsfp itself is an incredibly useful resource in terms of content). The field of bioinformatics in particular is laden with archaic technologies and practices, which in turn results in tons of lost opportunities for impactful medical discoveries. I have never dealt with it personally, but I imagine the likes of computational physics and other "computational-driven" disciplines probably suffer from similar problems.

List of concrete examples

Trivial examples:
- Word Count example, the "hello world" of big data
- Count by word length example
SQL-like:
- Northwind queries: coming soon
Web application server logic:
- cbioportal's "studies summary" API call: reproducing response to obtain a summary of all studies for cbioportal, arguably the most commonly used web portal for cancer data. This is the first API call made upon loading the portal's main page, and it is specified on their swagger page. See dedicated page for the code.
Reproducing random examples encountered in articles on data manipulation:
- Notebooks for Databricks articles (Spark) Datasets tutorial and Complex nested structures
- TPC-DS Sales summary example query as discussed in Andrew Ray's Databricks post: "Reshaping Data with Pivot in Apache Spark" (February 2016)
- data manipulation task for the Cars93 dataset (R MASS package), as discussed in Darren Wilkinson's blog post: "Data frames and tables in Scala" (August 2015)
- Eurostat census data example queries as discussed in Mathijs Vogelzang's Medium article: "Doing cool data science in Java: how 3 DataFrame libraries stack up" (September 2018)
- Football premier league data manipulations as discussed in Chloe Connor's Towards Data Science article: "Stop using Pandas and start using Spark with Scala" (June 2020)
Bioinformatics examples
- re-processing clinvar VCF file
- re-processing SnpEff output
- re-processing dbNSFP table example from section just above
- re-processing GeneMania TSV files; uses the poor man's scaling approach (spilling)
- re-processing rare disease LOVD data (from EDS Variant Database)
Physics examples
- ENSDF data (WIP)
- WIP (see forum question)
Spark-powered:
- GeneMania TSV files via Spark RDDs
(more coming soon)

Strengths

Gallia's main strengths can be summed up like so:

Offers a one-stop shop paradigm for most or all data transformation needs within one's application.
The most common/useful data operations are provided, or at least scheduled.
Readable DSL that domain experts should be able to at least partially comprehend.
Scaling is not an afterthought and Spark RDDs can be leveraged when required.
Meta-awareness, meaning inconsistent transformations are rejected whenever possible (for instance, cannot use a field that's been removed already).
Can process individual entities, not just collections thereof; that is, there's no need to create "dummy" collections of one entity in order to operate on that entity.
Can process nested entities of any multiplicity in a natural way.
Macros are available for a smooth integration with case class hierarchies.
Provides flexible target selection - i.e. which field(s) to act on - which ranges from explicit reference to actual queries, including when nesting is involved.
The execution DAG is sufficiently abstracted that its optimization is a well-separated concern (e.g. predicate pushdowns, pruning, ...); note however, that few such optimizations are in place at the moment.

FAQ

Is this ready for production?

Not even remotely. There are known bugs, blatantly missing features, a lot of missing validation, and most importantly it performs rather slowly at the moment. There is a lot planned in the way of addressing these issues, but it will require more resources than the author working alone. In particular, performance has a prominent place in the task list.

How can I help?

I'm already aware of many issues and have a long list of tasks meant to address them, as well as add the features that are critically missing. As a result the most useful thing one can do to help at the moment is simply letting me know if this is an effort worth pursuing. Once a definitive license is chosen, code contributions will be more than welcome.

What are the biggest limitations by design?

At one point, a given field could only be of a single type, which ironically prevented Gallia from having its own metaschema specified in Gallia terms. This is no longer the case (see metaschema, made possible by (partial) union types). A more thorough discussion of design choices and trade-offs/limitations will come in a future article.

Another potential pitfall is that a missing value can only have one meaning. For instance [{"foo": null}, {"foo": []}, {}] would all collapse to the same absence of a value: {}. Note that overloading the various null/Nil mechanisms with alternative meanings is probably not great data modeling practice in the first place.
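A plain-Scala sketch of that collapsing rule, assuming entities are modeled as `Map[String, Any]` (a simplification; `MissingValueSketch` and `normalize` are hypothetical names, not Gallia functions):

```scala
object MissingValueSketch {
  // Drops entries whose value is null or an empty collection, so that all
  // representations of "nothing" end up as an absent key.
  def normalize(entity: Map[String, Any]): Map[String, Any] =
    entity.filter {
      case (_, null)                          => false
      case (_, xs: Iterable[_]) if xs.isEmpty => false
      case _                                  => true
    }
}
```

Under this rule the three inputs from the example above all normalize to the same empty entity, so any distinction the producer intended between them is lost.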

In what way is readability prioritized?

We aim to make the code as readable as possible (goal #2) whenever it doesn't affect practicality (goal #1). In particular we want to make it possible for domain experts - who may not be programmers - to understand at least superficially what is happening in each step. It is obviously not always feasible for the task at hand, but this is otherwise a major goal for the library.

What are good use cases for the library?

The main use cases that come to mind at this point are batch ETL, querying, feature engineering, internal application logic, and data validation and evolution. On the batch ETL front, it would be interesting to see how alternative libraries/languages take examples such as the dbNSFP one above. In particular, how would the various thresholds (readability/practicality/scalability) be shifted by a different choice.

What about features like streaming? EDA? visualization? linear algebra? graph queries? notebooks? metadata semantics? squaring the circle?

There are lots of features that could be added in the future, but they all require a pretty sturdy base first.

Note that the most important part of the library at this point is its client code interface. The internals could be entirely scrapped in the future, though barring a major design flaw it is more likely that they would be replaced in phases.

Why not more macros-based features?

I prototyped a lot with macros and they will play an important role in the future of Gallia.

They can also be tricky to deal with, and since they are scheduled for a major overhaul, I am reluctant to invest a lot of time on that front at the moment. I see them helping a lot in particular with boilerplate and some compile-time validation (e.g. key validation). The very initial plan was to leverage whitebox macros for every step, but I gave up on the idea pretty early on. I'd like to re-investigate it for a subset of features/use cases at some point however, especially since there seem to be some interesting projects (e.g. quill) that already make interesting use of them.

Where is the category theory?

I'm quite impressed with the likes of cats (-> great book) or shapeless but while I find them intellectually fascinating, I do side with the "blue sky" perspective when it comes to prioritizing practicality.

What about other programming languages?

Initially the idea was for this to be a language-agnostic DSL for data manipulation, with a reference implementation in Scala basically acting as specification. It may still become a reality, but I'd rather focus on maturing a Scala version first.

What is aptus?

"Aptus" is Latin for suitable, appropriate, fitting. It is our utility library to help smooth certain pain points of the Java/Scala ecosystem. It was originally included in Gallia for convenience, but is now externalized in its own repo (Apache 2 licensed).

Where are the tests?

They live in a different repo and are being introduced incrementally (unpublished ones need a lot of cleaning up). They basically take the following form:

aobj( // the "a" in aobj stands for "Annotated"
    cls('p   .cls_('f.string  , 'g.int ), 'z.boolean))(
    obj('p -> obj ('f -> "foo", 'g -> 1), 'z -> true) )
  .generate('h)
    .from(_.entity('p))
    .using {
        _ .translate('f ~> 'F).using("foo" -> "oof")
          .remove('g) }
  .check {
    aobj(
      cls('p   .cls_('f.string, 'g.int   ), 'z.boolean, 'h .cls_ ('F.string)))(
      obj('p -> obj ('f -> "foo", 'g -> 1), 'z -> true, 'h -> obj('F -> "oof")) ) }

Where check wraps an equality assertion. I have not settled on a definitive testing library yet, though I am considering utest at this point.

Why so few comments, especially scaladoc?

I try to leverage the language constructs as much as possible, e.g. by naming variables and methods so they convey semantics as much as possible. I then add the occasional comment when I deem it necessary, but overall expect any contributor to be sufficiently familiar with Scala to understand what's going on. As the project matures, proper scaladoc-friendly comments can hopefully be added as well.

Why does the terminology sometimes sound funny or full-on neological?

Naming things is hard. Sometimes I give up and favor an alternative until a better idea comes along. Sometimes a temporary name just sticks around, by way of organic growth. More generally I'd like to create an OWL ontology to more formally define terms that may deserve it.

What's with the IDs that look like timestamps and pop up everywhere (e.g. `210121162536`)?

They're my quick-and-dirty mechanism for ID-ing elements, and are generated by combining the date command along with xautomation, called via xbindkeys keyboard shortcuts. When they represent a task, it allows me to ID the task temporarily. Many small tasks will never see an actual issue tracking system ID assigned to them. Note that the timestamp itself is never guaranteed to be meaningful, as I occasionally hack them around (for consolidation purposes for instance).

Where does the name "Gallia" come from?

Gallia is the name of a Romano-Gallic goddess. It is also the Latin name for Gaul, the area the author is originally from.

Rumor has it that the goddess Gallia appeared in 16 BCE to a group of data engineers gathered at a local tavern in Lugdunum (now Lyon), and that she told them to keepeth their code (1) practical, (2) readable, and (3) scalable (if needed), in that exact order.

Contact & Announcements

Contact: contact.galliaproject at gmail.com
Blog
Linked In
Twitter (@AnthonyCros) - for further announcements
Original announcement on the Scala Users list
