massie / avrotuples   0.3

Apache License 2.0 GitHub

Avro Scala helper classes: AvroTuple[1-22]

Scala versions: 2.11 2.10

Avro Tuples


The Scala standard library provides Tuple1 through Tuple22, which let programmers hold a fixed number of items together so they can be passed around as a single object. While all the elements of an Array have the same type, a TupleN can mix element types, e.g.

scala> val mytuple = ((2, "Be"), "Or", "Not", (2, "Be"))
mytuple: ((Int, String), String, String, (Int, String)) = ((2,Be),Or,Not,(2,Be))

scala> mytuple._1
res1: (Int, String) = (2,Be)

In this example, mytuple is a Tuple4 whose elements are two nested (Int, String) tuples and two Strings.

The same code using Avro tuples looks like this:

scala> val mytuple = AvroTuple4(AvroTuple2(2, "Be"), "Or", "Not", AvroTuple2(2, "Be"))
mytuple: com.github.massie.avrotuples.AvroTuple4[com.github.massie.avrotuples.AvroTuple2[Int,String],String,String,com.github.massie.avrotuples.AvroTuple2[Int,String]] = ((2,Be),Or,Not,(2,Be))

scala> mytuple._1
res0: com.github.massie.avrotuples.AvroTuple2[Int,String] = (2,Be)

Using Avro Tuples with your project

Avro tuples is published to Maven Central.

In Maven, use

<dependency>
  <groupId>com.github.massie</groupId>
  <artifactId>avrotuples_**SCALA_VERSION**</artifactId>
  <version>**AVROTUPLES_VERSION**</version>
</dependency>

In sbt, add the line

libraryDependencies += "com.github.massie" %% "avrotuples" % "**AVROTUPLES_VERSION**"

Note that with sbt you don't need to specify the Scala version: the %% operator in the line above automatically appends the correct Scala binary version to the artifact name.
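
As a concrete illustration, a minimal build.sbt might look like the sketch below. The version 0.3 and the scalaVersion value are assumptions chosen for the example; substitute whatever your project actually uses.

```scala
// build.sbt: a minimal sketch; both version numbers below are assumptions.
scalaVersion := "2.11.8"

// %% appends the Scala binary version ("_2.11" here) to the artifact name.
libraryDependencies += "com.github.massie" %% "avrotuples" % "0.3"

// With the scalaVersion above, this resolves the same artifact as the
// following line, which spells the suffix out with a single %:
// libraryDependencies += "com.github.massie" % "avrotuples_2.11" % "0.3"
```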

Avro Tuples are like Scala Tuples

  • Avro tuples can serve as a drop-in replacement for Scala tuples
  • AvroTuple2 has a swap method just like Tuple2
  • All Avro tuples extend ProductN, e.g. AvroTuple1[T1] extends Product1[T1]
  • Avro tuples implement Externalizable, making them Java serializable
  • Avro tuples can be nested
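
The properties listed above can be sketched together as follows. This assumes the avrotuples dependency is on the classpath; the value names are illustrative, and the Java serialization round trip uses only standard java.io classes.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import com.github.massie.avrotuples.AvroTuple2

// Illustrative values only.
val pair = AvroTuple2("Be", 2)

// swap exchanges the two elements, as on Tuple2.
val swapped = pair.swap
assert(swapped._1 == 2)
assert(swapped._2 == "Be")

// Product2 members are available because AvroTuple2 extends Product2.
assert(pair.productArity == 2)

// Tuples can be nested.
val nested = AvroTuple2(pair, "Or")
assert(nested._1._2 == 2)

// Java serialization round trip, relying on the Externalizable implementation.
val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(pair)
out.close()
val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
val restored = in.readObject()
in.close()
assert(restored == pair)
```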

Avro Tuples have additional functionality over Scala tuples

Avro tuples implement SpecificRecord

This interface allows Avro to (de)serialize Avro tuples. An Avro serialize/deserialize round trip looks like this:

val tuple = AvroTuple2("This", AvroTuple4("That", "and", "the", "other"))
val outTuple = AvroTuple2.fromBytes(tuple.toBytes)
assert(tuple == outTuple)

Avro tuples implement KryoSerializable

If you pass Avro tuples to Kryo, the tuple will be (de)serialized in Avro format using the Avro tuple schema.
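
A Kryo round trip might look like the sketch below. It assumes Kryo is on the classpath alongside avrotuples. The explicit register call is an assumption about your Kryo version: older releases do not require it, while Kryo 5 requires registration by default. The buffer size is arbitrary.

```scala
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import com.github.massie.avrotuples.AvroTuple2

val kryo = new Kryo()
// Harmless on older Kryo versions; needed when registration is required.
kryo.register(classOf[AvroTuple2[_, _]])

val tuple = AvroTuple2("One", 1L)

// Write the tuple into an in-memory buffer that may grow without limit (-1).
val output = new Output(1024, -1)
kryo.writeObject(output, tuple)
val bytes = output.toBytes
output.close()

// Read it back; Kryo delegates to the tuple's KryoSerializable methods.
val input = new Input(bytes)
val restored = kryo.readObject(input, classOf[AvroTuple2[_, _]])
input.close()

println(restored)
```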

Avro tuples are mutable

You can update the values for an Avro tuple without needing to create a new tuple, e.g.

val tuple = AvroTuple2("One", 1L)
assert(tuple._1 == "One")
assert(tuple._2 == 1L)
tuple.update("Two", 2L)
assert(tuple._1 == "Two")
assert(tuple._2 == 2L)
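
One situation where in-place updates can help is a loop that would otherwise allocate a new tuple for every record. The sketch below reuses a single AvroTuple2; the words sequence and the loop are illustrative and not part of the library.

```scala
import com.github.massie.avrotuples.AvroTuple2

// Hypothetical input data for the example.
val words = Seq("One", "Two", "Three")

// One tuple instance, overwritten on every iteration.
val reusable = AvroTuple2("", 0L)
var totalLength = 0L

for ((word, index) <- words.zipWithIndex) {
  reusable.update(word, index.toLong)
  totalLength += reusable._1.length
}

// The tuple holds the values from the last iteration.
assert(reusable._1 == "Three")
assert(reusable._2 == 2L)
assert(totalLength == 11L)
```

If equality and hashing are based on the tuple's contents, as the round-trip equality checks suggest, avoid updating a tuple while it is stored in a hash-based collection.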

Avro tuples have limitations (for now)

No syntactic sugar

Scala provides syntactic sugar that Avro tuples do not. In Scala, you don't need to write Tuple2("a", "b"); you can just write ("a", "b"). Avro tuples must always be constructed by name, e.g. AvroTuple2("a", "b"), so Avro tuple code is more verbose.

Limited number of types

For now, Avro tuples can contain null values, strings, booleans, floats, doubles, ints, and longs. Support for more types, e.g. Option, is planned.

Recursive schemas break Parquet

There is a known issue with Avro/Parquet and recursive schemas. AvroTuples use a recursive schema in order to support nesting. If you are using AvroTuples with Parquet, you will need to use the AvroFlatTupleX types, since they have flat schemas.
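
A flat tuple might be used as in the sketch below. This is an assumption-heavy illustration: it supposes that AvroFlatTuple2 lives in the same package as AvroTuple2 and is constructed the same way, neither of which is confirmed above. Check the library source for the actual signatures.

```scala
// Assumption: same package and same construction style as AvroTuple2.
import com.github.massie.avrotuples.AvroFlatTuple2

// A flat schema suggests the elements should be plain values, not other tuples.
val flat = AvroFlatTuple2("Be", 2)

println(flat)
```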

License

Avro tuples is released under the Apache 2.0 license.

Pull requests are welcome.