yuhai1023 / spark-jq

A Spark toolkit for analyzing semi-structured and structured data

SPARK-JQ

A Spark toolkit, not exactly like jq but inspired by it

Supported RDD Format

  1. RDD[String] whose elements are JSON objects or JSON arrays

Supported JSON field types

  1. Number -> Scala Int, Long, Double
  2. String -> Scala String
  3. Object -> Scala Map
  4. Array -> Scala List
  5. Boolean -> Scala Boolean
  6. Composite (nested) field, such as "map1.map2.intField" -> any of the types above
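
To make the composite field notation concrete, here is a minimal plain-Scala sketch of how a dotted path such as "map1.map2.intField" can be resolved against nested Maps. It does not use spark-jq or Spark, and `FieldPath` and `lookup` are hypothetical names chosen for illustration. The library's own implementation may differ.

```scala
object FieldPath {
  // Walk a dot-separated path through nested Maps, one segment at a time.
  // Returns None if a segment is missing or if a non-Map value is reached
  // before the path is fully consumed.
  def lookup(root: Map[String, Any], path: String): Option[Any] =
    path.split('.').foldLeft(Option[Any](root)) {
      case (Some(m: Map[_, _]), key) => m.asInstanceOf[Map[String, Any]].get(key)
      case _                         => None
    }

  def main(args: Array[String]): Unit = {
    val record: Map[String, Any] =
      Map("name" -> "a", "map1" -> Map("map2" -> Map("intField" -> 42)))

    println(lookup(record, "map1.map2.intField")) // Some(42)
    println(lookup(record, "map1.nope"))          // None
  }
}
```

The result is typed as Any, which is why a separate step is needed to narrow it to one of the Scala types listed above.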

Usage

Sbt

libraryDependencies += "com.magicsoho" %% "spark-jq" % "0.1.0"

Maven

<dependency>
    <groupId>com.magicsoho</groupId>
    <artifactId>spark-jq_${your_scala_binary_version}</artifactId>
    <version>0.1.0</version>
</dependency>

RDDLike

  1. First, import:

    import sjq.RDDLike._

  2. rdd.parseJson

    • parses an RDD of JSON strings into an RDD of JSONObject
  3. rddJson.fields("field1", "field2")

    • returns an RDD[List(field1Type, field2Type)]
  4. rddFields(n)

    • returns an RDD holding element n of each list
  5. rddJson.key[T]("field1") or rdd.field(fieldFoo)

    • return RDD[T]
  6. rddJson.jsonObject("objKey")

    • returns a JSONObject RDD
  7. rddField.[Int|Long|Double|Boolean|List[T]|Map[T1,T2]|JSONObject]

    • maps an RDD[Any] into an RDD[T], where T is the specified type
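
The shape of the data at each step can be modelled with ordinary Scala collections. The sketch below is only an analogue: a Seq stands in for the RDD, and `FieldsModel`, `fields`, `nth` and `asInts` are hypothetical helpers, not spark-jq calls. It assumes every record contains the requested fields.

```scala
object FieldsModel {
  type Record = Map[String, Any]

  // Analogue of fields("a", "b"): keep the named values of each record, in order.
  // Throws NoSuchElementException if a record lacks one of the names.
  def fields(records: Seq[Record], names: String*): Seq[List[Any]] =
    records.map(r => names.toList.map(n => r(n)))

  // Analogue of rddFields(n): take element n of each list.
  def nth(rows: Seq[List[Any]], n: Int): Seq[Any] = rows.map(_(n))

  // Analogue of the typed step (.Int): narrow Any to Int, failing loudly otherwise.
  def asInts(values: Seq[Any]): Seq[Int] = values.map {
    case i: Int => i
    case other  => throw new IllegalArgumentException(s"not an Int: $other")
  }

  def main(args: Array[String]): Unit = {
    val records: Seq[Record] = Seq(
      Map("name" -> "a", "age" -> 1),
      Map("name" -> "b", "age" -> 2)
    )
    val rows = fields(records, "name", "age") // one List(name, age) per record
    val ages = asInts(nth(rows, 1))           // the second element, narrowed to Int
    println(ages.sum)                         // 3
  }
}
```

Selecting fields yields untyped values, so the final narrowing step is what recovers a usable element type.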

Lambda

  1. First, import:

    import sjq.Lambda._

  2. addInt | addDouble

    • for an RDD[(Key, (Int, Int))]; use with reduceByKey(addInt)
  3. addTuple2

    • for an RDD[(Key, ((AnyNumber, AnyNumber), (AnyNumber, AnyNumber)))]; use with reduceByKey(addTuple2)
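
As a rough illustration of what such reduce helpers do, the following plain-Scala sketch defines two stand-in functions and applies them with a local reduce-by-key over a Seq. The names `sumInt`, `sumPair` and the local `reduceByKey` are hypothetical. The real signatures of addInt and addTuple2 are not shown in this README and may differ.

```scala
object ReduceFns {
  // Stand-in for an Int-adding reducer: combines two values into one.
  val sumInt: (Int, Int) => Int = _ + _

  // Stand-in for a pairwise reducer: adds two tuples component by component.
  // Numeric lets each component be any numeric type (Int, Long, Double, ...).
  def sumPair[A, B](x: (A, B), y: (A, B))(implicit na: Numeric[A], nb: Numeric[B]): (A, B) =
    (na.plus(x._1, y._1), nb.plus(x._2, y._2))

  // Local analogue of Spark's reduceByKey: group by key, then fold each
  // group's values with the supplied function.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val counts = reduceByKey(Seq(("a", 1), ("b", 2), ("a", 3)))(sumInt)
    println(counts("a")) // 4

    val sums = reduceByKey(Seq(("a", (1, 0.5)), ("a", (2, 1.5))))((x, y) => sumPair(x, y))
    println(sums("a")) // (3,2.0)
  }
}
```

With a real pair RDD, a function of the same shape is what gets passed to reduceByKey.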

Planned features

  1. Support regex fields
  2. Support more formats: CSV, XML, ...
  3. Support other input sources: SQL, Kafka, Flume, ...
  4. Support more RDD reduce-function utilities

LICENSE

MIT