Spark DateTime Library
A library for exposing dateTime functions from the joda time library as SQL functions. Also provide a dsl for dateTime catylst expressions; this utilizes the [scala wrapper library](https://github.com/nscala-time/nscala-time nscala).
This library requires Spark 1.4+
You can link against this library in your program at the following coordiates:
groupId: org.sparklinedata artifactId: spark-datetime_2.10 version: 0.0.1
Using with Spark shell
This package can be added to Spark using the
--jars command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages org.sparklinedata:spark-datetime_2.10:0.0.1
- A set of functions from the joda library to operate on dates.
field access: all functions in the DateTime class are available as sql functions. The first argument is the DateTime object on which the function is to be applied.
construction: functions are available to convert a String or a epoch value to DateTime
comparisonfunctions available to compare dates (=, <, <=, >, >=), also compare against now.
arithmetic: functions available to add/subtract Period from dates.
intervals: functions available to construct Intervals and compare(contains, overlaps, abuts, gap) intervals and dateTimes.
- A dsl for dateTime catylst expressions.
- A StringContext to embed date expressions in SQL statements.
Function naming convention
- getter functions on the DateTime class are exposed with the same name, in camelCase. So getYear is exposed as year, getMonthOfYear is exposed as monthOfYear etc.
Assume you have a table input with a string column called dt
select dt, dateTime(dt), dayOfWeek(dateTime(dt)), dayOfWeekName(dateTime(dt)), dayOfWeekName(dateTimeWithTZ(dt)) from input
Date Expressions using the DSL
A basic example
import com.github.nscala_time.time.Imports._ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.sparklinedata.spark.dateTime.dsl.expressions._ import org.sparklinedata.spark.dateTime.Functions // register all functions Functions.register(sqlContext) val dT = dateTime('dt) val dOW = dateTime('dt) dayOfWeek val dOWNm = dateTime('dt) dayOfWeekName val dOWNm2 = dateTimeWithTZ('dt) dayOfWeekName val dTFixed = dateTime("2015-05-22T08:52:41.903-07:00") val t = sql(date"select dt, $dT, $dOW, $dOWNm, $dOWNm2, $dTFixed," + " dateTime(\"2015-05-22T08:52:41.903-07:00\") from input")
An example about periods
import com.github.nscala_time.time.Imports._ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.sparklinedata.spark.dateTime.dsl.expressions._ val dT = dateTime('dt) val dT1 = dateTime('dt) + 3.months val dT2 = dateTime('dt) - 3.months val dT3 = dateTime('dt) + 12.week val cE = dateTime('dt) + 3.months > (dateTime('dt) + 12.week) val t = sql(date"select dt, $dT, $dT1, $dT2, $dT3, $cE from input")
Weekend filter example
val filter : Expression = ((dateTime('dt) dayOfWeekName) === "Saturday") || ((dateTime('dt) dayOfWeekName) === "Sunday") val t = sql(date"select dt from input where $filter")
Group By example
val dayOfWeek: Expression = dateTime('dt) dayOfWeekName val t = sql(date"select $dayOfWeek, count(*) from input group by $dayOfWeek")
val i1 = END_DATE - 15.day to END_DATE - 10.day val isBefore = i1 isBeforeE dateTime('dt) val isAfter = i1 isAfterE dateTime('dt) val i2 = dateTime('dt) to (dateTime('dt) + 5.days) val overlapsE = i1 overlapsE i2 val abutsE = i1 abutsE i2 val t = sql(date"select dt, $isBefore, $isAfter, $overlapsE, $abutsE from input")
Use this feature to bucket dates into given Periods. For e.g. 8.hours, 30.mins, 2.days etc.
The following example buckets rows into 3 day periods. The bucket function on a DateExpression takes an origin date and a Period specification. The Period is an iso8061 specification for period.
val start = dateTime("2015-06-23T17:27:43.769-07:00") val dT = dateTime('dt) val timeBucket = dateTime('dt) bucket(start, 3.days) val t = sql(date"select dt, $dT, $timeBucket from input")
Or the direct sql for the above query is:
select dt, dateTime(`dt`), timeBucket(dateTime(`dt`),dateTime("2015-05-24T17:27:43.769-07:00"),period("P3D")) from input
Building From Source
This library is built with SBT, which is automatically downloaded by the included shell script. To build a JAR file simply run
build/sbt package from the project root. The build configuration includes support for both Scala 2.10 and 2.11.