In order to use the Spark extensions, you will need to load them and then add them to Spark.
From Maven Repository: _wait until we upload an initial version, until then see Development
Add:
--conf spark.sql.extensions=io.lakefs.iceberg.extension.LakeFSSparkSessionExtensions \
--packages io.lakefs:spark-extensions:<VERSION>
to your spark-*
command-line, or add this package to your
spark.jars.packages
configuration.
Run sbt package
, then add
--conf spark.sql.extensions=io.lakefs.iceberg.extension.LakeFSSparkSessionExtensions \
--jars ./target/scala-2.12/lakefs-spark-extensions_2.12-0.1.0-SNAPSHOT.jar`
refs_data_diff
is a Spark SQL table-valued function. The expression
refs_data_diff(PREFIX, FROM_SCHEMA, TO_SCHEMA, TABLE)
yields a relation that compares the "from" table PREFIX.FROM_SCHEMA.TABLE
with the "to" table PREFIX.TO_SCHEMA.TABLE
. Elements of "to" but not
"from" are added and appear with lakefs_change='+'
, elements of "from" but not
"to" are deleted and appear with lakefs_change='-'
.
For instance,
SELECT lakefs_change, Player, COUNT(*) FROM refs_data_diff('lakefs', 'main~', 'main', 'db.allstar_games')
GROUP BY lakefs_change, Player;
uses lakeFS Iceberg support to compute how many rows were changed for each player in the last commit.
Internally this relation is exactly a SELECT
expression. For instance,
you can set up a view with it:
CREATE TEMPORARY VIEW diff_allstar_games_main_last_commit AS
refs_data_diff('lakefs', 'main~', 'main', 'db.allstar_games');