spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames.

spark-bigquery version   Spark version   Comment
0.2.x                    2.x.y           Active development
0.1.x                    1.x.y           Development halted

To use the package in a Google Cloud Dataproc cluster:

spark-shell --packages com.spotify:spark-bigquery_2.10:0.2.0
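
Outside of spark-shell, the same Maven coordinates can presumably be declared in an sbt build. The fragment below is a minimal sketch, not taken from the project: the scalaVersion value is an assumption chosen to match the _2.10 suffix of the artifact above, and the Spark dependencies themselves are left out.

```scala
// build.sbt (sketch)
// Assumption: a Scala 2.10.x build, matching the _2.10 artifact suffix
scalaVersion := "2.10.6"

// %% appends the Scala binary version, so this resolves to spark-bigquery_2.10
libraryDependencies += "com.spotify" %% "spark-bigquery" % "0.2.0"
```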

To use it in a local SBT console:

import com.spotify.spark.bigquery._

// Set up GCP credentials
sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>")

// Set up BigQuery project and bucket
sqlContext.setBigQueryProjectId("<BILLING_PROJECT>")
sqlContext.setBigQueryGcsBucket("<GCS_BUCKET>")

// Set up BigQuery dataset location, default is US
sqlContext.setBigQueryDatasetLocation("<DATASET_LOCATION>")

Usage:

// Load everything from a table
val table = sqlContext.bigQueryTable("bigquery-public-data:samples.shakespeare")

// Load results from a SQL query
// Only the legacy SQL dialect is supported for now
val df = sqlContext.bigQuerySelect(
  "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]")

// Save data to a table
df.saveAsBigQueryTable("my-project:my_dataset.my_table")
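
The setup and usage calls above can be combined into a small standalone job. The following is a sketch under stated assumptions rather than an official example: the object name, the local[*] master, the aggregation step, and the word_totals output table are hypothetical, and the placeholders must be replaced with real values before it can run against BigQuery.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import com.spotify.spark.bigquery._

// Hypothetical job name; not part of the library
object ShakespeareWordTotals {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a cluster the master is usually supplied externally
    val spark = SparkSession.builder()
      .appName("spark-bigquery-sketch")
      .master("local[*]")
      .getOrCreate()
    val sqlContext = spark.sqlContext

    // Same setup calls as in the console example
    sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>")
    sqlContext.setBigQueryProjectId("<BILLING_PROJECT>")
    sqlContext.setBigQueryGcsBucket("<GCS_BUCKET>")

    // Load query results as a DataFrame (legacy SQL dialect)
    val df = sqlContext.bigQuerySelect(
      "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]")

    // Plain Spark transformation: total occurrences of each word across all works
    val totals = df.groupBy("word").agg(sum("word_count").as("total_count"))

    // Hypothetical destination table
    totals.saveAsBigQueryTable("my-project:my_dataset.word_totals")

    spark.stop()
  }
}
```

Between the load and the save, the data is an ordinary DataFrame, so any Spark transformation can take the place of the aggregation shown here.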

License

Copyright 2016 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0