Bucket Metadata Search with Spark SQL (2.x)
The spark-ecs-connector project makes it possible to view an ECS bucket as a Spark dataframe. Each row in the dataframe corresponds to an object in the bucket, and each column corresponds to a piece of object metadata.
How it Works
Spark SQL supports querying external data sources and rendering the results as a dataframe. With the PrunedFilteredScan trait, the external data source handles column pruning and predicate pushdown. In other words, the WHERE clause is pushed to ECS by taking advantage of the bucket metadata search feature of ECS 2.2.
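As a rough illustration of predicate pushdown, the sketch below turns the filters that Spark hands to a PrunedFilteredScan relation into a single query expression. It is not the connector's actual code: the `PushdownSketch` object, its helper names, and the exact query syntax (`==`, `>=`, `and`) are assumptions made for illustration, and mapping dataframe column names to ECS metadata key names is left out. Only the filter classes from `org.apache.spark.sql.sources` are real Spark API.

```scala
import org.apache.spark.sql.sources._

// Hypothetical sketch, not part of spark-ecs-connector.
object PushdownSketch {

  // Render a Spark filter value as a query literal (assumed syntax:
  // strings are single-quoted, everything else uses toString).
  // Quote escaping is deliberately omitted to keep the sketch short.
  private def literal(value: Any): String = value match {
    case s: String => s"'$s'"
    case other     => other.toString
  }

  // Translate one filter; None means "cannot be pushed down".
  def translate(filter: Filter): Option[String] = filter match {
    case EqualTo(attr, v)            => Some(s"$attr==${literal(v)}")
    case GreaterThan(attr, v)        => Some(s"$attr>${literal(v)}")
    case GreaterThanOrEqual(attr, v) => Some(s"$attr>=${literal(v)}")
    case LessThan(attr, v)           => Some(s"$attr<${literal(v)}")
    case LessThanOrEqual(attr, v)    => Some(s"$attr<=${literal(v)}")
    case And(left, right) =>
      for (l <- translate(left); r <- translate(right)) yield s"$l and $r"
    case _ => None // e.g. Or, Not, In: left for Spark to evaluate
  }

  // Spark passes the WHERE clause as an array of filters that are
  // implicitly ANDed together. Filters that cannot be translated are skipped.
  def translateAll(filters: Array[Filter]): Option[String] = {
    val parts = filters.toSeq.flatMap(f => translate(f).toList)
    if (parts.isEmpty) None else Some(parts.mkString(" and "))
  }
}
```

With this sketch, the filters `GreaterThanOrEqual("image-viewcount", 5000)` and `LessThanOrEqual("image-viewcount", 10000)` become the expression `image-viewcount>=5000 and image-viewcount<=10000`. Skipping an untranslatable filter is safe here because, unless a relation overrides `unhandledFilters`, Spark re-applies every filter to the rows the scan returns.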
Linking to your Spark 2.x Application
The library is published to Maven Central. Link to the library using these dependency coordinates:
Using in Zeppelin
- Install Zeppelin 0.7+.
- Create a notebook with the following commands. Replace each *** placeholder with your S3 credentials.
import java.net.URI
import com.emc.ecs.spark.sql.sources.s3._
val endpointUri = new URI("http://10.1.83.51:9020/")
val credential = ("***ACCESS KEY ID***", "***SECRET ACCESS KEY***")
val df = sqlContext.read.bucket(endpointUri, credential, "ben_bucket", withSystemMetadata = false)
df.createOrReplaceTempView("ben_bucket")
%sql SELECT * FROM ben_bucket WHERE `image-viewcount` >= 5000 AND `image-viewcount` <= 10000
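The same range query can also be written against the dataframe API instead of SQL. The following is a minimal sketch using standard Spark functions. The `ViewCountQuery` object and `viewCountBetween` helper are hypothetical names introduced here, and whether this form is pushed down to ECS in the same way as the SQL version is an assumption that has not been verified.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of spark-ecs-connector.
object ViewCountQuery {
  // Keep rows whose `image-viewcount` lies in [low, high], both inclusive.
  // col(...) accepts the hyphenated name directly, so no backticks are needed.
  def viewCountBetween(df: DataFrame, low: Long, high: Long): DataFrame =
    df.filter(col("image-viewcount").between(low, high))
}
```

In the notebook above, this could be called as `ViewCountQuery.viewCountBetween(df, 5000L, 10000L).show()`.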
The project uses the Gradle build system and includes a script that automatically downloads Gradle.
Build and install the library to your local Maven repository as follows:
$ ./gradlew publishShadowPublicationToMavenLocal
- Implement 'OR' pushdown. ECS supports 'or', but not in combination with 'and'.
- Avoid sending a query containing a non-indexable key.