This package contains AWS Glue extensions for use with Apache Iceberg.
Here are the instructions to start using the library on Apache Spark against the Glue catalog 123456789012:rmscatalog/rmsdatabase.
For more details and explanations about the configurations used, see the later sections of this doc.
Amazon EMR Spark 7.5 and later automatically installs this extensions package. Enable Iceberg on your Spark cluster using the instructions here, and launch a Spark session with:
spark-sql \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.type=glue \
--conf spark.sql.catalog.my_catalog.glue.id=123456789012:rmscatalog/rmsdatabase \
--conf spark.sql.defaultCatalog=my_catalog \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
For other Spark environments, fetch the packages from Maven and launch with:
spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,software.amazon.glue:glue-extensions-for-iceberg-spark-runtime-3.5_2.12:0.1.0 \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.glue.id=123456789012:rmscatalog/rmsdatabase \
--conf spark.sql.defaultCatalog=my_catalog \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,software.amazon.glue.GlueIcebergSparkExtensions \
--conf spark.hadoop.fs.s3a.impl=software.amazon.glue.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.credentials.resolver=software.amazon.glue.GlueTableCredentialsResolver \
--conf spark.hadoop.glue.id=123456789012:rmscatalog/rmsdatabase
Note: this command assumes that the AWS SDK v1 and v2 are on your classpath.
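If you build the session programmatically instead, the same settings can be applied through the SparkSession builder. A minimal sketch in Java (the catalog name, Glue ID, and queried table are placeholders):

import org.apache.spark.sql.SparkSession;

// Minimal sketch: the same configuration as the spark-sql command above,
// applied through the SparkSession builder.
SparkSession spark = SparkSession.builder()
    .appName("glue-extensions-example")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.my_catalog.glue.id", "123456789012:rmscatalog/rmsdatabase")
    .config("spark.sql.defaultCatalog", "my_catalog")
    .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            + "software.amazon.glue.GlueIcebergSparkExtensions")
    .config("spark.hadoop.fs.s3a.impl", "software.amazon.glue.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.credentials.resolver",
        "software.amazon.glue.GlueTableCredentialsResolver")
    .config("spark.hadoop.glue.id", "123456789012:rmscatalog/rmsdatabase")
    .getOrCreate();

// Hypothetical namespace and table, for illustration only
spark.sql("SELECT * FROM ns1.table1").show();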
The glue-extensions-api.yaml file contains the OpenAPI specification of the AWS Glue catalog extensions API.
The Glue extensions endpoints are located at <glue-api-endpoint>/extensions, e.g. https://glue.us-east-1.amazonaws.com/extensions.
See this page for all the AWS Glue API endpoints across AWS regions.
The catalog folder contains code for the Maven Central library glue-catalog-extensions-for-iceberg.
It provides a software.amazon.glue.GlueCatalogExtensions class, an implementation of Iceberg's org.apache.iceberg.Catalog Java interface that leverages the AWS Glue Extensions APIs.
If you are working on library integration with it, use:
dependencies {
implementation "software.amazon.glue:glue-catalog-extensions-for-iceberg:0.1.0"
}
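A minimal sketch of using the catalog class directly from Java (the Glue ID and table identifier are placeholders; glue.id mirrors the Spark configuration shown elsewhere in this doc):

import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import software.amazon.glue.GlueCatalogExtensions;

// Initialize the extensions catalog with a target Glue catalog ID
GlueCatalogExtensions catalog = new GlueCatalogExtensions();
catalog.initialize("my_catalog", Map.of(
    "glue.id", "123456789012:rmscatalog/rmsdatabase"));

// LoadTable returns the table with staging location and credentials
Table table = catalog.loadTable(TableIdentifier.of("ns1", "table1"));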
If you are working on engine integration, you might need to use the shaded runtime jar. It applies the same classpath shading as the Iceberg engine runtime jars:
dependencies {
implementation "software.amazon.glue:glue-catalog-extensions-for-iceberg-runtime:0.1.0"
}
The spark folder contains code for the Maven library glue-extensions-for-iceberg-spark-3.5_2.12.
It is recommended to depend on the runtime version of the library, which applies the same classpath shading as the Iceberg engine runtime jars:
// without shading
dependencies {
implementation "software.amazon.glue:glue-extensions-for-iceberg-spark-3.5_2.12:0.1.0"
}
// with shading
dependencies {
implementation "software.amazon.glue:glue-extensions-for-iceberg-spark-runtime-3.5_2.12:0.1.0"
}
You can also use it directly in your Spark application with:
spark-sql \
--packages software.amazon.glue:glue-extensions-for-iceberg-spark-runtime-3.5_2.12:0.1.0 \
--conf ...
See later sections of the doc for all required and optional configurations.
The package is bundled with a forked version of the Hadoop S3AFileSystem that adds a pluggable S3CredentialsResolver interface.
software.amazon.glue.GlueTableCredentialsResolver implements S3CredentialsResolver to provide access to data in RMS catalogs.
To use this resolver, set the following Spark configurations:
spark.hadoop.fs.s3a.impl = software.amazon.glue.s3a.S3AFileSystem
spark.hadoop.fs.s3a.credentials.resolver = software.amazon.glue.GlueTableCredentialsResolver
spark.hadoop.glue.id = <your Glue catalog ID>
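If you set up the session programmatically, the same settings can be applied to the session's Hadoop configuration; a minimal sketch, assuming an existing SparkSession named spark:

// Apply the same S3A settings on the session's Hadoop configuration
org.apache.hadoop.conf.Configuration hadoopConf =
    spark.sparkContext().hadoopConfiguration();
hadoopConf.set("fs.s3a.impl", "software.amazon.glue.s3a.S3AFileSystem");
hadoopConf.set("fs.s3a.credentials.resolver",
    "software.amazon.glue.GlueTableCredentialsResolver");
hadoopConf.set("glue.id", "123456789012:rmscatalog/rmsdatabase");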
See the Hadoop S3A documentation for the other S3A configurations you can set.
We are working with the Apache Hadoop community to add this feature officially in the Hadoop AWS module.
The package is bundled with a forked version of the Redshift Spark connector that allows interaction with AWS Glue RMS catalogs and conversion of Iceberg data sources to Redshift data sources.
The software.amazon.glue.GlueIcebergSparkExtensions class implements the Spark extensions plugin interface and enables automatic query acceleration for RMS table queries through this forked connector.
To use the extension, set the following Spark configurations:
spark.sql.extensions = software.amazon.glue.GlueIcebergSparkExtensions
We are working with the Spark Redshift community to add this feature officially in the codebase.
GlueCatalogExtensions is designed to be used together with the Apache Iceberg GlueCatalog to offer it richer capabilities.
In Amazon EMR 7.5 and later, this library is automatically bundled as part of the EMR Spark Iceberg integration.
When you start an Iceberg GlueCatalog session in Spark, the extensions are automatically initialized behind the scenes:
spark.sql.catalog.my_catalog = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type = glue
spark.sql.catalog.my_catalog.glue.id = 123456789012:rmscatalog/rmsdatabase
We are actively working with the Apache Iceberg open source community to add this functionality to the open source version of Iceberg GlueCatalog library.
You can use the extensions just like any other Iceberg Java catalog. For example, in Spark:
spark.sql.catalog.my_catalog = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.catalog-impl = software.amazon.glue.GlueCatalogExtensions
spark.sql.catalog.my_catalog.glue.id = 123456789012:rmscatalog/rmsdatabase
The following catalog properties used by the Iceberg GlueCatalog are also respected by GlueCatalogExtensions:
- glue.id: the Glue catalog ID. When not set, defaults to the root catalog of the calling account.
- client.region: overrides the Glue region to use. When not set, the default region provider chain is used to discover the region.
- glue.endpoint: overrides the Glue endpoint to use. When not set, the official Glue regional endpoint is used.
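For example, to pin a specific region and endpoint for the Spark catalog above (the values are illustrative):

spark.sql.catalog.my_catalog.client.region = us-east-1
spark.sql.catalog.my_catalog.glue.endpoint = https://glue.us-east-1.amazonaws.com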
The extensions have their own set of configurations as well; see software.amazon.glue.GlueExtensionsProperties for a full list.
The extensions only support a limited set of operations, including:
- CreateNamespace and DeleteNamespace, asynchronously
- CreateTable, DeleteTable and UpdateTable, asynchronously
- LoadTable with staging location and credentials
- PreplanTable and PlanTable for server-side table scan planning
If you execute any other operation (e.g. GetNamespace, ListTables), you will hit an UnsupportedOperationException.
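As an illustration, reusing the catalog instance from the earlier Java sketch:

import org.apache.iceberg.catalog.Namespace;

// Supported: LoadTable works and returns staging credentials
Table table = catalog.loadTable(TableIdentifier.of("ns1", "table1"));

// Unsupported: ListTables throws unless a delegate catalog is configured
try {
  catalog.listTables(Namespace.of("ns1"));
} catch (UnsupportedOperationException e) {
  // expected without a delegate catalog (see below)
}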
To have a full catalog experience, you can set a delegate Apache Iceberg catalog to fulfill the operations that the extensions do not support. For example, you can delegate operations to an AWS Glue Iceberg REST Catalog endpoint like the following:
spark.sql.catalog.my_catalog = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.catalog-impl = software.amazon.glue.GlueCatalogExtensions
spark.sql.catalog.my_catalog.glue.id = 123456789012:rmscatalog/rmsdatabase
spark.sql.catalog.my_catalog.glue.extensions.delegate-catalog-impl = org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.my_catalog.glue.extensions.delegate-catalog.uri = https://glue.us-east-1.amazonaws.com/iceberg
spark.sql.catalog.my_catalog.glue.extensions.delegate-catalog.rest.signing-name = glue
spark.sql.catalog.my_catalog.glue.extensions.delegate-catalog.rest.signing-region = us-east-1
By default, the Glue extensions use a specific S3FileIO that uses the extensions LoadTable API to refresh credentials. To override that behavior, you can provide a custom FileIO implementation through:
spark.sql.catalog.my_catalog = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.catalog-impl = software.amazon.glue.GlueCatalogExtensions
spark.sql.catalog.my_catalog.glue.id = 123456789012:rmscatalog/rmsdatabase
spark.sql.catalog.my_catalog.glue.extensions.file-io-impl = my.company.MyCustomFileIO
spark.sql.catalog.my_catalog.glue.extensions.file-io.property1 = value1
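As a sketch of what such an implementation can look like (the class below is the hypothetical my.company.MyCustomFileIO from the configuration above; it is assumed that the glue.extensions.file-io.* properties are passed to initialize()):

import java.util.Map;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

public class MyCustomFileIO implements FileIO {
  private Map<String, String> properties;

  @Override
  public void initialize(Map<String, String> properties) {
    // assumed to receive e.g. property1 = value1 from the configuration above
    this.properties = properties;
  }

  @Override
  public InputFile newInputFile(String path) {
    throw new UnsupportedOperationException("reads not implemented in this sketch");
  }

  @Override
  public OutputFile newOutputFile(String path) {
    throw new UnsupportedOperationException("writes not implemented in this sketch");
  }

  @Override
  public void deleteFile(String path) {
    throw new UnsupportedOperationException("deletes not implemented in this sketch");
  }
}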
The GlueCatalogSessionExtensions class can be used independently outside GlueCatalogExtensions, and can be used to inject session-specific information like AWS credentials.
For example:
// Initialize the session extensions with catalog properties
GlueCatalogSessionExtensions sessionExtensions = new GlueCatalogSessionExtensions();
sessionExtensions.initialize(...);

// Build a session context that carries per-session AWS credentials
SessionCatalog.SessionContext sessionContext = new SessionCatalog.SessionContext(
    "id",
    "user",
    ImmutableMap.of(
        GlueExtensionsSessionProperties.CREDENTIALS_AWS_ACCESS_KEY_ID, "access-key-id",
        GlueExtensionsSessionProperties.CREDENTIALS_AWS_SECRET_ACCESS_KEY, "secret-access-key",
        GlueExtensionsSessionProperties.CREDENTIALS_AWS_SESSION_TOKEN, "session-token"),
    ImmutableMap.of());

// Load a table using the credentials carried by the session context
Table table = sessionExtensions.loadTable(sessionContext, TableIdentifier.of("ns1", "table1"));
Supported JDK versions: 11 and 17
- To invoke a build and run tests:
./gradlew build
- To skip tests:
./gradlew build -x test
- To fix code style:
./gradlew spotlessApply
See CONTRIBUTING.md for more details about the AWS open source contribution guidelines.