# Azure EventHubs + Apache Spark Connector
This is the source code for the Azure Event Hubs and Apache Spark Connector.
Azure Event Hubs is a highly scalable publish-subscribe service that can ingest millions of events per second and stream them into multiple applications. Spark Streaming and Structured Streaming are scalable and fault-tolerant stream processing engines that allow users to process huge amounts of data using complex algorithms expressed with high-level functions like `window`. This data can then be pushed to filesystems, databases, or even back to Event Hubs.
By making Event Hubs and Spark easier to use together, we hope this connector makes building scalable, fault-tolerant applications easier for our users.
## Latest Releases

| Spark Version | Package Name | Package Version |
| ------------- | ------------ | --------------- |

| Databricks Runtime Version | Package Name | Package Version |
| -------------------------- | ------------ | --------------- |
| Databricks Runtime 3.4 | azure-eventhubs-databricks_2.11 | |
| Databricks Runtime 3.3 | azure-eventhubs-databricks_2.11 | |
Planned changes can be found on our wiki.
The best place to start when using this library is to make sure you're acquainted with Azure Event Hubs and Apache Spark. You can read Azure Event Hubs documentation here, documentation for Spark Streaming here, and, last but not least, Structured Streaming here.
## Using the Connector
If you need additional assistance, please don't hesitate to ask! Just open an issue, and one of the repo owners will get back to you ASAP. :) Feedback, feature requests, bug reports, etc. are all welcome!
## Using the library
In general, you should not need to build this library yourself. If you'd like to help contribute (we'd love to have your help :) ), then building the source and running tests is certainly necessary. You can go to our Contributor's Guide for that information and more.
This library is available for use in Maven projects from the Maven Central Repository, and can be referenced using the following dependency declaration. Be sure to see the Latest Releases to find the package name and package version that works with your version of Apache Spark!
```xml
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-eventhubs-spark_[2.XX]</artifactId>
    <version>[LATEST]</version>
</dependency>
<!-- The correct artifactId and version can be found in the Latest Releases section above -->
```
Or, for SBT:

```scala
// https://mvnrepository.com/artifact/com.microsoft.azure/azure-eventhubs-spark_2.11
libraryDependencies += "com.microsoft.azure" % "azure-eventhubs-spark_[2.XX]" % "[LATEST]"
```
### Getting the Staging Version
We also publish a staging version of the Azure EventHubs + Apache Spark connector on GitHub. To use the staging version, two things need to be added to your pom.xml. First, add a new repository like so:
```xml
<repository>
    <id>azure-event-hubs-spark</id>
    <url>https://raw.github.com/Azure/azure-event-hubs-spark/maven-repo/</url>
    <snapshots>
        <enabled>true</enabled>
        <updatePolicy>always</updatePolicy>
    </snapshots>
</repository>
```
Then add the following dependency declaration:
```xml
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-eventhubs-spark_[2.XX]</artifactId>
    <version>2.1.6-SNAPSHOT</version>
</dependency>
```
Or, for SBT:

```scala
// https://mvnrepository.com/artifact/com.microsoft.azure/azure-eventhubs-spark_2.11
libraryDependencies += "com.microsoft.azure" % "azure-eventhubs-spark_2.11" % "2.1.6-SNAPSHOT"
```
In order to use the connector, you need to have:

- Java 1.8 SDK
- Maven 3.x
- Scala 2.11
More details on building from source and running tests can be found in our Contributor's Guide.
```shell
mvn clean install
```

This command builds and installs the azure-eventhubs-spark jar to your local Maven cache. Subsequently, you can build any Spark Streaming application that references this jar.
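As a rough illustration of what such an application might look like, here is a minimal Structured Streaming sketch that reads from Event Hubs and prints event bodies to the console. This is only a sketch, not the connector's definitive API: it assumes a connector version that registers an `eventhubs` source, and the option name and connection-string placeholders below are illustrative — check the documentation for the connector version you are using.

```scala
import org.apache.spark.sql.SparkSession

object EventHubsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("EventHubsExample")
      .getOrCreate()

    // Placeholder connection string -- substitute your own namespace,
    // key name, key, and Event Hub name.
    val connectionString =
      "Endpoint=sb://<namespace>.servicebus.windows.net/;" +
      "SharedAccessKeyName=<keyName>;SharedAccessKey=<key>;" +
      "EntityPath=<eventHubName>"

    // Read a stream from Event Hubs (assumes the connector registers
    // an "eventhubs" source; option names may differ between versions).
    val events = spark.readStream
      .format("eventhubs")
      .option("eventhubs.connectionString", connectionString)
      .load()

    // Event bodies arrive as binary; cast to string before processing.
    val query = events.selectExpr("CAST(body AS STRING)")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

To run this, the azure-eventhubs-spark jar (installed above) must be on your application's classpath along with Spark itself.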