A connector for reading and writing data between Apache Spark and OpenSearch. It enables Spark jobs to directly index data into OpenSearch and run queries against it, with parallel reads and writes across Spark partitions and OpenSearch shards for efficient distributed processing.
Also supports Apache Hive and Hadoop Map/Reduce. Works with any OpenSearch cluster accessible via REST, including Amazon OpenSearch Service and Amazon OpenSearch Serverless.
Use cases:
- Index large datasets from Spark ETL pipelines into OpenSearch
- Query OpenSearch from Spark for analytics, reporting, and machine learning
- Build search-powered applications backed by Spark data pipelines
- Bridge your data lake or lakehouse with search and analytics on OpenSearch
Write a Spark DataFrame to OpenSearch and read it back, using PySpark:
```shell
pyspark --packages org.opensearch.client:opensearch-spark-30_2.12:2.0.0 \
  --conf spark.opensearch.nodes=localhost \
  --conf spark.opensearch.nodes.wan.only=true
```

```python
# Write
df = spark.createDataFrame([("hello", 1), ("world", 2)], ["name", "value"])
df.write.format("opensearch").save("my-index")

# Read
result = spark.read.format("opensearch").load("my-index")
result.show()
```

For Scala, Java, Spark SQL, RDD, and more examples, see the User Guide, which also covers Amazon OpenSearch Service and Amazon OpenSearch Serverless.
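The same connector settings can also be set programmatically when building the SparkSession instead of on the `pyspark` command line. This is a sketch only: the values (the `localhost` endpoint, the artifact coordinate, the app name) are placeholders to adapt to your environment, and it assumes the `spark.opensearch.*` keys mirror the `--conf` flags shown above.

```python
from pyspark.sql import SparkSession

# Sketch: configure the connector while constructing the session rather than
# via pyspark/spark-submit flags. All values below are placeholders.
spark = (
    SparkSession.builder
    .appName("opensearch-example")
    .config("spark.jars.packages",
            "org.opensearch.client:opensearch-spark-30_2.12:2.0.0")
    .config("spark.opensearch.nodes", "localhost")      # cluster endpoint
    .config("spark.opensearch.nodes.wan.only", "true")  # single routable endpoint
    .getOrCreate()
)
```

This form is convenient for jobs submitted as standalone scripts, where there is no interactive shell to pass `--conf` flags to. (Running it requires a Spark installation and a reachable OpenSearch cluster, so it is shown here as a configuration sketch.)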
Choose the artifact that matches your Spark and Scala version:
| Spark Version | Scala Version | Artifact |
|---|---|---|
| 3.4.x | 2.12 | org.opensearch.client:opensearch-spark-30_2.12:2.0.0 |
| 3.4.x | 2.13 | org.opensearch.client:opensearch-spark-30_2.13:2.0.0 |
| 3.5.x | 2.12 | org.opensearch.client:opensearch-spark-35_2.12:2.0.0 |
| 3.5.x | 2.13 | org.opensearch.client:opensearch-spark-35_2.13:2.0.0 |
| 4.x | 2.13 | org.opensearch.client:opensearch-spark-40_2.13:2.0.0 |
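The version mapping in the table can be expressed as a small helper that assembles the Maven coordinate for a given Spark and Scala version. This is a hypothetical utility for illustration, not part of the connector, and it only covers the combinations the table lists:

```python
def opensearch_spark_artifact(spark_version: str, scala_version: str,
                              connector_version: str = "2.0.0") -> str:
    """Return the Maven coordinate matching the table above.

    Hypothetical helper; only the Spark/Scala combinations in the
    table are supported.
    """
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    if major == 4:
        suffix = "40"
    elif (major, minor) == (3, 5):
        suffix = "35"
    elif (major, minor) == (3, 4):
        suffix = "30"
    else:
        raise ValueError(f"unsupported Spark version: {spark_version}")
    return (f"org.opensearch.client:opensearch-spark-"
            f"{suffix}_{scala_version}:{connector_version}")


print(opensearch_spark_artifact("3.5.1", "2.12"))
# org.opensearch.client:opensearch-spark-35_2.12:2.0.0
```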
For Map/Reduce and Hive, see org.opensearch.client:opensearch-hadoop-mr and org.opensearch.client:opensearch-hadoop-hive on Maven Central.
- OpenSearch 1.x or later (including Amazon OpenSearch Service and Serverless)
- Java 11 or later at runtime
- Java 21 to build from source
- For SigV4 IAM authentication, additional AWS SDK dependencies are required. See the User Guide.
See COMPATIBILITY.md.
OpenSearch Hadoop uses Gradle for its build system. JDK 21 is required.
```shell
./gradlew build             # build and run unit tests
./gradlew integrationTests  # run integration tests
./gradlew distZip           # create a distributable zip
```

- User Guide — usage examples for Spark, Hive, and Map/Reduce
- Compatibility — supported versions of OpenSearch, Spark, and Scala
- Contributing
- Changelog
This project is licensed under the Apache License, Version 2.0.