The main abstractions are special types of
ShapeLuceneRDD, which instantiate a Lucene index on each Spark executor. These
RDDs distribute search queries and aggregate search results between the Spark driver and its executors. Currently, the following queries are supported:
||Exact term search|
||Fuzzy term search|
||Query parser search|
||Record linkage via Lucene queries|
||Search within radius|
||Spatial radius linkage|
Using the query parser, you can perform prefix queries, fuzzy queries, prefix queries, etc and any combination of those. For more information on using Lucene's query parser, see Query Parser.
Here are a few examples using
LuceneRDD for full text search, spatial search and record linkage. All examples exploit Lucene's flexible query language. For spatial search,
jts are required.
For an overview of the library, check these ScalaIO 2016 Slides.
You can link against this library (for Spark 1.4+) in your program at the following coordinates:
libraryDependencies += "org.zouzias" %% "spark-lucenerdd" % "0.3.7"
<dependency> <groupId>org.zouzias</groupId> <artifactId>spark-lucenerdd_2.11</artifactId> <version>0.3.7</version> </dependency>
This library can also be added to Spark jobs launched through
spark-submit by using the
--packages command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages org.zouzias:spark-lucenerdd_2.11:0.3.7
--packages ensures that this library and its dependencies will be added to the classpath. The
--packages argument can also be used with
The project has the following compatibility with Apache Spark:
|Artifact||Release Date||Spark compatibility||Notes||Status|
|0.3.8-SNAPSHOT||>= 2.4.2, JVM 8||develop||Under Development|
|0.3.7||2019-04-26||>= 2.4.2, JVM 8||tag v.0.3.7||Released|
|0.3.6||2019-03-11||>= 2.4.0, JVM 8||tag v0.3.6||Released|
|0.2.8||2017-05-30||2.1.x, JVM 7||tag v0.2.8||Released|
|0.1.0||2016-09-26||1.4.x, 1.5.x, 1.6.x||tag v0.1.0||Cross-released with 2.10/2.11|
Project Status and Limitations
Implicit conversions for the primitive types (Int, Float, Double, Long, String) are supported. Moreover, implicit conversions for all product types (i.e., tuples and case classes) of the above primitives are supported. Implicits for tuples default the field names to "_1", "_2", "_3, ... following Scala's naming conventions for tuples. In addition, implicits for most Spark DataFrame types are supported (MapType and boolean are missing).
Custom Case Classes
If you want to use your own custom class with
LuceneRDD you can do it provided that your class member types are one of the primitive types (Int, Float, Double, Long, String).
For more details, see
LuceneRDDCustomcaseClassImplicits under the tests directory.
A docker compose script is setup with some preliminary notebook in Zeppelin, run
For more LuceneRDD examples on Zeppelin, check these examples
Build from Source
Install Java, SBT and clone the project
git clone https://github.com/zouzias/spark-lucenerdd.git cd spark-lucenerdd sbt compile assembly
The above will create an assembly jar containing spark-lucenerdd functionality under