This project is a Hyperscan wrapper for Spark that allows matching large numbers (up to tens of thousands) of regular expressions.
Run with spark-shell:

```shell
spark-shell --packages ru.napalabs.spark:spark-hscan_2.11:0.1
```

Or add the sbt dependency:

```scala
libraryDependencies += "ru.napalabs.spark" % "spark-hscan_2.11" % "0.1"
```
```scala
import ru.napalabs.spark.hscan.implicits._

spark.registerHyperscanFuncs()
val df = spark.sql("""
select * from my_table
where hlike(text_field, array("pattern.*", "[a-zA-Z]+other"))"""
)
```
```scala
import ru.napalabs.spark.hscan.functions._

val df = spark.read
  .format("parquet")
  .load("/path/to/files")
df.where(hlike($"text_col", Array("pattern.*", "[a-zA-Z]+other")))
```
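For comparison, without Hyperscan the same filter is typically expressed by OR-ing Spark's built-in `rlike` over each pattern, so every regex is evaluated separately against the column. A minimal sketch (column name and patterns are illustrative):

```scala
import org.apache.spark.sql.functions.col

// Plain-Spark equivalent: one rlike predicate per pattern, combined with OR.
// Cost grows roughly linearly with the number of patterns, which is what
// Hyperscan's single multi-pattern scan in hlike avoids.
val patterns = Seq("pattern.*", "[a-zA-Z]+other")
val condition = patterns
  .map(p => col("text_col").rlike(p))
  .reduce(_ || _)
val filtered = df.where(condition)
```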
As a benchmark we used hsbench (the teakettle_2500 pattern set and the alexa200.db dataset).
See the limitations listed in the hyperscan-java project.
Also, this project has been tested only with Spark 2.3.2; compatibility with other Spark versions is not guaranteed.
Feel free to raise issues or submit a pull request.