License: Apache License 2.0

Solr Connector For Almaren Framework

Solr Connector

Solr-Almaren

The Solr Connector is built on spark-solr (https://github.com/lucidworks/spark-solr) and works only with SolrCloud. For the full list of options the connector accepts, see the spark-solr documentation.

To add the Solr Almaren dependency to your sbt build:

libraryDependencies += "com.github.music-of-the-ainur" %% "solr-almaren" % "0.3.5-3.4"

To run in spark-shell:

spark-shell --master "local[*]" --packages "com.github.music-of-the-ainur:almaren-framework_2.12:0.9.10-3.4,com.github.music-of-the-ainur:solr-almaren_2.12:0.3.5-3.4"

The Solr Connector is available in the Maven Central repository.

Version                      Connector Artifact
---------------------------  ---------------------------------------------------------
Spark 3.4.x and Scala 2.12   com.github.music-of-the-ainur:solr-almaren_2.12:0.3.5-3.4
Spark 3.3.x and Scala 2.12   com.github.music-of-the-ainur:solr-almaren_2.12:0.3.5-3.3
Spark 3.2.x and Scala 2.12   com.github.music-of-the-ainur:solr-almaren_2.12:0.3.5-3.2
Spark 3.1.x and Scala 2.12   com.github.music-of-the-ainur:solr-almaren_2.12:0.3.5-3.1
Spark 2.4.x and Scala 2.11   com.github.music-of-the-ainur:solr-almaren_2.11:0.3.5-2.4
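
As a sketch of how the table maps onto a build, the version suffix of the connector can be derived from the Spark line you target. The sparkLine name and the Scala patch versions below are assumptions introduced for illustration; only the artifact coordinates come from the table.

```scala
// build.sbt (sketch)
// Assumption: sparkLine is one of "3.4", "3.3", "3.2", "3.1" (Scala 2.12) or "2.4" (Scala 2.11).
val sparkLine = "3.3"

// Assumption: any 2.12.x / 2.11.x patch release should do; these two are examples.
scalaVersion := (if (sparkLine.startsWith("3.")) "2.12.17" else "2.11.12")

// %% appends the Scala binary suffix (_2.12 or _2.11) to the artifact name.
libraryDependencies += "com.github.music-of-the-ainur" %% "solr-almaren" % s"0.3.5-$sparkLine"
```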

Source and Target

Source

Parameters

Parameters               Description
-----------------------  ------------------------------------------------------------------------------
collection               collection name
ZookeeperHost(zkhost)    ZooKeeper connection string (e.g. localhost:9983)

Options                  Description (example value)
-----------------------  ------------------------------------------------------------------------------
query                    Solr query that limits the rows loaded into Spark ("body_t:solr")
fields                   subset of fields to retrieve ("id,author_s,favorited_b")
filters                  filter queries applied to document values ("firstName:Sam,lastName:Powell")
rows                     number of rows to retrieve per request, i.e. the page size (100)
max_rows                 limits the result set to a maximum number of rows (5000)
request_handler          Solr request handler used for queries ("/export", "/select")

Example

import com.github.music.of.the.ainur.almaren.Almaren
import com.github.music.of.the.ainur.almaren.solr.Solr.SolrImplicit

val almaren = Almaren("App Name")

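// Read: option values are strings; "fields" selects the fields to load.
almaren.builder.sourceSolr("collection","zkHost1:2181,zkHost2:2181",Map("fields" -> "first_name,last_name","rows" -> "100"))

// Write: see the Target section for the available options.
val options = Map("batch_size" -> "1000")
almaren.builder.targetSolr("collection","zkHost1:2181,zkHost2:2181",options)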
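
The source options can also be assembled from typed values before they are handed to sourceSolr. The following is a minimal sketch, assuming the connector takes its options as Map[String, String] (as the string-valued options in the Target example suggest). SolrSourceOptions is a hypothetical helper, not part of Almaren, and the "*:*" default query is an assumption.

```scala
// Sketch: build the options map for sourceSolr from typed values.
// SolrSourceOptions is a hypothetical helper, not part of the connector.
object SolrSourceOptions {
  def build(
      query: String = "*:*", // assumption: match-all query as the default
      fields: Seq[String] = Nil,
      filters: Seq[String] = Nil,
      rows: Option[Int] = None,
      maxRows: Option[Int] = None,
      requestHandler: Option[String] = None
  ): Map[String, String] = {
    val entries: Seq[Option[(String, String)]] = Seq(
      Some("query" -> query),
      // Lists are joined with commas, matching the example values in the table.
      if (fields.nonEmpty) Some("fields" -> fields.mkString(",")) else None,
      if (filters.nonEmpty) Some("filters" -> filters.mkString(",")) else None,
      // Numbers are converted to strings; unset options are left out.
      rows.map(n => "rows" -> n.toString),
      maxRows.map(n => "max_rows" -> n.toString),
      requestHandler.map(h => "request_handler" -> h)
    )
    entries.flatten.toMap
  }
}
```

For instance, SolrSourceOptions.build(query = "body_t:solr", fields = Seq("id", "author_s"), rows = Some(100)) yields a map that can be passed as the third argument of sourceSolr.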

Target

Parameters

Parameters               Description
-----------------------  ----------------------------------------------------------------
collection               collection name
ZookeeperHost(zkhost)    ZooKeeper connection string (e.g. localhost:9983)
SaveMode                 Spark save mode (e.g. SaveMode.ErrorIfExists)

Options                  Description (example value)
-----------------------  ----------------------------------------------------------------
soft_commit_secs         soft commit interval in seconds (10)
commit_within            force a commit within the given number of milliseconds (5000)
batch_size               number of documents sent in one HTTP call (1000)
gen_uniq_key             generate a unique key for each document (true)
solr_field_types         field types to use in Solr ("rating:string,title:text_en")
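
The target options, including the field:type encoding used by solr_field_types, can be kept in a small typed wrapper. This is a sketch under the assumption that targetSolr takes Map[String, String]. SolrTargetOptions and its field names are hypothetical and introduced here only for illustration.

```scala
// Sketch: typed wrapper for the target options listed above.
// SolrTargetOptions is a hypothetical helper, not part of the connector.
final case class SolrTargetOptions(
    softCommitSecs: Option[Int] = None,
    commitWithinMs: Option[Int] = None,
    batchSize: Option[Int] = None,
    genUniqKey: Option[Boolean] = None,
    solrFieldTypes: Seq[(String, String)] = Nil
) {
  def toMap: Map[String, String] = {
    // Encode (field, type) pairs as "field:type,field:type".
    val fieldTypes: Option[String] =
      if (solrFieldTypes.isEmpty) None
      else Some(solrFieldTypes.map { case (field, tpe) => s"$field:$tpe" }.mkString(","))

    // Unset options are omitted so the connector's own defaults apply.
    Seq(
      softCommitSecs.map(v => "soft_commit_secs" -> v.toString),
      commitWithinMs.map(v => "commit_within" -> v.toString),
      batchSize.map(v => "batch_size" -> v.toString),
      genUniqKey.map(v => "gen_uniq_key" -> v.toString),
      fieldTypes.map(v => "solr_field_types" -> v)
    ).flatten.toMap
  }
}
```

Calling toMap on SolrTargetOptions(batchSize = Some(1000), solrFieldTypes = Seq("rating" -> "string", "title" -> "text_en")) gives a map suitable for the options argument of targetSolr.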

Example

import com.github.music.of.the.ainur.almaren.solr.Solr.SolrImplicit
import com.github.music.of.the.ainur.almaren.builder.Core.Implicit
import com.github.music.of.the.ainur.almaren.Almaren
import org.apache.spark.sql.SaveMode

val almaren = Almaren("App Name")

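almaren.builder
    // Derive a deterministic id by hashing every column of the row.
    .sourceSql("""SELECT sha2(concat_ws("",array(*)),256) as id,*,current_timestamp from deputies""")
    // Reduce to 30 partitions before writing.
    .coalesce(30)
    // Index into the "deputies" collection: 100000 documents per request, commit within 10 seconds.
    .targetSolr("deputies","cloudera:2181,cloudera1:2181,cloudera2:2181/solr",Map("batch_size" -> "100000","commit_within" -> "10000"),SaveMode.Overwrite)
    // Run the pipeline as a batch job.
    .batch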
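
The id expression in the query, sha2(concat_ws("",array(*)),256), can be illustrated for a single row in plain Scala. This is only a sketch: RowId is a hypothetical helper, not part of Almaren or Spark, and it assumes every column value has already been rendered as a string.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Sketch of sha2(concat_ws("", array(*)), 256) for one row.
// RowId is a hypothetical helper used only for illustration.
object RowId {
  def of(columns: Seq[String]): String = {
    // concat_ws with an empty separator: the values are simply concatenated.
    val joined = columns.mkString("")
    val digest = MessageDigest
      .getInstance("SHA-256")
      .digest(joined.getBytes(StandardCharsets.UTF_8))
    // Render the 32 digest bytes as a lowercase hex string.
    digest.map("%02x".format(_)).mkString
  }
}
```

Because the id depends only on the row's content, the same row hashes to the same id on every run. Note that with an empty separator, rows such as ("ab", "c") and ("a", "bc") hash to the same value, so a non-empty separator may be preferable when column boundaries matter.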