A Scala web scraping library, based on Scalext, for building Akka actor systems that scrape and collect data from any type of website.
Scalescrape is available on Maven Central (since version 0.4.0), and it is cross compiled and published for Scala 2.12 and 2.11.
Older artifact versions are no longer available due to the shutdown of my self-hosted Nexus Repository in favour of Bintray.
Using SBT, add the following dependency to your build file:
libraryDependencies ++= Seq(
  "io.bfil" %% "scalescrape" % "0.4.1"
)

If you have issues resolving the dependency, you can add the following resolver:
resolvers += Resolver.bintrayRepo("bfil", "maven")

The library offers two main actor traits that can be extended:
- A ScrapingActor, which can be used to define the web scraping logic of an actor
- A CollectionActor, which can be used to communicate with a ScrapingActor and collect all the data needed
The following example gives some insight into how to use the library.
The first step is to try to create a representation of the website that we are going to scrape, something like the following:
class ExampleWebsite {
  private val baseUrl = "http://www.example.com"
  val homePage = s"$baseUrl/home"
  def loginForm(username: String, password: String) =
    Form(s"$baseUrl/vm_sso/idp/login.action", Map(
      "username" -> username,
      "password" -> password))
  def updateAccountEmailRequest(newEmail: String) =
    Request(s"$baseUrl/account/update", s"""{"email": "$newEmail" }""")
}

The ExampleWebsite defines the URL of the homepage, a login form, and a request object that can be used to update the account email on the example website.
Form and Request are part of the library, and are used to define the forms and requests that you need to submit in order to scrape the website.
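One thing to watch in updateAccountEmailRequest: the JSON body is built by string interpolation, so a newEmail containing a double quote or a backslash would produce invalid JSON. The following is a minimal sketch of an escaping helper that uses only the Scala standard library. The names JsonBody, jsonString and emailUpdateBody are hypothetical and not part of Scalescrape; a JSON library would be the more robust choice in a real project.

```scala
object JsonBody {
  // Hypothetical helper: wraps a value in quotes and escapes the characters
  // that would otherwise break a JSON string literal.
  def jsonString(value: String): String = {
    val sb = new StringBuilder("\"")
    value.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case '\r' => sb.append("\\r")
      case '\t' => sb.append("\\t")
      // Remaining control characters are written as a 4-digit hex escape.
      case c if c < ' ' => sb.append('\\').append('u').append("%04x".format(c.toInt))
      case c => sb.append(c)
    }
    sb.append('"').toString
  }

  // Builds the same body shape as the example, with the email safely escaped.
  def emailUpdateBody(newEmail: String): String =
    s"""{"email": ${jsonString(newEmail)}}"""
}
```

The result of emailUpdateBody could then be passed as the second argument of Request in place of the interpolated string.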
The following will be the message protocol used by the actors to communicate:
object ExampleProtocol {
  case class UpdateAccountEmailWithCredentials(username: String, password: String, newEmail: String)
  case class Login(username: String, password: String)
  case object LoggedIn
  case object LoginFailed
  case class UpdateAccountEmail(newEmail: String)
  case object EmailUpdated
  case object EmailUpToDate
}

An example scraping actor can be defined like this:
class ExampleScraper extends ScrapingActor {
  // actor logic
}

In our actor logic we create an instance of our ExampleWebsite for later use, and a variable to store some session cookies:
val website = new ExampleWebsite
var savedCookies: Map[String, HttpCookie] = Map.empty

In order to do anything on the website we have to log in first, so let's define a method on the actor that logs a user in using their credentials:
private def login(username: String, password: String) =
  scrape { // (1)
    postForm(website.loginForm(username, password)) { response => // (2)
      response.asHtml { doc => // (3)
        doc.$("title").text match { // (4)
          case "Login error" => complete(LoginFailed) // (5)
          case _ =>
            cookies { cookies => // (6)
              savedCookies = cookies // (7)
              complete(LoggedIn) // (8)
            }
        }
      }
    }
  }

1. Uses the scrape method to initialize the scraping action
2. Posts the login form using the postForm method, passing the Form instance
3. Parses the response as HTML and provides a JSoup document
4. Uses JSoup to get the text of the title tag
5. If the title tag is "Login error" it completes by sending back LoginFailed
6. Otherwise it gets the cookies from the current session
7. Stores the session cookies in our actor variable
8. Completes and sends back LoggedIn
Please note that the ScrapingActor automatically retains cookies between requests that are part of the same action (between scrape and complete). The cookies can be manipulated using the actions addCookie, dropCookie, and withCookies.
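To make the semantics of the three cookie actions concrete, here is a small self-contained model of a cookie jar carried in an immutable context. This is only a sketch of the behaviour described in this README, not the library's implementation: Cookie and Context are hypothetical stand-ins for HttpCookie and ScrapingContext, and the methods are plain functions rather than DSL actions.

```scala
// Hypothetical stand-in for the library's HttpCookie; only name and value are modelled.
final case class Cookie(name: String, value: String)

// Immutable model of a scraping context: each operation returns a new context
// instead of mutating the existing one.
final case class Context(cookies: Map[String, Cookie] = Map.empty) {
  // Adds (or overwrites) a single cookie, keyed by its name.
  def addCookie(cookie: Cookie): Context =
    copy(cookies = cookies + (cookie.name -> cookie))

  // Removes the cookie with the given name, if present.
  def dropCookie(name: String): Context =
    copy(cookies = cookies - name)

  // Replaces the whole cookie jar with the one provided.
  def withCookies(replacement: Map[String, Cookie]): Context =
    copy(cookies = replacement)
}

object CookieDemo {
  def main(args: Array[String]): Unit = {
    val saved = Map("session" -> Cookie("session", "abc123"))

    val ctx = Context()
      .withCookies(saved)              // start from previously saved cookies
      .addCookie(Cookie("lang", "en")) // add one more
      .dropCookie("session")           // and remove the session cookie

    println(ctx.cookies.keySet) // only the "lang" cookie remains
  }
}
```

Because every operation returns a new context, the original cookie map (saved above) is never modified, which mirrors the way each action is described as calling the inner function with a new context.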
After logging in we can use the session cookies to perform other actions as an authenticated user. Let's create a method to update our email address on the website:
private def updateAccountEmail(newEmail: String) =
  scrape { // (1)
    withCookies(savedCookies) { // (2)
      get(website.homePage) { response => // (3)
        response.asHtml { doc => // (4)
          val currentEmail = doc.$("#account-email").text // (5)
          if (currentEmail != newEmail) { // (6)
            post(website.updateAccountEmailRequest(newEmail)) { response => // (7)
              response.asJson { jsonResponse => // (8)
                (jsonResponse \ "error") match {
                  case JString(message) => fail // (9)
                  case _ => complete(EmailUpdated) // (10)
                }
              }
            }
          } else complete(EmailUpToDate) // (11)
        }
      }
    }
  }

1. Uses the scrape method to initialize the scraping action
2. Adds the session cookies we saved previously to the scraping context so that they will be sent with the following requests
3. Gets the homepage of the example website
4. Parses the response as HTML
5. Gets the value of the current account email from the JSoup document
6. If the current email is different from the one we want to set
7. Posts a JSON request to the website to update our email
8. Parses the response as JSON and checks if there is an error message
9. Fails if the update email response contains an error message
10. Completes and sends back EmailUpdated if the email update was successful
11. If the current email is the same as the one we want to set we complete and send back EmailUpToDate
Finally, we can define our actor's receive method:
def receive = {
  case Login(username, password) => login(username, password)
  case UpdateAccountEmail(newEmail) => updateAccountEmail(newEmail)
}

This actor can now be used to log in to our example website and update our email address by sending the appropriate messages to it.
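As an illustration of sending a message and waiting for the reply, the sketch below uses the standard Akka ask pattern. It assumes Akka is on the classpath and uses a hypothetical StubScraper that answers directly instead of scraping, so that the example stays self-contained. With the real ExampleScraper the calling code would look the same. AskDemo, askLogin and the "alice"/"secret" credentials are made up for the example.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Messages mirroring the ones defined in ExampleProtocol.
case class Login(username: String, password: String)
case object LoggedIn
case object LoginFailed

// Hypothetical stub standing in for ExampleScraper: it replies directly
// instead of performing any HTTP request.
class StubScraper extends Actor {
  def receive = {
    case Login("alice", "secret") => sender() ! LoggedIn
    case Login(_, _)              => sender() ! LoginFailed
  }
}

object AskDemo {
  // Starts an actor system, asks the stub to log in, and returns the reply.
  def askLogin(username: String, password: String): Any = {
    val system = ActorSystem("ask-demo")
    implicit val timeout: Timeout = Timeout(5.seconds)
    try {
      val scraper = system.actorOf(Props[StubScraper], "scraper")
      // `?` returns a Future; blocking with Await is acceptable in a demo only.
      Await.result(scraper ? Login(username, password), timeout.duration)
    } finally system.terminate()
  }

  def main(args: Array[String]): Unit =
    println(askLogin("alice", "secret"))
}
```

Inside another actor you would normally avoid Await and either pipe the Future back to yourself or, as the next part of this README shows, let a CollectionActor drive the conversation.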
Let's continue and create an actor that performs both actions for us.
An example collection actor can be defined like this:
class ExampleCollector extends CollectionActor[ExampleScraper] {
  // actor logic
}

Here's the actor logic:
def receive = {
  case UpdateAccountEmailWithCredentials(username, password, newEmail) =>
    collect { // (1)
      askTo(Login(username, password)) { // (2)
        case LoggedIn => // (3)
          askTo(UpdateAccountEmail(newEmail)) { // (4)
            case x => complete(x) // (5)
          }
        case LoginFailed => complete(LoginFailed) // (6)
      }
    }
}

1. Uses the collect method to initialize the collection action by creating an ExampleScraper actor under the hood
2. Asks the scraper to log in with the credentials received
3. If the scraper returns LoggedIn
4. It goes on by asking it to UpdateAccountEmail with the new email
5. Then it completes and sends back whatever is received from the scraper as the response of the action (the complete action kills the internal scraping actor)
6. In case the login fails it sends back LoginFailed
This was a simple example of some of the capabilities of the library. For more details, refer to the documentation.
The main components of Scalescrape are the ScrapingActor and the CollectionActor traits.
To understand the details of the internal mechanics of the DSL read the documentation of Scalext.
You can create a scraping Akka actor and use the scraping DSL by extending the ScrapingActor trait.
scrape
def scrape[T](scrapingAction: Action)(implicit ac: ActorContext): Unit

It creates a ScrapingContext with a reference to the current message sender and an empty cookie jar, and passes it to the inner action:
scrape {
  ctx => println(ctx.requestor, ctx.cookies) // current sender, cookies
}

get
def get(url: String)(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]

It sends a GET request to the URL provided and passes the response into the inner action:
get("http://www.example.com/home") { response =>
  ctx => Unit
}

post
def post[T](request: Request[T])(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]

It sends the POST request and passes the response into the inner action:
post(Request("http://www.example.com/update", "some data")) { response =>
  ctx => Unit
}

postForm
def postForm[T](form: Form)(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]

It sends the POST request with form data and passes the response into the inner action:
postForm(Form("http://www.example.com/submit-form", Map("some" -> "data"))) { response =>
  ctx => Unit
}

put
def put[T](request: Request[T])(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]

It sends the PUT request and passes the response into the inner action:
put(Request("http://www.example.com/update", "some data")) { response =>
  ctx => Unit
}

delete
def delete[T](request: Request[T])(implicit ec: ExecutionContext, ac: ActorContext): ChainableAction1[HttpResponse]

It sends the DELETE request and passes the response into the inner action:
delete(Request("http://www.example.com/update", "some data")) { response =>
  ctx => Unit
}

cookies
def cookies: ChainableAction1[Map[String, HttpCookie]]

It extracts the cookies from the current context and passes them into the inner function:
cookies { cookies =>
  ctx => Unit
}

withCookies
def withCookies(cookies: Map[String, HttpCookie]): ChainableAction0

It replaces the cookies of the current context with the ones specified and calls the inner function with the new context:
withCookies(newCookies) {
  ctx => Unit
}

addCookie
def addCookie(cookie: HttpCookie): ChainableAction0

Adds a cookie to the current context and calls the inner function with the new context:
addCookie(newCookie) {
  ctx => Unit
}

dropCookie
def dropCookie(cookieName: String): ChainableAction0

Removes the cookie with the given name from the current context and calls the inner function with the new context:
dropCookie("someCookie") {
  ctx => Unit
}

complete
def complete[T](message: Any): ActionResult

Completes the scraping action by sending the specified message back to the original sender:
complete("done")

fail
def fail: ActionResult

Returns an Akka status failure message back to the original sender:
fail

You can create a collection Akka actor and use the collection DSL by extending the CollectionActor[T] trait, where T is a ScrapingActor.
collect
def collect(collectionAction: Action)(implicit tag: ClassTag[Scraper], ac: ActorContext): Unit

It spawns an instance of the ScrapingActor specified as a type parameter under the hood. It creates a CollectionContext with a reference to the scraping actor and to the current message sender, and passes the context to the inner action:
collect {
  ctx => println(ctx.requestor, ctx.scraper) // current sender, scraping actor
}

collectUsingScraper
def collectUsingScraper(scraper: ActorRef)(collectionAction: Action)(implicit ac: ActorContext): Unit

It creates a CollectionContext with a reference to the scraping actor specified and to the current message sender, and passes the context to the inner action:
collectUsingScraper(myScrapingActor) {
  ctx => println(ctx.requestor, ctx.scraper) // current sender, scraping actor
}

askTo
def askTo(messages: Any)(implicit ec: ExecutionContext): ChainableAction1[Any]

It sends messages (using akka.pattern.ask) to the scraping actor in the collection context and passes the received messages to the inner action.
Please note: it currently handles only up to 3 parameters correctly.
askTo("say hello") {
  case "hello" => complete("thanks")
  case _ => fail
}
askTo("say hello", "say world") {
  case ("hello", "world") => complete("thanks")
  case _ => fail
}
askTo("say hello", "say world", "say bye") {
  case ("hello", "world", "bye") => complete("bye")
  case _ => fail
}

scraper
def scraper: ChainableAction1[ActorRef]

It extracts the scraper from the current context and passes it into the inner function:
scraper { scraper =>
  ctx => Unit
}

withScraper
def withScraper(scraper: ActorRef): ChainableAction0

It replaces the scraper of the current context with the one specified and calls the inner function with the new context:
withScraper(newScraper) {
  ctx => Unit
}

notify
def notify[T](message: Any): ChainableAction0

Sends a message back to the original sender and calls the inner action:
notify("hello") {
  ctx => Unit
}

complete
def complete[T](message: Any): ActionResult

Completes the collection action by sending the specified message back to the original sender:
complete("done")

keepAlive
def keepAlive: ActionResult

Completes the collection action without sending any message back to the original sender, and keeps the scraping actor alive:
keepAlive

fail
def fail: ActionResult

Returns an Akka status failure message back to the original sender and kills the scraping actor:
fail

This software is licensed under the Apache 2 license, quoted below.
Copyright © 2014-2017 Bruno Filippone http://bfil.io
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
[http://www.apache.org/licenses/LICENSE-2.0]
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.