lambdaworks / scurl-detector   1.3.0


Scala library that detects and extracts URLs from text.

Scala versions: 3.x 2.13 2.12


Scala URL Detector

A robust Scala library that detects and extracts URLs from unstructured text with support for multiple content formats.

Based on LinkedIn Engineering's URL Detector, this library provides a type-safe, functional Scala API for extracting URLs from text in various formats including HTML, XML, JSON, and JavaScript.

Features

  • Multiple Detection Modes: Support for HTML, XML, JSON, JavaScript, and plain text
  • Smart URL Parsing: Handles URLs with or without schemes, protocol-relative URLs, and encoded characters
  • Host Filtering: Allow or deny specific hosts with intelligent subdomain matching
  • Format-Aware Extraction: Context-aware detection for different content types (quotes, brackets, delimiters)
  • IPv4 & IPv6 Support: Recognizes both IPv4 and IPv6 addresses
  • TLD Validation: Validates URLs against public suffix lists
  • Email Filtering: Automatically excludes email addresses from detection
  • Type-Safe API: Uses scala-uri for strongly-typed URL representations
  • Cross-Platform: Published for Scala 2.12, 2.13, and 3.x

Quick Start

Add the following to your build.sbt, replacing "version" with the release you want to use (for example 1.3.0):

libraryDependencies += "io.lambdaworks" %% "scurl-detector" % "version"

Basic Usage

import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl

// Use default detector
val detector = UrlDetector.default
val urls: Set[AbsoluteUrl] = detector.extract("Check out https://example.com and lambdaworks.io")

// Use with specific options
import io.lambdaworks.detection.UrlDetectorOptions

val jsonDetector = UrlDetector(UrlDetectorOptions.Json)
val extractedUrls = jsonDetector.extract("""{"url": "https://api.example.com/v1"}""")

// Filter by allowed hosts
import io.lemonlabs.uri.Host

val filtered = UrlDetector.default
  .withAllowed(Host.parse("lambdaworks.io"))
  .extract("Visit lambdaworks.io and example.com")  // Only returns lambdaworks.io
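
The host filter above keeps lambdaworks.io and drops example.com. As a rough illustration of what subdomain-aware matching can look like, here is a minimal sketch in plain Scala. It assumes that an allowed host also admits its subdomains; the object HostMatchSketch and the helper hostMatches are hypothetical names used for illustration only and are not part of this library's API.

```scala
object HostMatchSketch {
  // Hypothetical helper, not part of scurl-detector.
  // Returns true when `candidate` is exactly `allowed` or one of its subdomains.
  def hostMatches(candidate: String, allowed: String): Boolean = {
    val c = candidate.toLowerCase
    val a = allowed.toLowerCase
    // Require a dot before the allowed host so the match falls on a label boundary.
    c == a || c.endsWith("." + a)
  }
}
```

Under this assumption, docs.lambdaworks.io matches lambdaworks.io, while notlambdaworks.io does not, because the comparison is anchored on a dot rather than on a raw suffix.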

Detection Options

The library provides nine detection options, most of them tuned to a particular content type:

  • Default: Basic URL detection
  • QuoteMatch: Handles double-quoted URLs
  • SingleQuoteMatch: Handles single-quoted URLs
  • BracketMatch: Handles URLs in brackets/parentheses
  • Json: Optimized for JSON content
  • Javascript: Optimized for JavaScript code
  • Xml: Optimized for XML documents
  • Html: Optimized for HTML content
  • AllowSingleLevelDomain: Allows single-level domains like http://localhost

See the full documentation for detailed information about each option.
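
To make the idea behind the quote and bracket options more concrete, the sketch below shows one way a candidate token could be unwrapped from its surrounding delimiters before it is treated as a URL. This is an assumption about the general technique, not a description of how the library is implemented; DelimiterSketch, Mode, and unwrap are hypothetical names that only mirror the option names listed above.

```scala
object DelimiterSketch {
  // Hypothetical modes mirroring three of the option names above.
  sealed trait Mode
  case object QuoteMatch extends Mode
  case object SingleQuoteMatch extends Mode
  case object BracketMatch extends Mode

  // Opening and closing characters considered by the bracket mode (an assumption).
  private val bracketPairs = List('(' -> ')', '[' -> ']', '{' -> '}')

  // Removes one pair of matching delimiters from `token`, or returns it unchanged.
  def unwrap(token: String, mode: Mode): String = {
    def strip(open: Char, close: Char): Option[String] =
      if (token.length >= 2 && token.head == open && token.last == close)
        Some(token.substring(1, token.length - 1))
      else None

    val stripped: Option[String] = mode match {
      case QuoteMatch       => strip('"', '"')
      case SingleQuoteMatch => strip('\'', '\'')
      case BracketMatch     =>
        bracketPairs.flatMap { case (o, c) => strip(o, c).toList }.headOption
    }
    stripped.getOrElse(token)
  }
}
```

A token that is not wrapped in the delimiters for the chosen mode is returned as-is, so the helper is safe to apply to every candidate.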

Documentation

Contributing

We welcome contributions! Please see our Contribution Guidelines for details.

License

This project is licensed under the Apache License 2.0.