A robust Scala library that detects and extracts URLs from unstructured text with support for multiple content formats.
Based on LinkedIn Engineering's URL Detector, this library provides a type-safe, functional Scala API for extracting URLs from text in various formats including HTML, XML, JSON, and JavaScript.
- Multiple Detection Modes: Support for HTML, XML, JSON, JavaScript, and plain text
- Smart URL Parsing: Handles URLs with or without schemes, protocol-relative URLs, and encoded characters
- Host Filtering: Allow or deny specific hosts with intelligent subdomain matching
- Format-Aware Extraction: Context-aware detection for different content types (quotes, brackets, delimiters)
- IPv4 & IPv6 Support: Recognizes both IPv4 and IPv6 addresses
- TLD Validation: Validates URLs against public suffix lists
- Email Filtering: Automatically excludes email addresses from detection
- Type-Safe API: Uses scala-uri for strongly-typed URL representations
- Cross-Platform: Published for Scala 2.12, 2.13, and 3.x
Add the following to your build.sbt:
libraryDependencies += "io.lambdaworks" %% "scurl-detector" % "version"import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl
// Use default detector
val detector = UrlDetector.default
val urls: Set[AbsoluteUrl] = detector.extract("Check out https://example.com and lambdaworks.io")
// Use with specific options
import io.lambdaworks.detection.UrlDetectorOptions
val jsonDetector = UrlDetector(UrlDetectorOptions.Json)
val extractedUrls = jsonDetector.extract("""{"url": "https://api.example.com/v1"}""")
// Filter by allowed hosts
import io.lemonlabs.uri.Host
val filtered = UrlDetector.default
.withAllowed(Host.parse("lambdaworks.io"))
.extract("Visit lambdaworks.io and example.com") // Only returns lambdaworks.ioThe library supports 9 different detection modes optimized for various content types:
- Default: Basic URL detection
- QuoteMatch: Handles double-quoted URLs
- SingleQuoteMatch: Handles single-quoted URLs
- BracketMatch: Handles URLs in brackets/parentheses
- Json: Optimized for JSON content
- Javascript: Optimized for JavaScript code
- Xml: Optimized for XML documents
- Html: Optimized for HTML content
- AllowSingleLevelDomain: Allows single-level domains like
http://localhost
See the full documentation for detailed information about each option.
We welcome contributions! Please see our Contribution Guidelines for details.
This project is licensed under Apache 2.0.