Scala URL Detector

A robust Scala library that detects and extracts URLs from unstructured text with support for multiple content formats.

Based on LinkedIn Engineering's URL Detector, this library provides a type-safe, functional Scala API for extracting URLs from text in various formats including HTML, XML, JSON, and JavaScript.

Features

Multiple Detection Modes: Support for HTML, XML, JSON, JavaScript, and plain text
Smart URL Parsing: Handles URLs with or without schemes, protocol-relative URLs, and encoded characters
Host Filtering: Allow or deny specific hosts with intelligent subdomain matching
Format-Aware Extraction: Context-aware detection for different content types (quotes, brackets, delimiters)
IPv4 & IPv6 Support: Recognizes both IPv4 and IPv6 addresses
TLD Validation: Validates URLs against public suffix lists
Email Filtering: Automatically excludes email addresses from detection
Type-Safe API: Uses scala-uri for strongly-typed URL representations
Cross-Built: Published for Scala 2.12, 2.13, and 3.x
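
To make the host filtering feature more concrete, here is a minimal sketch of one plausible reading of "subdomain matching": a rule for lambdaworks.io also covers www.lambdaworks.io, but not an unrelated host that merely ends with the same characters. The names HostMatchSketch, matchesHost, and isPermitted are hypothetical and not part of scurl-detector; the library's actual matching rules may differ.

```scala
import java.util.Locale

// Illustrative only: NOT the library's implementation.
object HostMatchSketch {

  // Normalise case and drop a trailing dot so "Example.com." equals "example.com".
  private def normalise(host: String): String =
    host.toLowerCase(Locale.ROOT).stripSuffix(".")

  // True when `candidate` is the rule host itself or one of its subdomains.
  def matchesHost(candidate: String, rule: String): Boolean = {
    val c = normalise(candidate)
    val r = normalise(rule)
    c == r || c.endsWith("." + r)
  }

  // An empty allow list permits every host; a deny rule always wins.
  def isPermitted(host: String, allowed: Set[String], denied: Set[String]): Boolean = {
    val allowedOk = allowed.isEmpty || allowed.exists(rule => matchesHost(host, rule))
    val notDenied = !denied.exists(rule => matchesHost(host, rule))
    allowedOk && notDenied
  }
}
```

Requiring a leading dot before the rule host is what keeps notlambdaworks.io from matching a rule for lambdaworks.io.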

Quick Start

Add the following to your build.sbt, replacing version with the release you want to use:

libraryDependencies += "io.lambdaworks" %% "scurl-detector" % "version"

Basic Usage

import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl

// Use default detector
val detector = UrlDetector.default
val urls: Set[AbsoluteUrl] = detector.extract("Check out https://example.com and lambdaworks.io")

// Use with specific options
import io.lambdaworks.detection.UrlDetectorOptions

val jsonDetector = UrlDetector(UrlDetectorOptions.Json)
val extractedUrls = jsonDetector.extract("""{"url": "https://api.example.com/v1"}""")

// Filter by allowed hosts
import io.lemonlabs.uri.Host

val filtered = UrlDetector.default
  .withAllowed(Host.parse("lambdaworks.io"))
  .extract("Visit lambdaworks.io and example.com")  // Only the lambdaworks.io URL is returned
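
Because extract returns a Set[AbsoluteUrl], results can be post-processed with the scala-uri API rather than with string manipulation. The sketch below assumes scala-uri is on the classpath and builds the set by hand as a stand-in for a detector result; ExtractedUrlSketch and its values are hypothetical names used only for illustration.

```scala
import io.lemonlabs.uri.AbsoluteUrl

object ExtractedUrlSketch {

  // Stand-in for the value returned by detector.extract(...).
  val urls: Set[AbsoluteUrl] = Set(
    AbsoluteUrl.parse("https://api.example.com/v1"),
    AbsoluteUrl.parse("http://example.com/docs")
  )

  // Collect the distinct hosts as plain strings.
  val hosts: Set[String] = urls.map(_.host.toString)

  // Keep only the URLs that use the https scheme.
  val secureUrls: Set[AbsoluteUrl] = urls.filter(_.scheme == "https")
}
```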

Detection Options

The library supports 9 different detection modes optimized for various content types:

Default: Basic URL detection
QuoteMatch: Handles double-quoted URLs
SingleQuoteMatch: Handles single-quoted URLs
BracketMatch: Handles URLs in brackets/parentheses
Json: Optimized for JSON content
Javascript: Optimized for JavaScript code
Xml: Optimized for XML documents
Html: Optimized for HTML content
AllowSingleLevelDomain: Allows single-level domains like http://localhost
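
As a hedged illustration of choosing a mode, the sketch below constructs detectors the same way the Json example in Basic Usage does, using option names taken from the list above. The sample inputs and the DetectionModeSketch object are made up for illustration, and the exact URLs returned depend on the library's behaviour, so no expected output is shown.

```scala
import io.lambdaworks.detection.{UrlDetector, UrlDetectorOptions}
import io.lemonlabs.uri.AbsoluteUrl

object DetectionModeSketch {

  // HTML input: the URL sits inside a quoted attribute value.
  val htmlDetector = UrlDetector(UrlDetectorOptions.Html)
  val htmlUrls: Set[AbsoluteUrl] =
    htmlDetector.extract("""<a href="https://example.com/docs">Docs</a>""")

  // Prose input: the URL is wrapped in parentheses.
  val bracketDetector = UrlDetector(UrlDetectorOptions.BracketMatch)
  val bracketUrls: Set[AbsoluteUrl] =
    bracketDetector.extract("See the guide (https://example.com/guide) for details")
}
```

Picking the mode that matches the input format presumably helps the detector decide where a URL ends when it is surrounded by quotes, brackets, or markup.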

See the full documentation for detailed information about each option.

Documentation

Contributing

We welcome contributions! Please see our Contribution Guidelines for details.

License

This project is licensed under Apache 2.0.

lambdaworks / scurl-detector 1.3.0