Architecture & Design
Overview
Scala URL Detector is built on top of LinkedIn's URL Detector library, providing a functional, type-safe Scala API for extracting URLs from unstructured text. The library is designed with immutability, composability, and type safety as core principles.
Architecture
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│                          User Code                          │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                       UrlDetector API                       │
│           (Immutable, Type-Safe Scala Interface)            │
│                                                             │
│  • UrlDetector.extract(text: String)                        │
│  • withAllowed/withDenied host filtering                    │
│  • UrlDetectorOptions configuration                         │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                   URL Detection Pipeline                    │
│                                                             │
│  1. Preprocess special characters                           │
│  2. Detect URL candidates (LinkedIn detector)               │
│  3. Normalize URLs (encoding, protocols)                    │
│  4. Parse and validate URLs (scala-uri)                     │
│  5. Apply filtering (allowed/denied hosts)                  │
│  6. Validate structure (TLD, email rejection, etc.)         │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                      Set[AbsoluteUrl]                       │
│                  (Validated, Parsed URLs)                   │
└─────────────────────────────────────────────────────────────┘
Core Components
1. UrlDetector
The main entry point and orchestrator for URL extraction.
Responsibilities:
- Configure detection options
- Manage host filtering (allowed/denied lists)
- Coordinate the detection pipeline
- Ensure immutability through builder pattern
Key Design:
- Immutable: All modifications return new instances
- Composable: Methods can be chained fluently
- Thread-safe: Safe to share across threads
case class UrlDetector(
options: UrlDetectorOptions,
allowedHosts: Option[NonEmptySet[Host]] = None,
deniedHosts: Set[Host] = Set.empty
)
2. UrlDetectorOptions
A sealed trait representing detection modes optimized for different content types.
Design Pattern: Sum Type (Sealed Trait + Case Objects)
Benefits:
- Compile-time exhaustiveness checking
- Type-safe option selection
- Clear, discoverable API
Implementation:
sealed trait UrlDetectorOptions
object UrlDetectorOptions {
case object Default extends UrlDetectorOptions
case object QuoteMatch extends UrlDetectorOptions
case object SingleQuoteMatch extends UrlDetectorOptions
case object BracketMatch extends UrlDetectorOptions
case object Json extends UrlDetectorOptions
case object Javascript extends UrlDetectorOptions
case object Xml extends UrlDetectorOptions
case object Html extends UrlDetectorOptions
case object AllowSingleLevelDomain extends UrlDetectorOptions
}
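The exhaustiveness benefit can be illustrated with a small stand-alone sketch. The `Mode` type below is a trimmed copy of the option type (three of the nine cases) and `describe` is a hypothetical helper; neither is part of the library.

```scala
object ExhaustivenessSketch {
  // Trimmed, stand-alone copy of the option type; not the library's definition.
  sealed trait Mode
  object Mode {
    case object Default extends Mode
    case object Json extends Mode
    case object Html extends Mode
  }

  // Hypothetical helper. Because Mode is sealed, the compiler can warn
  // when a case is missing from this match.
  def describe(mode: Mode): String = mode match {
    case Mode.Default => "plain text"
    case Mode.Json    => "JSON documents"
    case Mode.Html    => "HTML markup"
  }
}
```

Removing one of the three cases from `describe` should produce a "match may not be exhaustive" warning at compile time.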
3. URL Representation
Uses scala-uri for type-safe URL representation.
Why scala-uri:
- Type-safe parsing and construction
- Rich API for URL manipulation
- Immutable data structures
- Well-tested and maintained
Key Types:
- AbsoluteUrl: Fully qualified URLs with scheme and host
- Host: Represents URL hosts with subdomain handling
- URL components accessible via methods (path, query, fragment, etc.)
Detection Pipeline
Step-by-Step Processing
1. Preprocessing
Before URL detection, the text is preprocessed to handle special characters:
// Handles URLs prefixed with special characters like #, @, !, $, ~, *
// Example: "#https://example.com" → "https://example.com"
private val SpecialCharPrefixPattern = "([@#!$~*]+)(.*)"
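As a rough illustration of how such a pattern could be applied, the sketch below strips the prefix from a single token using only the standard library. The `stripPrefix` helper is hypothetical, and whether the library applies the pattern per token or to the whole text is an assumption here.

```scala
object PreprocessSketch {
  // Same pattern as above, compiled once into a Regex.
  private val SpecialCharPrefix = "([@#!$~*]+)(.*)".r

  // Hypothetical helper: drops a run of leading special characters from one token.
  // A Regex extractor only matches when the whole input matches the pattern.
  def stripPrefix(token: String): String = token match {
    case SpecialCharPrefix(_, rest) => rest
    case _                          => token
  }
}
```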
2. Candidate Detection
Uses LinkedIn's URL Detector (Java library) to identify potential URLs:
- Leverages battle-tested URL detection logic
- Handles various URL formats and edge cases
- Language-agnostic detection algorithm
3. Normalization
Normalizes detected URLs:
Encoded Spaces:
// Converts %20 to spaces for easier processing
url.replace("%20", " ")
Protocol-Relative URLs:
// Converts //example.com → http://example.com
if (url.startsWith("//")) s"http:$url" else url
Default Scheme:
// Adds http:// if no scheme present
if (!hasScheme(url)) s"http://$url" else url
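Taken together, the three normalization rules might compose as in the following sketch. The `hasScheme` definition is an assumption (a scheme name followed by `://`), since its real implementation is not shown here, and the order of the rules is likewise a guess.

```scala
object NormalizeSketch {
  // Assumption: a scheme is a letter followed by letters, digits, '+', '.' or '-', then "://".
  private val SchemePrefix = "[A-Za-z][A-Za-z0-9+.-]*://".r

  def hasScheme(url: String): Boolean = SchemePrefix.findPrefixOf(url).isDefined

  def normalize(url: String): String = {
    val decoded = url.replace("%20", " ")              // encoded spaces
    if (decoded.startsWith("//")) s"http:$decoded"     // protocol-relative URL
    else if (!hasScheme(decoded)) s"http://$decoded"   // default scheme
    else decoded
  }
}
```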
4. Parsing and Validation
Uses scala-uri to parse URLs with error handling:
Try(AbsoluteUrl.parse(normalizedUrl)) match {
case Success(url) => Some(url)
case Failure(_) => None // Gracefully skip invalid URLs
}
Validation Checks:
- URL structure validity
- Host format validation
- Port number validation (if present)
- Path encoding validation
5. Host Filtering
Applies allowed/denied host filtering with intelligent subdomain matching:
Matching Rules:
- Apex domain matching: example.com matches www.example.com, api.example.com
- www is implicitly assumed
- Explicit subdomain matching available
- Denied hosts take precedence over allowed hosts
Implementation:
def matchesHost(url: AbsoluteUrl, host: Host): Boolean = {
val urlHost = url.host
urlHost == host ||
urlHost.toString.endsWith(s".${host}") ||
urlHost.toString == s"www.${host}"
}
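The precedence rule (denied over allowed) is not visible in `matchesHost` itself. A minimal sketch of how the two lists could be combined follows; it uses plain strings instead of the library's `Host` type, `isPermitted` is a hypothetical name, and treating a missing allow-list as "allow everything" is an assumption based on the `None` default.

```scala
object HostFilterSketch {
  // True when `candidate` equals `rule` or is a subdomain of it (www included).
  def matchesHost(candidate: String, rule: String): Boolean =
    candidate == rule || candidate.endsWith(s".$rule")

  // Denied hosts win over allowed hosts; None means no allow-list restriction.
  def isPermitted(host: String, allowed: Option[Set[String]], denied: Set[String]): Boolean = {
    val isDenied  = denied.exists(matchesHost(host, _))
    val isAllowed = allowed.forall(_.exists(matchesHost(host, _)))
    !isDenied && isAllowed
  }
}
```

Each URL is checked against every configured host, which is where the O(m × h) filtering cost comes from.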
6. Additional Validation
Email Rejection:
// Uses Apache Commons Validator
!EmailValidator.getInstance().isValid(urlString)
TLD Validation:
- Validates against public suffix list
- Ensures URLs have valid top-level domains
- Can be bypassed with the AllowSingleLevelDomain option
Userinfo Validation:
- URLs with user credentials must have explicit schemes
- Prevents user:pass@example.com from being detected (could be an email)
- Allows ftp://user:pass@example.com
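One way to read the userinfo rule: a candidate with `@` in its authority part is kept only when it starts with an explicit scheme. The sketch below encodes that reading with the standard library alone. The `acceptable` helper and the scheme pattern are assumptions, not the library's code.

```scala
object UserinfoSketch {
  // Assumption: an explicit scheme is a scheme name followed by "://".
  private val SchemePrefix = "[A-Za-z][A-Za-z0-9+.-]*://".r

  def acceptable(candidate: String): Boolean = {
    val schemeLength = SchemePrefix.findPrefixOf(candidate).map(_.length).getOrElse(0)
    // The authority is everything after the scheme up to the first '/'.
    val authority   = candidate.drop(schemeLength).takeWhile(_ != '/')
    val hasUserinfo = authority.contains('@')
    !hasUserinfo || schemeLength > 0
  }
}
```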
Special Character Cleanup:
// Removes leading/trailing special characters
private val SanitizeRegex = "^[@#!$,\\-`.~*/]+|[@#!$,\\-`.~*/]+$"
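To show the effect of this expression, the sketch below applies it with `replaceAllIn`. The `sanitize` helper name is hypothetical. Note that `/` is in the character class, so a trailing slash is removed as well.

```scala
object SanitizeSketch {
  // Same expression as above: a run of special characters at the start or at the end.
  private val SanitizeRegex = "^[@#!$,\\-`.~*/]+|[@#!$,\\-`.~*/]+$".r

  // Hypothetical helper: removes both runs and leaves the middle untouched.
  def sanitize(candidate: String): String = SanitizeRegex.replaceAllIn(candidate, "")
}
```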
Design Principles
1. Immutability
All data structures are immutable:
// Returns a NEW detector instance
def withAllowed(host: Host, hosts: Host*): UrlDetector = {
copy(allowedHosts = Some(NonEmptySet.of(host, hosts: _*)))
}
Benefits:
- Thread-safe by default
- Easier to reason about
- No defensive copying needed
- Referential transparency
2. Type Safety
Leverages Scala's type system:
// Sealed trait ensures all options are known at compile time
sealed trait UrlDetectorOptions
// Distinct types for different URL categories
trait AbsoluteUrl // Has scheme and host
trait RelativeUrl // Missing scheme or host
Benefits:
- Compile-time guarantees
- IDE autocomplete and navigation
- Refactoring safety
3. Composability
Fluent API design for easy composition:
// The chain ends in extract, so the value is the set of extracted URLs
val urls: Set[AbsoluteUrl] = UrlDetector(UrlDetectorOptions.Html)
.withAllowed(Host.parse("example.com"))
.withDenied(Host.parse("ads.example.com"))
.extract(htmlContent)
Benefits:
- Readable, declarative code
- Easy to build complex configurations
- Natural expression of intent
4. Fail-Safe
Graceful error handling throughout:
// Invalid URLs are filtered out, not thrown as exceptions
Try(AbsoluteUrl.parse(url)) match {
case Success(parsed) => Some(parsed)
case Failure(_) => None
}
Benefits:
- No unexpected exceptions during extraction
- Robust against malformed input
- Predictable behavior
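The same skip-on-failure pattern can be tried without any dependencies. The sketch below deliberately swaps in `java.net.URI` from the JDK for scala-uri so that it runs on its own; `parseAll` is a hypothetical helper, and `URI` accepts and rejects different inputs than `AbsoluteUrl.parse` does.

```scala
import java.net.URI
import scala.util.Try

object FailSafeSketch {
  // Parses each candidate; any candidate whose constructor throws is dropped.
  def parseAll(candidates: Seq[String]): Set[URI] =
    candidates.flatMap(candidate => Try(new URI(candidate)).toOption).toSet
}
```

A candidate such as "http://exa mple.com" makes the `URI` constructor throw, so it is left out of the result instead of aborting the whole extraction.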
5. Performance
Performance considerations:
- Lazy Initialization: Default detector created lazily
- Efficient Collections: Uses Set for O(1) lookups and deduplication
- Minimal Allocations: Reuses detector instances
- Regex Compilation: Patterns compiled once at class load
object UrlDetector {
lazy val default: UrlDetector = UrlDetector(UrlDetectorOptions.Default)
}
Dependencies
Core Dependencies
LinkedIn URL Detector (0.1.23)
- Provides core URL detection algorithm
- Java library with proven detection logic
- Handles complex URL patterns and edge cases
scala-uri (4.2.0)
- Type-safe URL parsing and manipulation
- Rich API for URL components
- Well-tested URL handling
Apache Commons Validator (1.10.1)
- Email validation
- Domain validation
- IP address validation
Cats (NonEmptySet)
- Type-safe non-empty collections
- Used for allowed hosts (ensures at least one host)
- Functional programming utilities
Cross-Compilation
Supports Scala 2.12, 2.13, and 3.x:
// build.sbt
crossScalaVersions := Seq("2.12.21", "2.13.18", "3.7.4")
Compatibility:
- Uses scala-collection-compat for cross-version compatibility
- Tested against all supported Scala versions
- Binary compatible within major versions
Testing Strategy
Comprehensive test coverage using ScalaTest:
class UrlDetectorSpec extends AnyWordSpec with Matchers {
// Tests for each detection option
// Tests for host filtering
// Tests for edge cases (IPv6, encoded URLs, etc.)
// Tests for error handling
}
Test Categories:
- Detection option behavior
- Host filtering logic
- URL normalization
- Edge cases and malformed input
- Integration scenarios
Extension Points
The library can be extended:
Custom Validation
def customExtract(text: String): Set[AbsoluteUrl] = {
val urls = UrlDetector.default.extract(text)
urls.filter(customValidation)
}
def customValidation(url: AbsoluteUrl): Boolean = {
// Your custom validation logic goes here; this placeholder accepts every URL
true
}
Custom Detectors
object CustomDetectors {
lazy val intranet: UrlDetector =
UrlDetector(UrlDetectorOptions.AllowSingleLevelDomain)
.withAllowed(Host.parse("corp"))
lazy val production: UrlDetector =
UrlDetector(UrlDetectorOptions.Html)
.withDenied(
Host.parse("localhost"),
Host.parse("127.0.0.1")
)
}
Future Considerations
Potential areas for enhancement:
- Async API: Support for asynchronous URL extraction
- Streaming: Process large documents as streams
- Custom Validators: Plugin architecture for validation rules
- Caching: Optional caching for repeated extractions
- Metrics: Built-in performance metrics and monitoring
- Custom TLD Lists: Allow custom public suffix lists
Performance Characteristics
Time Complexity:
- Detection: O(n) where n is text length
- Host filtering: O(m × h) where m is number of URLs, h is number of hosts
- Overall: O(n + m × h)
Space Complexity:
- O(m) where m is number of detected URLs
- Set deduplication: O(m) space for unique URLs
Optimization Tips:
- Reuse detector instances (avoid creating new ones)
- Apply host filters to reduce result set size
- Use specific detection options to reduce false positives
- For large texts, consider splitting the input at whitespace boundaries and processing the chunks in parallel
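The parallel-processing tip could look like the sketch below, which fans chunks out over `Future`s and merges the resulting sets. The `extract` parameter stands in for the `extract` method of a shared detector instance. The helper name, the 30-second timeout, and the use of the global execution context are all assumptions. Chunks must be split where no URL can be cut in half, for example at line breaks.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSketch {
  // Runs `extract` on every chunk concurrently and unions the results.
  def extractParallel[A](chunks: Seq[String])(extract: String => Set[A]): Set[A] = {
    val futures = chunks.map(chunk => Future(extract(chunk)))
    Await.result(Future.sequence(futures), 30.seconds).flatten.toSet
  }
}
```

Because detectors are immutable, one instance can be shared by all the futures without synchronization.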