Advanced Usage

Host Filtering with Subdomains

The URL detector provides intelligent subdomain handling when filtering hosts.

Subdomain Matching Rules

When you specify a host, the detector also matches its subdomains and treats a leading www as implicit:

import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.Host

// Specifying "example.com" will match:
// - example.com
// - www.example.com (www is implicit)
// - api.example.com
// - any.subdomain.example.com

val detector = UrlDetector.default.withAllowed(Host.parse("example.com"))
val urls = detector.extract("""
  https://example.com
  https://www.example.com
  https://api.example.com
  https://cdn.us-east.example.com
""")

// All URLs will be extracted

Denying Specific Subdomains

You can deny specific subdomains while allowing others:

val detector = UrlDetector.default
  .withAllowed(Host.parse("example.com"))
  .withDenied(Host.parse("ads.example.com"))

val text = """
  https://example.com
  https://api.example.com
  https://ads.example.com
  https://tracking.ads.example.com
"""

val urls = detector.extract(text)
// Returns: example.com, api.example.com
// Excludes: ads.example.com, tracking.ads.example.com

Explicit Subdomain Filtering

To match only a specific subdomain (not its children):

// This will match api.example.com and www.api.example.com
// but NOT v1.api.example.com
val detector = UrlDetector.default.withAllowed(Host.parse("api.example.com"))

Protocol-Relative URLs

The detector handles protocol-relative URLs (starting with //) by converting them to HTTP:

val detector = UrlDetector.default
val urls = detector.extract("Load //cdn.example.com/script.js")

// Returns: http://cdn.example.com/script.js

Missing Scheme Handling

URLs without schemes are automatically assigned the HTTP scheme:

val detector = UrlDetector.default
val urls = detector.extract("Visit example.com and www.github.com")

// Returns:
// - http://example.com
// - http://www.github.com
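
The scheme defaulting described for protocol-relative and scheme-less URLs can be pictured with a small standalone function. This is only a sketch of the documented behavior, not the library's implementation; `withDefaultScheme` is a hypothetical name and works on plain strings.

```scala
// Hypothetical helper illustrating the documented normalization; this is
// not the library's implementation.
def withDefaultScheme(raw: String): String =
  if (raw.startsWith("//")) "http:" + raw // protocol-relative: prepend "http:"
  else if (raw.matches("[A-Za-z][A-Za-z0-9+.-]*://.*")) raw // explicit scheme: keep as-is
  else "http://" + raw // no scheme: default to HTTP

withDefaultScheme("//cdn.example.com/script.js") // "http://cdn.example.com/script.js"
withDefaultScheme("example.com")                 // "http://example.com"
withDefaultScheme("https://example.com")         // "https://example.com"
```

If you need HTTPS rather than HTTP for such URLs, rewrite the scheme after extraction.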

IPv4 and IPv6 Support

The detector recognizes both IPv4 and IPv6 addresses:

val detector = UrlDetector.default

// IPv4
val ipv4Urls = detector.extract("API at http://192.168.1.1:8080/api")
// Returns: http://192.168.1.1:8080/api

// IPv6
val ipv6Urls = detector.extract("Server at http://[2001:db8::1]:8080/")
// Returns: http://[2001:db8::1]:8080/

URL Encoding and Special Characters

The detector handles URL-encoded characters, particularly encoded spaces:

val detector = UrlDetector.default
val urls = detector.extract("Download from https://example.com/my%20file.pdf")

// Returns: https://example.com/my%20file.pdf (percent-encoding preserved)

Email Address Filtering

The detector automatically filters out email addresses to avoid false positives:

val detector = UrlDetector.default
val text = "Contact us at support@example.com or visit https://example.com"

val urls = detector.extract(text)
// Returns only: https://example.com
// Excludes: support@example.com

Handling URLs with User Info

URLs containing user credentials require explicit schemes:

val detector = UrlDetector.default

// This will be extracted (has explicit scheme)
val withScheme = detector.extract("ftp://user:pass@ftp.example.com")
// Returns: ftp://user:pass@ftp.example.com

// This will be rejected (no scheme with user info)
val noScheme = detector.extract("user:pass@example.com")
// Returns: empty set (rejected as potentially an email variant)

Custom Detection Pipelines

You can create reusable detector configurations for different use cases:

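import io.lambdaworks.detection.{UrlDetector, UrlDetectorOptions}
import io.lemonlabs.uri.{AbsoluteUrl, Host}

object Detectors {
  // For public web scraping - deny loopback and unspecified addresses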
  lazy val publicWeb: UrlDetector = UrlDetector(UrlDetectorOptions.Html)
    .withDenied(
      Host.parse("localhost"),
      Host.parse("127.0.0.1"),
      Host.parse("0.0.0.0")
    )

  // For API response parsing
  lazy val apiResponses: UrlDetector = UrlDetector(UrlDetectorOptions.Json)

  // For development environments
  lazy val development: UrlDetector =
    UrlDetector(UrlDetectorOptions.AllowSingleLevelDomain)

  // For secure contexts - HTTPS only (filter applied post-extraction)
  def httpsOnly: UrlDetector = UrlDetector.default

  def extractHttpsOnly(text: String): Set[AbsoluteUrl] = {
    httpsOnly.extract(text).filter(_.schemeOption.contains("https"))
  }
}

// Usage
val webUrls = Detectors.publicWeb.extract(htmlContent)
val apiUrls = Detectors.apiResponses.extract(jsonResponse)
val localUrls = Detectors.development.extract(configFile)
val secureUrls = Detectors.extractHttpsOnly(userInput)

Processing Large Text

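For large documents, consider chunking the text and processing the chunks in parallel. Fixed-size chunks can cut a URL in two at a chunk boundary, so split on whitespace if every URL must be found intact: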

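import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl
import scala.concurrent.{ExecutionContext, Future}

def extractUrlsParallel(text: String, chunkSize: Int = 10000)
                       (implicit ec: ExecutionContext): Future[Set[AbsoluteUrl]] = {
  // Fixed-size chunks: a URL that straddles a chunk boundary may be split
  val chunks = text.grouped(chunkSize).toSeq
  val detector = UrlDetector.default

  // Extract from each chunk on the supplied execution context
  val futures = chunks.map { chunk =>
    Future {
      detector.extract(chunk)
    }
  }

  // Merge the per-chunk results into a single set
  Future.sequence(futures).map(_.flatten.toSet)
}

// Usage (import the global context here, not next to the definition,
// to avoid an ambiguous implicit inside the function)
import scala.concurrent.ExecutionContext.Implicits.global

val largeText: String = ??? // ... load large document
val urlsFuture = extractUrlsParallel(largeText)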
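
Fixed-size chunking, as in `text.grouped(chunkSize)`, can cut a URL in two when it straddles a boundary. One possible workaround is to cut only at whitespace. The helper below is a minimal sketch using only the standard library; `chunkOnWhitespace` is a hypothetical name, not part of the detector's API, and chunks may exceed `maxChunk` by the length of one token.

```scala
// Hypothetical helper, not part of the library: split text into chunks of
// roughly `maxChunk` characters, cutting only at whitespace so that no
// whitespace-free token (such as a URL) is split across two chunks.
def chunkOnWhitespace(text: String, maxChunk: Int = 10000): Vector[String] = {
  require(maxChunk > 0, "maxChunk must be positive")
  val chunks = Vector.newBuilder[String]
  var start = 0
  while (start < text.length) {
    var end = math.min(start + maxChunk, text.length)
    // If the cut would land inside a token, extend it to the next whitespace.
    while (end < text.length && !text.charAt(end).isWhitespace) end += 1
    chunks += text.substring(start, end)
    start = end
  }
  chunks.result()
}
```

To use it with the parallel extraction, replace `text.grouped(chunkSize).toSeq` with `chunkOnWhitespace(text, chunkSize)`. Concatenating the chunks always reproduces the original text.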

Validating Extracted URLs

You can add additional validation after extraction:

import io.lemonlabs.uri.AbsoluteUrl

def validateUrls(urls: Set[AbsoluteUrl]): Set[AbsoluteUrl] = {
  urls.filter { url =>
    // Allow URLs with no explicit port, or with port 80, 443, or 8080
    url.port.forall(p => p == 80 || p == 443 || p == 8080)
  }.filter { url =>
    // Only allow certain schemes
    url.schemeOption.exists(s => Set("http", "https").contains(s))
  }.filter { url =>
    // Exclude URLs with certain path patterns
    !url.path.toString().contains("/admin/")
  }
}

val detector = UrlDetector.default
val allUrls = detector.extract(text)
val validUrls = validateUrls(allUrls)

Extracting and Categorizing URLs

Group and categorize extracted URLs:

case class UrlCategory(
  apis: Set[AbsoluteUrl],
  assets: Set[AbsoluteUrl],
  pages: Set[AbsoluteUrl],
  other: Set[AbsoluteUrl]
)

def categorizeUrls(urls: Set[AbsoluteUrl]): UrlCategory = {
  val (apis, rest1) = urls.partition(_.path.toString().contains("/api/"))
  val (assets, rest2) = rest1.partition { url =>
    val path = url.path.toString().toLowerCase
    path.endsWith(".css") || path.endsWith(".js") ||
    path.endsWith(".png") || path.endsWith(".jpg")
  }
  val (pages, other) = rest2.partition { url =>
    val path = url.path.toString().toLowerCase
    path.isEmpty || path.endsWith(".html") || path.endsWith("/")
  }

  UrlCategory(apis, assets, pages, other)
}

// Usage
val detector = UrlDetector(UrlDetectorOptions.Html)
val urls = detector.extract(htmlContent)
val categorized = categorizeUrls(urls)

println(s"APIs: ${categorized.apis.size}")
println(s"Assets: ${categorized.assets.size}")
println(s"Pages: ${categorized.pages.size}")
println(s"Other: ${categorized.other.size}")

Integration with HTTP Clients

Use extracted URLs with HTTP clients:

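import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl
import scala.concurrent.{ExecutionContext, Future, blocking}
import sttp.client3._

def validateExtractedUrls(text: String)
                         (implicit ec: ExecutionContext): Future[Map[AbsoluteUrl, Boolean]] = {
  val detector = UrlDetector.default
  val urls = detector.extract(text)
  val backend = HttpURLConnectionBackend()

  // HttpURLConnectionBackend is synchronous, so run the requests inside a
  // Future (marked as blocking) instead of blocking the calling thread.
  Future {
    blocking {
      urls.map { url =>
        val request = basicRequest.get(uri"${url.toString}").response(asString)
        val response = request.send(backend)
        // Record whether the server answered with a 2xx status
        url -> response.code.isSuccess
      }.toMap
    }
  }
}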

Thread Safety

UrlDetector instances are immutable and thread-safe. You can safely share detector instances across threads:

object SharedDetectors {
  // Safe to use across multiple threads
  val default: UrlDetector = UrlDetector.default
  val html: UrlDetector = UrlDetector(UrlDetectorOptions.Html)
  val json: UrlDetector = UrlDetector(UrlDetectorOptions.Json)
}

// Safe concurrent usage
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

def processMultipleTexts(texts: Seq[String]): Future[Seq[Set[AbsoluteUrl]]] = {
  Future.traverse(texts) { text =>
    Future {
      SharedDetectors.default.extract(text)
    }
  }
}

Performance Optimization Tips

Reuse Detector Instances: Create detector instances once and reuse them
Choose Specific Options: Use the most specific detection option for your content type
Apply Host Filters: If you know you only need specific hosts, apply filters to reduce processing
Pre-filter Text: If possible, exclude large sections of text that definitely don't contain URLs
Batch Processing: Process multiple documents with the same detector instance

// Good: Reuse detector instance
val detector = UrlDetector(UrlDetectorOptions.Json)
val results = documents.map(doc => detector.extract(doc))

// Less efficient: Create new detector each time
val slowResults = documents.map { doc =>
  UrlDetector(UrlDetectorOptions.Json).extract(doc)
}
