FAQ & Troubleshooting

Frequently Asked Questions

General Questions

Q: What is Scala URL Detector?

A: Scala URL Detector is a library that extracts URLs from unstructured text. It's built on LinkedIn's URL Detector with a functional, type-safe Scala API using scala-uri for URL representation.
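
A minimal usage sketch, based on the API shown throughout this page:

import io.lambdaworks.detection.UrlDetector

val detector = UrlDetector.default
val urls = detector.extract("Docs live at https://example.com/docs")
// urls: Set[io.lemonlabs.uri.AbsoluteUrl]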


Q: Which Scala versions are supported?

A: The library supports Scala 2.12.x, 2.13.x, and 3.x with full cross-compilation.
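
A sketch of the corresponding build.sbt entries (the version strings here are placeholders; the cross versions are illustrative):

// build.sbt
libraryDependencies += "io.lambdaworks" %% "scurl-detector" % "version"

// For a cross-built project:
crossScalaVersions := Seq("2.12.18", "2.13.12", "3.3.1")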


Q: Is this library thread-safe?

A: Yes, all components are immutable and thread-safe. You can safely share UrlDetector instances across multiple threads.
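
For example, one shared instance can safely serve concurrent extraction (a sketch; documents is a hypothetical List[String]):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl

val detector = UrlDetector.default // one immutable instance, shared freely

def extractAll(documents: List[String]): Future[List[Set[AbsoluteUrl]]] =
  Future.traverse(documents)(doc => Future(detector.extract(doc)))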


Q: What license is this library under?

A: Apache License 2.0


Usage Questions

Q: Why are some URLs not being detected?

A: Possible reasons:

  1. Invalid URL format: The URL doesn't conform to valid URL syntax

    // Won't be detected - invalid format
    val urls = detector.extract("Invalid: ht!tp://bad")
  2. Email addresses: Emails are automatically filtered out

    // Won't be detected - looks like email
    val urls = detector.extract("Contact: user@example.com")
  3. Missing TLD: URLs without valid top-level domains are rejected

    // Won't be detected - no valid TLD
    val urls = UrlDetector.default.extract("http://notatld")

    // Solution: Use AllowSingleLevelDomain
    val urls = UrlDetector(UrlDetectorOptions.AllowSingleLevelDomain)
      .extract("http://localhost")
  4. Host filtering: The URL is excluded by withAllowed or withDenied

    val detector = UrlDetector.default.withAllowed(Host.parse("example.com"))
    // Won't detect other.com
    val urls = detector.extract("Visit other.com")

Q: Why is my URL extracted with http:// instead of https://?

A: URLs without an explicit scheme are assigned http:// by default. Include the scheme explicitly when you control the input:

val detector = UrlDetector.default

// Without scheme - gets http://
detector.extract("example.com") // Returns: http://example.com

// With scheme - preserved
detector.extract("https://example.com") // Returns: https://example.com

Post-filtering solution:

val allUrls = detector.extract(text)
val httpsOnly = allUrls.filter(_.schemeOption.contains("https"))

Q: How do I detect localhost URLs?

A: Use the AllowSingleLevelDomain option:

import io.lambdaworks.detection.{UrlDetector, UrlDetectorOptions}

val detector = UrlDetector(UrlDetectorOptions.AllowSingleLevelDomain)
val urls = detector.extract("Development: http://localhost:8080")
// Successfully detects: http://localhost:8080

Q: Can I extract URLs from multiple formats in one pass?

A: No, each detector uses a single detection mode. For multiple formats, create separate detectors or process in multiple passes:

val htmlUrls = UrlDetector(UrlDetectorOptions.Html).extract(text)
val jsonUrls = UrlDetector(UrlDetectorOptions.Json).extract(text)
val allUrls = htmlUrls ++ jsonUrls

Q: How do I handle relative URLs?

A: The library only extracts absolute URLs. Relative URLs are not detected. If you need to handle relative URLs, you'll need to convert them to absolute URLs first:

import io.lemonlabs.uri.{AbsoluteUrl, RelativeUrl}

val baseUrl = AbsoluteUrl.parse("https://example.com")
val relativeUrl = RelativeUrl.parse("/api/v1")
val absoluteUrl = baseUrl.resolve(relativeUrl)

Q: Why are my bracket/quote characters not being stripped?

A: You need to use the appropriate detection option:

// Wrong - Default doesn't handle brackets
val detector1 = UrlDetector.default
detector1.extract("(https://example.com)") // Includes parentheses

// Correct - Use BracketMatch
val detector2 = UrlDetector(UrlDetectorOptions.BracketMatch)
detector2.extract("(https://example.com)") // https://example.com

// For quotes - use QuoteMatch or SingleQuoteMatch
val detector3 = UrlDetector(UrlDetectorOptions.QuoteMatch)
detector3.extract("\"https://example.com\"") // https://example.com

Host Filtering Questions

Q: How does subdomain matching work?

A: When you specify a host, all subdomains are automatically matched:

import io.lemonlabs.uri.Host

val detector = UrlDetector.default.withAllowed(Host.parse("example.com"))

// All of these match:
// - example.com
// - www.example.com
// - api.example.com
// - any.subdomain.example.com

Q: How do I allow only the apex domain (no subdomains)?

A: The library doesn't support apex-only matching. As a workaround, post-filter the results:

val detector = UrlDetector.default.withAllowed(Host.parse("example.com"))
val urls = detector.extract(text)

// Filter to apex domain only
val apexOnly = urls.filter { url =>
  val host = url.host.toString
  host == "example.com" || host == "www.example.com"
}

Q: What happens when a URL matches both allowed and denied hosts?

A: Denied hosts take precedence:

val detector = UrlDetector.default
  .withAllowed(Host.parse("example.com"))
  .withDenied(Host.parse("ads.example.com"))

detector.extract("Visit ads.example.com")
// Returns: empty Set (denied takes precedence)

Q: Can I use wildcards in host filtering?

A: No direct wildcard support, but subdomain matching provides similar functionality. For more complex patterns, use post-filtering:

val urls = UrlDetector.default.extract(text)

// Custom filtering with regex
val pattern = ".*\\.cdn\\.example\\.com".r
val cdnUrls = urls.filter { url =>
  pattern.matches(url.host.toString)
}

Performance Questions

Q: Is it better to create one detector or multiple?

A: Reuse detector instances when possible:

// Good - reuse instance
val detector = UrlDetector(UrlDetectorOptions.Html)
val results = documents.map(doc => detector.extract(doc))

// Less efficient - creates new instance each time
val results = documents.map { doc =>
  UrlDetector(UrlDetectorOptions.Html).extract(doc)
}

Q: How do I process large documents efficiently?

A: Consider chunking and parallel processing:

import scala.concurrent.{Future, ExecutionContext}
import scala.concurrent.ExecutionContext.Implicits.global
import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl

def extractParallel(largeText: String, chunkSize: Int = 10000): Future[Set[AbsoluteUrl]] = {
  val chunks = largeText.grouped(chunkSize).toSeq
  val detector = UrlDetector.default

  val futures = chunks.map(chunk => Future(detector.extract(chunk)))
  Future.sequence(futures).map(_.flatten.toSet)
}

Q: Does host filtering impact performance?

A: Host filtering is applied after URL extraction, so it has minimal impact. The overhead is O(m × h) where m is the number of detected URLs and h is the number of filter hosts.
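
In other words, withAllowed behaves roughly like a manual post-filter over the extracted set. An illustrative sketch (not the library's actual implementation):

val urls = UrlDetector.default.extract(text)
// Approximates withAllowed(Host.parse("example.com")), including subdomain matching
val kept = urls.filter { url =>
  val host = url.host.toString
  host == "example.com" || host.endsWith(".example.com")
}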


Integration Questions

Q: How do I use this with Cats Effect or ZIO?

A: The library is synchronous, but calls can easily be lifted into an effect type:

Cats Effect:

import cats.effect.IO
import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl

def extractIO(text: String): IO[Set[AbsoluteUrl]] =
  IO(UrlDetector.default.extract(text))

ZIO:

import zio._
import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl

def extractZIO(text: String): Task[Set[AbsoluteUrl]] =
  ZIO.attempt(UrlDetector.default.extract(text))

Q: Can I use this with Akka Streams or FS2?

A: Yes, you can integrate it into stream processing:

FS2:

import fs2.Stream
import cats.effect.IO
import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl

val detector = UrlDetector.default

val pipeline: Stream[IO, Set[AbsoluteUrl]] =
  Stream("text1", "text2", "text3")
    .evalMap(text => IO(detector.extract(text)))

Akka Streams:

import akka.NotUsed
import akka.stream.scaladsl._
import io.lambdaworks.detection.UrlDetector
import io.lemonlabs.uri.AbsoluteUrl

val detector = UrlDetector.default

val flow: Flow[String, Set[AbsoluteUrl], NotUsed] =
  Flow[String].map(text => detector.extract(text))

Troubleshooting

Issue: NoClassDefFoundError or ClassNotFoundException

Cause: Missing dependencies on the classpath

Solution: Ensure all dependencies are included:

// build.sbt
libraryDependencies ++= Seq(
  "io.lambdaworks" %% "scurl-detector" % "version"
)

For manual classpath setup, ensure these are included:

  • scurl-detector
  • scala-uri
  • url-detector (LinkedIn's Java library)
  • commons-validator
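
As a sketch, the equivalent explicit sbt dependency list might look like this (group IDs are the commonly published coordinates; the x.y.z versions are placeholders, so take the real ones from your dependency tree):

libraryDependencies ++= Seq(
  "io.lambdaworks"    %% "scurl-detector"    % "version",
  "io.lemonlabs"      %% "scala-uri"         % "x.y.z",
  "com.linkedin.urls"  % "url-detector"      % "x.y.z",
  "commons-validator"  % "commons-validator" % "x.y.z"
)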

Issue: URLs in JSON not detected correctly

Cause: Using wrong detection option

Solution: Use UrlDetectorOptions.Json:

// Wrong
val urls = UrlDetector.default.extract(jsonString)

// Correct
val urls = UrlDetector(UrlDetectorOptions.Json).extract(jsonString)

Issue: Too many false positives

Solutions:

  1. Use appropriate detection option for your content type
  2. Apply host filtering to limit results
  3. Post-filter results with custom validation:
def isValidUrl(url: AbsoluteUrl): Boolean = {
  val validSchemes = Set("http", "https")
  val hasValidScheme = url.schemeOption.exists(validSchemes.contains)
  val hasStandardPort = url.port.forall(p => p == 80 || p == 443 || p == 8080)

  hasValidScheme && hasStandardPort
}

val urls = detector.extract(text).filter(isValidUrl)

Issue: IPv6 URLs not detected

Cause: IPv6 URLs need proper bracket notation

Solution: Ensure IPv6 addresses are properly formatted:

// Correct format
val urls = detector.extract("http://[2001:db8::1]:8080/")

// Incorrect format (won't be detected)
val urls = detector.extract("http://2001:db8::1:8080/")

Issue: Internal/intranet URLs not detected

Cause: Single-level domains are rejected by default

Solution: Use AllowSingleLevelDomain:

val detector = UrlDetector(UrlDetectorOptions.AllowSingleLevelDomain)
val urls = detector.extract("http://intranet/docs")

Issue: Memory usage is high with large documents

Solutions:

  1. Process in chunks:
val chunkSize = 10000
val chunks = largeText.grouped(chunkSize)
val allUrls = chunks.flatMap(chunk => detector.extract(chunk)).toSet
  2. Stream processing:
import scala.io.Source
val urls = Source.fromFile("large.txt")
  .getLines()
  .flatMap(line => detector.extract(line))
  .toSet
  3. Limit result size with filtering (see the sketch below)
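
A minimal sketch of option 3, keeping only a bounded number of results (maxUrls is an illustrative cap):

val maxUrls = 1000
val urls = detector.extract(largeText).take(maxUrls)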

Issue: Getting java.net.MalformedURLException

Cause: This shouldn't happen - the library catches parsing exceptions

If you see this:

  1. Check if you're directly using scala-uri's parse methods
  2. Ensure you're using the library's extract method
  3. Report it as a bug if it's coming from extract

// Safe - exceptions are caught internally
val urls = detector.extract(text)

// Unsafe - can throw exceptions
val url = AbsoluteUrl.parse(possiblyInvalidUrl)

// Safe alternative
import scala.util.Try
Try(AbsoluteUrl.parse(possiblyInvalidUrl)).toOption

Common Patterns

Pattern: Validating User Input

import io.lambdaworks.detection.{UrlDetector, UrlDetectorOptions}
import io.lemonlabs.uri.{AbsoluteUrl, Host}

def extractSafeUrls(userInput: String): Set[AbsoluteUrl] = {
  val detector = UrlDetector(UrlDetectorOptions.Default)
    .withDenied(
      Host.parse("localhost"),
      Host.parse("127.0.0.1"),
      Host.parse("0.0.0.0")
    )

  detector.extract(userInput)
    .filter(_.schemeOption.exists(s => Set("http", "https").contains(s)))
}

Pattern: Extracting APIs from Documentation

def extractApiUrls(docs: String): Set[AbsoluteUrl] = {
  UrlDetector(UrlDetectorOptions.BracketMatch)
    .extract(docs)
    .filter(_.path.toString().contains("/api/"))
}

Pattern: Building a URL Allow List

val allowedDomains = Set("example.com", "api.example.com", "cdn.example.com")

val detector = allowedDomains.foldLeft(UrlDetector.default) { (det, domain) =>
  det.withAllowed(Host.parse(domain))
}

Getting Help

If you can't find the answer here:

  1. Check the examples documentation
  2. Review the API reference
  3. Search GitHub Issues
  4. Open a new issue with:
    • Scala version
    • Library version
    • Minimal reproducible example
    • Expected vs actual behavior

Reporting Bugs

When reporting bugs, please include:

  1. Scala version: (e.g., 2.13.12)
  2. Library version: (e.g., scurl-detector 1.0.0)
  3. Minimal code example:
val detector = UrlDetector.default
val result = detector.extract("your test input")
println(result) // What you got
// Expected: what you expected
  4. Error messages (if any)
  5. Environment details (OS, JVM version)