Archive Retrieval¶
The cbintel.lazarus module provides access to historical web content from the Internet Archive and Common Crawl.
Overview¶
graph TB
    subgraph "Discovery"
        GAU[gau binary]
        SOURCES[Wayback, CommonCrawl]
    end
    subgraph "CDX API"
        IA[Internet Archive]
        CC[Common Crawl]
    end
    subgraph "Lazarus"
        DISCOVERY[URLDiscovery]
        CDX[CDXClient]
        ARCHIVE[ArchiveClient]
        TEMPORAL[TemporalAnalyzer]
    end
    GAU --> SOURCES
    DISCOVERY --> GAU
    CDX --> IA & CC
    ARCHIVE --> DISCOVERY & CDX
    TEMPORAL --> ARCHIVE
Module Structure¶
src/cbintel/lazarus/
├── __init__.py # Public exports
├── cdx_client.py # CDX API client
├── url_discovery.py # gau wrapper
├── archive_client.py # High-level client
└── temporal.py # Temporal analysis
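These classes map directly onto the flow above. As a rough end-to-end sketch, using only the calls documented in the sections below (error handling omitted):

import asyncio
from datetime import datetime
from cbintel.lazarus import URLDiscovery, ArchiveClient, TemporalAnalyzer

async def main():
    # 1. Discover historical URLs for the domain via gau
    discovery = URLDiscovery()
    discovered = await discovery.discover("example.com", sources=["wayback"], limit=100)
    print(f"Discovered {discovered.total_urls} URLs")

    # 2. Retrieve snapshots for one of the pages
    client = ArchiveClient()
    snapshots = await client.get_snapshots(
        "https://example.com",
        from_date=datetime(2020, 1, 1),
    )

    # 3. Build a timeline of content changes
    analyzer = TemporalAnalyzer()
    timeline = await analyzer.build_timeline(snapshots)
    print(f"Timeline spans {timeline.start_date} to {timeline.end_date}")

asyncio.run(main())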
CDXClient¶
Query the Internet Archive and Common Crawl CDX APIs for historical snapshots.
Basic Usage¶
from cbintel.lazarus import CDXClient
from datetime import datetime
client = CDXClient(source="ia") # "ia" or "cc"
# Query snapshots for a URL
records = await client.query(
    "https://example.com",
    from_date=datetime(2020, 1, 1),
    to_date=datetime(2024, 1, 1),
    limit=100,
)

for record in records:
    print(f"{record.timestamp}: {record.status}")
    content = await record.fetch()  # Get archived content
Query Options¶
records = await client.query(
    url="https://example.com",
    from_date=datetime(2020, 1, 1),
    to_date=datetime(2024, 1, 1),
    limit=100,
    filter_status=200,            # Only successful captures
    filter_mimetype="text/html",  # Only HTML pages
    collapse="digest",            # Deduplicate by content
)
CDX Record¶
@dataclass
class CDXRecord:
    url: str              # Original URL
    timestamp: datetime   # Capture timestamp
    status: int           # HTTP status code
    mimetype: str         # Content type
    digest: str           # Content hash
    length: int           # Content length
    source: str           # "ia" or "cc"

    async def fetch(self) -> str:
        """Fetch archived content."""
Sources¶
| Source | Description | API |
|---|---|---|
| `ia` | Internet Archive Wayback Machine | web.archive.org |
| `cc` | Common Crawl | index.commoncrawl.org |
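Each CDXClient is bound to a single source, so covering both archives means issuing one query per source and combining the results. A minimal sketch, using only the documented constructor and query signature:

ia_records = await CDXClient(source="ia").query("https://example.com", limit=100)
cc_records = await CDXClient(source="cc").query("https://example.com", limit=100)

# One chronological view across both archives
combined = sorted(ia_records + cc_records, key=lambda r: r.timestamp)
print(f"{len(combined)} captures across both sources")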
URLDiscovery¶
Discover historical URLs using the gau binary.
Basic Usage¶
from cbintel.lazarus import URLDiscovery
discovery = URLDiscovery()
# Discover URLs for a domain
result = await discovery.discover(
    "example.com",
    sources=["wayback", "commoncrawl"],
    limit=1000,
)

print(f"Found {result.total_urls} URLs")
for url in result.sample(50):  # Random sample
    print(url)
Discovery Sources¶
| Source | Description |
|---|---|
| `wayback` | Internet Archive Wayback Machine |
| `commoncrawl` | Common Crawl index |
| `otx` | AlienVault OTX |
| `urlscan` | urlscan.io |
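Passing several sources widens coverage at the cost of a longer run. For example, querying all four at once:

result = await discovery.discover(
    "example.com",
    sources=["wayback", "commoncrawl", "otx", "urlscan"],
    limit=5000,
)
print(f"Found {result.total_urls} URLs across all sources")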
Filter Options¶
result = await discovery.discover(
    "example.com",
    sources=["wayback"],
    limit=1000,
    blacklist=["tracking", "analytics"],  # Exclude patterns
    extensions=["html", "php", "aspx"],   # Only these extensions
)
ArchiveClient¶
High-level orchestration of discovery and retrieval.
Basic Usage¶
from cbintel.lazarus import ArchiveClient
client = ArchiveClient()
# Get snapshots of a URL
snapshots = await client.get_snapshots(
    "https://example.com/page",
    from_date=datetime(2020, 1, 1),
)

# Process entire domain
async for snapshot in client.process_domain("example.com"):
    print(f"{snapshot.timestamp}: {snapshot.url}")
    content = await snapshot.fetch()
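process_domain yields snapshots lazily; for domain-wide temporal analysis it is often convenient to collect them into a list first and hand that to the TemporalAnalyzer described below. A sketch using only the calls shown here:

# Collect every snapshot for the domain, then analyze it as one timeline
snapshots = [snapshot async for snapshot in client.process_domain("example.com")]
print(f"Collected {len(snapshots)} snapshots")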
Compare Versions¶
# Compare two versions of a page
diff = await client.compare_versions(
    "https://example.com/page",
    datetime(2020, 1, 1),
    datetime(2024, 1, 1),
)
print(f"Similarity: {diff.similarity:.2%}")
print(f"Additions: {diff.additions}")
print(f"Deletions: {diff.deletions}")
Bulk Retrieval¶
# Retrieve multiple URLs in parallel
urls = ["https://example.com/page1", "https://example.com/page2"]
async for result in client.bulk_retrieve(urls):
    print(f"{result.url}: {len(result.content)} bytes")
TemporalAnalyzer¶
Analyze content changes over time.
Build Timeline¶
from cbintel.lazarus import TemporalAnalyzer
analyzer = TemporalAnalyzer()
# Build timeline from snapshots
timeline = await analyzer.build_timeline(snapshots)
print(f"Timeline: {timeline.start_date} to {timeline.end_date}")
print(f"Total snapshots: {len(timeline.snapshots)}")
Detect Changes¶
# Detect significant changes
for change in timeline.changes:
    print(f"{change.date}: {change.change_type}")
    print(f"  Similarity: {change.similarity:.2%}")
    print(f"  Summary: {change.summary}")
Statistics¶
stats = analyzer.get_stats(snapshots)
print(f"Total snapshots: {stats.total_snapshots}")
print(f"Year distribution: {stats.year_distribution}")
print(f"Status distribution: {stats.status_distribution}")
print(f"Average interval: {stats.avg_interval_days} days")
Graph Operations¶
archive_discover Operation¶
- op: archive_discover
  params:
    domain: "example.com"
    sources: ["wayback", "commoncrawl"]
    limit: 1000
  output: urls
fetch_archive Operation¶
- op: fetch_archive
  input: urls
  params:
    date: "2020-01-01"
    fallback: true  # Try closest date if exact not found
  output: content
Example Pipeline¶
name: temporal_analysis
description: Track content changes over time

stages:
  - name: discover
    sequential:
      - op: archive_discover
        params:
          domain: "example.com"
          sources: ["wayback"]
        output: urls

  - name: retrieve
    parallel_foreach:
      input: urls
      operations:
        - op: fetch_archive
          params:
            date: "2020-01-01"
          output: content_2020
        - op: fetch_archive
          params:
            date: "2024-01-01"
          output: content_2024

  - name: analyze
    sequential:
      - op: diff
        input: [content_2020, content_2024]
        output: changes
CLI Commands¶
# Discover archived URLs
cbintel-lazarus discover example.com --limit 100
# Get snapshots
cbintel-lazarus snapshot https://example.com/page \
    --from 2020-01-01 --to 2024-01-01
# Generate timeline
cbintel-lazarus timeline https://example.com --output timeline.json
# Archive statistics
cbintel-lazarus stats example.com
Configuration¶
Environment Variables¶
# gau binary path
GAU_PATH=/usr/local/bin/gau
# CDX timeout
CDX_TIMEOUT=60.0
# Max concurrent requests
LAZARUS_CONCURRENCY=10
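Assuming these are read from the process environment when clients are constructed, they can also be set programmatically, for example in a test harness:

import os

# Configure Lazarus before creating any clients
os.environ.setdefault("GAU_PATH", "/usr/local/bin/gau")
os.environ.setdefault("CDX_TIMEOUT", "30.0")
os.environ.setdefault("LAZARUS_CONCURRENCY", "5")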
Requirements¶
The gau binary is required for URL discovery:
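It is typically installed via the Go toolchain with `go install github.com/lc/gau/v2/cmd/gau@latest`, or downloaded from the gau GitHub releases page. If the binary is not on PATH, point GAU_PATH at its location.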
Error Handling¶
from cbintel.lazarus import (
    LazarusError,
    CDXError,
    DiscoveryError,
    ContentNotFoundError,
)

try:
    content = await record.fetch()
except ContentNotFoundError:
    print("Archived content not available")
except CDXError as e:
    print(f"CDX API error: {e}")
except LazarusError as e:
    print(f"Lazarus error: {e}")
Best Practices¶
Rate Limiting¶
import asyncio
async def fetch_with_delay(records, delay=1.0):
for record in records:
content = await record.fetch()
await asyncio.sleep(delay) # Be nice to archive APIs
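Where some parallelism is still wanted, an asyncio.Semaphore bounds the number of in-flight fetches instead of serializing them entirely. A sketch that mirrors the LAZARUS_CONCURRENCY setting:

import asyncio

async def fetch_bounded(records, max_concurrency=10):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(record):
        async with semaphore:  # at most max_concurrency fetches in flight
            return await record.fetch()

    return await asyncio.gather(*(fetch_one(r) for r in records))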
Content Deduplication¶
# Use digest-based deduplication
records = await client.query(
    url,
    collapse="digest",  # Group by content hash
)
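If collapse was not applied at query time, the same deduplication can be done client-side before fetching, using the digest field on CDXRecord:

# Fetch each distinct piece of content only once
seen_digests = set()
contents = []
for record in records:
    if record.digest in seen_digests:
        continue
    seen_digests.add(record.digest)
    contents.append(await record.fetch())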
Error Recovery¶
async def fetch_with_retry(record, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await record.fetch()
        except ContentNotFoundError:
            raise  # Don't retry - content doesn't exist
        except CDXError:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
    raise CDXError("Max retries exceeded")
Use Cases¶
Track Website Evolution¶
# Get all snapshots of a page
snapshots = await client.get_snapshots(
    "https://example.com/about",
    from_date=datetime(2010, 1, 1),
)

# Build timeline
timeline = await analyzer.build_timeline(snapshots)

# Find significant changes
for change in timeline.changes:
    if change.similarity < 0.8:  # Major change
        print(f"{change.date}: Major update")
Policy Change Detection¶
# Compare current vs historical
current = await http.get("https://example.com/policy")
historical = await archive.fetch(
    "https://example.com/policy",
    date=datetime(2020, 1, 1),
)

diff = await analyzer.compare(current.text, historical)
if diff.similarity < 0.8:  # treat <80% similarity as a significant change
    print("Policy has changed significantly")