Archive Retrieval

The cbintel.lazarus module provides access to historical web content from the Internet Archive and Common Crawl.

Overview

graph TB
    subgraph "Discovery"
        GAU[gau binary]
        SOURCES[Wayback, CommonCrawl]
    end

    subgraph "CDX API"
        IA[Internet Archive]
        CC[Common Crawl]
    end

    subgraph "Lazarus"
        DISCOVERY[URLDiscovery]
        CDX[CDXClient]
        ARCHIVE[ArchiveClient]
        TEMPORAL[TemporalAnalyzer]
    end

    GAU --> SOURCES
    DISCOVERY --> GAU
    CDX --> IA & CC
    ARCHIVE --> DISCOVERY & CDX
    TEMPORAL --> ARCHIVE
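
These pieces compose end to end: URLDiscovery finds historical URLs, CDXClient resolves them to concrete captures, and TemporalAnalyzer summarizes how the retrieved snapshots change over time. A minimal sketch using the APIs documented below:

from cbintel.lazarus import ArchiveClient, TemporalAnalyzer
from datetime import datetime

client = ArchiveClient()
analyzer = TemporalAnalyzer()

# Retrieve snapshots for a page, then summarize how it evolved
snapshots = await client.get_snapshots(
    "https://example.com",
    from_date=datetime(2020, 1, 1),
)
timeline = await analyzer.build_timeline(snapshots)
print(f"{len(timeline.snapshots)} snapshots from {timeline.start_date} to {timeline.end_date}")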

Module Structure

src/cbintel/lazarus/
├── __init__.py          # Public exports
├── cdx_client.py        # CDX API client
├── url_discovery.py     # gau wrapper
├── archive_client.py    # High-level client
└── temporal.py          # Temporal analysis
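
The classes used throughout this page are importable from the package root; a small sketch of the assumed re-exports (mapping names to the files above):

from cbintel.lazarus import (
    CDXClient,         # cdx_client.py
    URLDiscovery,      # url_discovery.py
    ArchiveClient,     # archive_client.py
    TemporalAnalyzer,  # temporal.py
)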

CDXClient

Query the Internet Archive and Common Crawl CDX APIs for historical snapshots.

Basic Usage

from cbintel.lazarus import CDXClient
from datetime import datetime

client = CDXClient(source="ia")  # "ia" or "cc"

# Query snapshots for a URL
records = await client.query(
    "https://example.com",
    from_date=datetime(2020, 1, 1),
    to_date=datetime(2024, 1, 1),
    limit=100,
)

for record in records:
    print(f"{record.timestamp}: {record.status}")
    content = await record.fetch()  # Get archived content

Query Options

records = await client.query(
    url="https://example.com",
    from_date=datetime(2020, 1, 1),
    to_date=datetime(2024, 1, 1),
    limit=100,
    filter_status=200,          # Only successful captures
    filter_mimetype="text/html", # Only HTML pages
    collapse="digest",          # Deduplicate by content
)

CDX Record

from dataclasses import dataclass
from datetime import datetime

@dataclass
class CDXRecord:
    url: str              # Original URL
    timestamp: datetime   # Capture timestamp
    status: int           # HTTP status code
    mimetype: str         # Content type
    digest: str           # Content hash
    length: int           # Content length
    source: str           # "ia" or "cc"

    async def fetch(self) -> str:
        """Fetch the archived content for this capture."""
        ...

Sources

Source   Description                         API
ia       Internet Archive Wayback Machine    web.archive.org
cc       Common Crawl                        index.commoncrawl.org
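
Both sources return the same record shape, so results can be merged and deduplicated by content digest. A sketch, assuming the query API shown above:

# Query both archives and keep one record per unique content hash
records = []
for source in ("ia", "cc"):
    client = CDXClient(source=source)
    records.extend(await client.query("https://example.com", limit=100))

unique = list({r.digest: r for r in records}.values())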

URLDiscovery

Discover historical URLs using the gau binary.

Basic Usage

from cbintel.lazarus import URLDiscovery

discovery = URLDiscovery()

# Discover URLs for a domain
result = await discovery.discover(
    "example.com",
    sources=["wayback", "commoncrawl"],
    limit=1000,
)

print(f"Found {result.total_urls} URLs")
for url in result.sample(50):  # Random sample
    print(url)

Discovery Sources

Source        Description
wayback       Internet Archive Wayback Machine
commoncrawl   Common Crawl index
otx           AlienVault OTX
urlscan       urlscan.io

Filter Options

result = await discovery.discover(
    "example.com",
    sources=["wayback"],
    limit=1000,
    blacklist=["tracking", "analytics"],  # Exclude patterns
    extensions=["html", "php", "aspx"],   # Only these extensions
)

ArchiveClient

High-level orchestration of discovery and retrieval.

Basic Usage

from cbintel.lazarus import ArchiveClient

client = ArchiveClient()

# Get snapshots of a URL
snapshots = await client.get_snapshots(
    "https://example.com/page",
    from_date=datetime(2020, 1, 1),
)

# Process entire domain
async for snapshot in client.process_domain("example.com"):
    print(f"{snapshot.timestamp}: {snapshot.url}")
    content = await snapshot.fetch()

Compare Versions

# Compare two versions of a page
diff = await client.compare_versions(
    "https://example.com/page",
    datetime(2020, 1, 1),
    datetime(2024, 1, 1),
)

print(f"Similarity: {diff.similarity:.2%}")
print(f"Additions: {diff.additions}")
print(f"Deletions: {diff.deletions}")

Bulk Retrieval

# Retrieve multiple URLs in parallel
urls = ["https://example.com/page1", "https://example.com/page2"]

async for result in client.bulk_retrieve(urls):
    print(f"{result.url}: {len(result.content)} bytes")

TemporalAnalyzer

Analyze content changes over time.

Build Timeline

from cbintel.lazarus import TemporalAnalyzer

analyzer = TemporalAnalyzer()

# Build timeline from snapshots
timeline = await analyzer.build_timeline(snapshots)

print(f"Timeline: {timeline.start_date} to {timeline.end_date}")
print(f"Total snapshots: {len(timeline.snapshots)}")

Detect Changes

# Detect significant changes
for change in timeline.changes:
    print(f"{change.date}: {change.change_type}")
    print(f"  Similarity: {change.similarity:.2%}")
    print(f"  Summary: {change.summary}")

Statistics

stats = analyzer.get_stats(snapshots)

print(f"Total snapshots: {stats.total_snapshots}")
print(f"Year distribution: {stats.year_distribution}")
print(f"Status distribution: {stats.status_distribution}")
print(f"Average interval: {stats.avg_interval_days} days")

Graph Operations

archive_discover Operation

- op: archive_discover
  params:
    domain: "example.com"
    sources: ["wayback", "commoncrawl"]
    limit: 1000
  output: urls

fetch_archive Operation

- op: fetch_archive
  input: urls
  params:
    date: "2020-01-01"
    fallback: true  # Try closest date if exact not found
  output: content

Example Pipeline

name: temporal_analysis
description: Track content changes over time

stages:
  - name: discover
    sequential:
      - op: archive_discover
        params:
          domain: "example.com"
          sources: ["wayback"]
        output: urls

  - name: retrieve
    parallel_foreach:
      input: urls
      operations:
        - op: fetch_archive
          params:
            date: "2020-01-01"
          output: content_2020
        - op: fetch_archive
          params:
            date: "2024-01-01"
          output: content_2024

  - name: analyze
    sequential:
      - op: diff
        input: [content_2020, content_2024]
        output: changes

CLI Commands

# Discover archived URLs
cbintel-lazarus discover example.com --limit 100

# Get snapshots
cbintel-lazarus snapshot https://example.com/page \
  --from 2020-01-01 --to 2024-01-01

# Generate timeline
cbintel-lazarus timeline https://example.com --output timeline.json

# Archive statistics
cbintel-lazarus stats example.com

Configuration

Environment Variables

# gau binary path
GAU_PATH=/usr/local/bin/gau

# CDX timeout
CDX_TIMEOUT=60.0

# Max concurrent requests
LAZARUS_CONCURRENCY=10
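
These are plain environment variables; a minimal sketch of how they might be resolved at runtime (defaults assumed, not taken from the library):

import os

GAU_PATH = os.environ.get("GAU_PATH", "/usr/local/bin/gau")
CDX_TIMEOUT = float(os.environ.get("CDX_TIMEOUT", "60.0"))
LAZARUS_CONCURRENCY = int(os.environ.get("LAZARUS_CONCURRENCY", "10"))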

Requirements

The gau binary is required for URL discovery:

# Install gau
go install github.com/lc/gau/v2/cmd/gau@latest

# Verify installation
gau --version
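
URLDiscovery shells out to gau, so the binary must be on PATH or pointed to via GAU_PATH. A quick programmatic check (a sketch, not part of the library):

import shutil
import subprocess

gau = shutil.which("gau")
if gau is None:
    raise RuntimeError("gau not found on PATH; install it or set GAU_PATH")
subprocess.run([gau, "--version"], check=True)  # Prints the installed version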

Error Handling

from cbintel.lazarus import (
    LazarusError,
    CDXError,
    DiscoveryError,
    ContentNotFoundError,
)

try:
    content = await record.fetch()
except ContentNotFoundError:
    print("Archived content not available")
except CDXError as e:
    print(f"CDX API error: {e}")
except LazarusError as e:
    print(f"Lazarus error: {e}")

Best Practices

Rate Limiting

import asyncio

async def fetch_with_delay(records, delay=1.0):
    contents = []
    for record in records:
        contents.append(await record.fetch())
        await asyncio.sleep(delay)  # Be nice to archive APIs
    return contents
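
For larger batches, bounding concurrency is often preferable to a fixed delay; a sketch using asyncio.Semaphore, with LAZARUS_CONCURRENCY as a natural cap:

async def fetch_bounded(records, concurrency=10):
    sem = asyncio.Semaphore(concurrency)

    async def fetch_one(record):
        async with sem:  # At most `concurrency` fetches in flight
            return await record.fetch()

    return await asyncio.gather(*(fetch_one(r) for r in records))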

Content Deduplication

# Use digest-based deduplication
records = await client.query(
    url,
    collapse="digest"  # Group by content hash
)

Error Recovery

async def fetch_with_retry(record, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await record.fetch()
        except ContentNotFoundError:
            raise  # Don't retry - the content doesn't exist
        except CDXError:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
    raise CDXError("Max retries exceeded")

Use Cases

Track Website Evolution

# Get all snapshots of a page
snapshots = await client.get_snapshots(
    "https://example.com/about",
    from_date=datetime(2010, 1, 1),
)

# Build timeline
timeline = await analyzer.build_timeline(snapshots)

# Find significant changes
for change in timeline.changes:
    if change.similarity < 0.8:  # Major change
        print(f"{change.date}: Major update")

Policy Change Detection

import httpx  # Any async HTTP client works; httpx is used here as an example

# Compare the live page against a historical capture
async with httpx.AsyncClient() as http:
    current = await http.get("https://example.com/policy")

# `archive` is assumed to be a helper that fetches the capture closest to `date`
historical = await archive.fetch(
    "https://example.com/policy",
    date=datetime(2020, 1, 1),
)

diff = await analyzer.compare(current.text, historical)
if diff.similarity < 0.8:  # Threshold for a significant change
    print("Policy has changed significantly")