Archive Retrieval¶
The cbintel.lazarus module provides access to historical web content from the Internet Archive and Common Crawl.
Overview¶
graph TB
    subgraph "Discovery"
        GAU[gau binary]
        SOURCES[Wayback, CommonCrawl]
    end
    subgraph "CDX API"
        IA[Internet Archive]
        CC[Common Crawl]
    end
    subgraph "Lazarus"
        DISCOVERY[URLDiscovery]
        CDX[CDXClient]
        ARCHIVE[ArchiveClient]
        TEMPORAL[TemporalAnalyzer]
    end
    GAU --> SOURCES
    DISCOVERY --> GAU
    CDX --> IA & CC
    ARCHIVE --> DISCOVERY & CDX
    TEMPORAL --> ARCHIVE
Module Structure¶
src/cbintel/lazarus/
├── __init__.py # Public exports
├── cdx_client.py # CDX API client
├── url_discovery.py # gau wrapper
├── archive_client.py # High-level client
└── temporal.py # Temporal analysis
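These classes map directly onto the flow above. As a rough end-to-end sketch, using only the calls documented in the sections below (error handling omitted):

import asyncio
from datetime import datetime
from cbintel.lazarus import URLDiscovery, ArchiveClient, TemporalAnalyzer

async def main():
    # 1. Discover historical URLs for the domain via gau
    discovery = URLDiscovery()
    discovered = await discovery.discover("example.com", sources=["wayback"], limit=100)
    print(f"Discovered {discovered.total_urls} URLs")

    # 2. Retrieve snapshots for one of the pages
    client = ArchiveClient()
    snapshots = await client.get_snapshots(
        "https://example.com",
        from_date=datetime(2020, 1, 1),
    )

    # 3. Build a timeline of content changes
    analyzer = TemporalAnalyzer()
    timeline = await analyzer.build_timeline(snapshots)
    print(f"Timeline spans {timeline.start_date} to {timeline.end_date}")

asyncio.run(main())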
CDXClient¶
Query the Internet Archive and Common Crawl CDX APIs for historical snapshots.
Basic Usage¶
from cbintel.lazarus import CDXClient
from datetime import datetime
client = CDXClient(source="ia") # "ia" or "cc"
# Query snapshots for a URL
records = await client.query(
    "https://example.com",
    from_date=datetime(2020, 1, 1),
    to_date=datetime(2024, 1, 1),
    limit=100,
)

for record in records:
    print(f"{record.timestamp}: {record.status}")
    content = await record.fetch()  # Get archived content
Query Options¶
records = await client.query(
    url="https://example.com",
    from_date=datetime(2020, 1, 1),
    to_date=datetime(2024, 1, 1),
    limit=100,
    filter_status=200,            # Only successful captures
    filter_mimetype="text/html",  # Only HTML pages
    collapse="digest",            # Deduplicate by content
)
CDX Record¶
@dataclass
class CDXRecord:
    url: str              # Original URL
    timestamp: datetime   # Capture timestamp
    status: int           # HTTP status code
    mimetype: str         # Content type
    digest: str           # Content hash
    length: int           # Content length
    source: str           # "ia" or "cc"

    async def fetch(self) -> str:
        """Fetch archived content."""
Sources¶
| Source | Description | API |
|---|---|---|
| `ia` | Internet Archive Wayback Machine | web.archive.org |
| `cc` | Common Crawl | index.commoncrawl.org |
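Each CDXClient is bound to a single source, so covering both archives means issuing one query per source and combining the results. A minimal sketch, using only the documented constructor and query signature:

ia_records = await CDXClient(source="ia").query("https://example.com", limit=100)
cc_records = await CDXClient(source="cc").query("https://example.com", limit=100)

# One chronological view across both archives
combined = sorted(ia_records + cc_records, key=lambda r: r.timestamp)
print(f"{len(combined)} captures across both sources")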
URLDiscovery¶
Discover historical URLs using the gau binary.
Basic Usage¶
from cbintel.lazarus import URLDiscovery
discovery = URLDiscovery()
# Discover URLs for a domain
result = await discovery.discover(
    "example.com",
    sources=["wayback", "commoncrawl"],
    limit=1000,
)

print(f"Found {result.total_urls} URLs")
for url in result.sample(50):  # Random sample
    print(url)
Discovery Sources¶
| Source | Description |
|---|---|
| `wayback` | Internet Archive Wayback Machine |
| `commoncrawl` | Common Crawl index |
| `otx` | AlienVault OTX |
| `urlscan` | urlscan.io |
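Passing several sources widens coverage at the cost of a longer run. For example, querying all four at once:

result = await discovery.discover(
    "example.com",
    sources=["wayback", "commoncrawl", "otx", "urlscan"],
    limit=5000,
)
print(f"Found {result.total_urls} URLs across all sources")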
Filter Options¶
result = await discovery.discover(
    "example.com",
    sources=["wayback"],
    limit=1000,
    blacklist=["tracking", "analytics"],  # Exclude patterns
    extensions=["html", "php", "aspx"],   # Only these extensions
)
ArchiveClient¶
High-level orchestration of discovery and retrieval.
Basic Usage¶
from cbintel.lazarus import ArchiveClient
client = ArchiveClient()
# Get snapshots of a URL
snapshots = await client.get_snapshots(
    "https://example.com/page",
    from_date=datetime(2020, 1, 1),
)

# Process entire domain
async for snapshot in client.process_domain("example.com"):
    print(f"{snapshot.timestamp}: {snapshot.url}")
    content = await snapshot.fetch()
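process_domain yields snapshots lazily; for domain-wide temporal analysis it is often convenient to collect them into a list first and hand that to the TemporalAnalyzer described below. A sketch using only the calls shown here:

# Collect every snapshot for the domain, then analyze it as one timeline
snapshots = [snapshot async for snapshot in client.process_domain("example.com")]
print(f"Collected {len(snapshots)} snapshots")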
Compare Versions¶
# Compare two versions of a page
diff = await client.compare_versions(
    "https://example.com/page",
    datetime(2020, 1, 1),
    datetime(2024, 1, 1),
)
print(f"Similarity: {diff.similarity:.2%}")
print(f"Additions: {diff.additions}")
print(f"Deletions: {diff.deletions}")
Bulk Retrieval¶
# Retrieve multiple URLs in parallel
urls = ["https://example.com/page1", "https://example.com/page2"]
async for result in client.bulk_retrieve(urls):
    print(f"{result.url}: {len(result.content)} bytes")
TemporalAnalyzer¶
Analyze content changes over time.
Build Timeline¶
from cbintel.lazarus import TemporalAnalyzer
analyzer = TemporalAnalyzer()
# Build timeline from snapshots
timeline = await analyzer.build_timeline(snapshots)
print(f"Timeline: {timeline.start_date} to {timeline.end_date}")
print(f"Total snapshots: {len(timeline.snapshots)}")
Detect Changes¶
# Detect significant changes
for change in timeline.changes:
    print(f"{change.date}: {change.change_type}")
    print(f"  Similarity: {change.similarity:.2%}")
    print(f"  Summary: {change.summary}")
Statistics¶
stats = analyzer.get_stats(snapshots)
print(f"Total snapshots: {stats.total_snapshots}")
print(f"Year distribution: {stats.year_distribution}")
print(f"Status distribution: {stats.status_distribution}")
print(f"Average interval: {stats.avg_interval_days} days")
Graph Operations¶
archive_discover Operation¶
- op: archive_discover
  params:
    domain: "example.com"
    sources: ["wayback", "commoncrawl"]
    limit: 1000
  output: urls
fetch_archive Operation¶
- op: fetch_archive
  input: urls
  params:
    date: "2020-01-01"
    fallback: true  # Try closest date if exact not found
  output: content
Example Pipeline¶
name: temporal_analysis
description: Track content changes over time

stages:
  - name: discover
    sequential:
      - op: archive_discover
        params:
          domain: "example.com"
          sources: ["wayback"]
        output: urls

  - name: retrieve
    parallel_foreach:
      input: urls
      operations:
        - op: fetch_archive
          params:
            date: "2020-01-01"
          output: content_2020
        - op: fetch_archive
          params:
            date: "2024-01-01"
          output: content_2024

  - name: analyze
    sequential:
      - op: diff
        input: [content_2020, content_2024]
        output: changes
CLI Commands¶
# Discover archived URLs
cbintel-lazarus discover example.com --limit 100
# Get snapshots
cbintel-lazarus snapshot https://example.com/page \
    --from 2020-01-01 --to 2024-01-01
# Generate timeline
cbintel-lazarus timeline https://example.com --output timeline.json
# Archive statistics
cbintel-lazarus stats example.com
Configuration¶
Environment Variables¶
# gau binary path
GAU_PATH=/usr/local/bin/gau
# CDX timeout
CDX_TIMEOUT=60.0
# Max concurrent requests
LAZARUS_CONCURRENCY=10
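Assuming these are read from the process environment when clients are constructed, they can also be set programmatically, for example in a test harness:

import os

# Configure Lazarus before creating any clients
os.environ.setdefault("GAU_PATH", "/usr/local/bin/gau")
os.environ.setdefault("CDX_TIMEOUT", "30.0")
os.environ.setdefault("LAZARUS_CONCURRENCY", "5")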
Requirements¶
The gau binary is required for URL discovery:
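It is typically installed via the Go toolchain with `go install github.com/lc/gau/v2/cmd/gau@latest`, or downloaded from the gau GitHub releases page. If the binary is not on PATH, point GAU_PATH at its location.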
Error Handling¶
from cbintel.lazarus import (
    LazarusError,
    CDXError,
    DiscoveryError,
    ContentNotFoundError,
)

try:
    content = await record.fetch()
except ContentNotFoundError:
    print("Archived content not available")
except CDXError as e:
    print(f"CDX API error: {e}")
except LazarusError as e:
    print(f"Lazarus error: {e}")
Best Practices¶
Rate Limiting¶
import asyncio
async def fetch_with_delay(records, delay=1.0):
for record in records:
content = await record.fetch()
await asyncio.sleep(delay) # Be nice to archive APIs
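Where some parallelism is still wanted, an asyncio.Semaphore bounds the number of in-flight fetches instead of serializing them entirely. A sketch that mirrors the LAZARUS_CONCURRENCY setting:

import asyncio

async def fetch_bounded(records, max_concurrency=10):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(record):
        async with semaphore:  # at most max_concurrency fetches in flight
            return await record.fetch()

    return await asyncio.gather(*(fetch_one(r) for r in records))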
Content Deduplication¶
# Use digest-based deduplication
records = await client.query(
    url,
    collapse="digest",  # Group by content hash
)
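If collapse was not applied at query time, the same deduplication can be done client-side before fetching, using the digest field on CDXRecord:

# Fetch each distinct piece of content only once
seen_digests = set()
contents = []
for record in records:
    if record.digest in seen_digests:
        continue
    seen_digests.add(record.digest)
    contents.append(await record.fetch())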
Error Recovery¶
async def fetch_with_retry(record, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await record.fetch()
        except ContentNotFoundError:
            raise  # Don't retry - content doesn't exist
        except CDXError:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
    raise CDXError("Max retries exceeded")
Use Cases¶
Track Website Evolution¶
# Get all snapshots of a page
snapshots = await client.get_snapshots(
    "https://example.com/about",
    from_date=datetime(2010, 1, 1),
)

# Build timeline
timeline = await analyzer.build_timeline(snapshots)

# Find significant changes
for change in timeline.changes:
    if change.similarity < 0.8:  # Major change
        print(f"{change.date}: Major update")
Policy Change Detection¶
# Compare current vs historical
current = await http.get("https://example.com/policy")
historical = await archive.fetch(
    "https://example.com/policy",
    date=datetime(2020, 1, 1),
)

diff = await analyzer.compare(current.text, historical)
if diff.similarity < 0.8:  # treat <80% similarity as a significant change
    print("Policy has changed significantly")