cbintel Architecture
System architecture and component relationships for the cbintel intelligence toolkit.
Overview
cbintel is a modular intelligence gathering and knowledge synthesis platform that combines:
- Web Crawling: AI-powered iterative web discovery
- Historical Archives: Internet Archive and Common Crawl retrieval
- Vector Search: Semantic similarity search with embeddings
- Browser Automation: Screenshots, PDFs, and DOM extraction
- VPN Cluster: Geographic proxy routing via OpenWRT workers
- Jobs API: Async job submission with progress tracking and result storage
System Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ cbintel CLI Tools │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────────────────┤
│ cbintel- │ cbintel- │ cbintel- │ cbintel- │ cbintel- │
│ crawl │ lazarus │ vectl │ screenshots │ cluster │
└──────┬──────┴──────┬──────┴──────┬──────┴──────┬──────┴──────────┬──────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ cbintel Sub-Services │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────────────────┤
│ crawl │ lazarus │ vectl │ screenshots │ cluster │
│ │ │ │ │ │
│ - Pipeline │ - CDX API │ - Embeddings│ - Capture │ - Device Registry │
│ - Batches │ - Discovery │ - Storage │ - PDF Gen │ - VPN Banks │
│ - Evaluate │ - Archives │ - Search │ - DOM │ - Workers │
│ - Synthesize│ - Temporal │ - Clustering│ - Links │ - HAProxy │
└──────┬──────┴──────┬──────┴──────┬──────┴──────┬──────┴──────────┬──────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Core Libraries │
├───────────────────┬───────────────────┬─────────────────────────────────────┤
│ cbintel.ai │ cbintel.net │ cbintel.io │
│ │ │ │
│ - Anthropic API │ - HTTP Client │ - HTML Processing │
│ - Ollama Client │ - URL Cleaning │ - Markdown Conversion │
│ - CBAI Unified │ - Web Search │ - File Storage │
│ - Embeddings │ - Proxy Support │ - Session Management │
└─────────┬─────────┴─────────┬─────────┴───────────────┬─────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ External Dependencies │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────────────────┤
│ Anthropic │ Ollama │ Playwright │ cdx_toolkit │ OpenWRT/LuCI │
│ Claude API │ Local LLM │ Browsers │ Web Archive │ RPC Interface │
└─────────────┴─────────────┴─────────────┴─────────────┴─────────────────────┘
Component Details
1. cbintel.crawl - AI-Powered Web Crawling
Iterative web discovery with AI-driven evaluation and synthesis.
User Query
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Discover │────▶│ Retrieve │────▶│ Process │
│ (Search) │ │ (Fetch) │ │ (Parse/Clean)│
└─────────────┘ └─────────────┘ └──────┬──────┘
│
┌──────────────────────────────────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Evaluate │────▶│ Decide │────▶│ Synthesize │
│ (AI Score) │ │ (Continue?) │ │ (Report) │
└─────────────┘ └──────┬──────┘ └─────────────┘
│
│ More batches needed
▼
┌─────────────┐
│ Child Batch │
│ (New URLs) │
└─────────────┘
Key Features:
- Multi-model AI support (Anthropic Claude, Ollama local models)
- Iterative batch processing with child URL discovery
- Quality-based content evaluation
- Automatic synthesis and report generation
2. cbintel.lazarus - Historical Web Archives
Retrieve and analyze historical web content from Internet Archive and Common Crawl.
Domain/URL
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Discovery │────▶│ CDX API │────▶│ Retrieve │
│ (gau) │ │ Query │ │ Content │
└─────────────┘ └─────────────┘ └──────┬──────┘
│
┌──────────────────────────────────────────┘
│
▼
┌─────────────┐ ┌─────────────┐
│ Temporal │────▶│ Timeline │
│ Analysis │ │ Report │
└─────────────┘ └─────────────┘
Components:
- CDXClient: Query Internet Archive/Common Crawl CDX APIs
- URLDiscovery: Discover URLs via gau (wayback, commoncrawl, etc.)
- ArchiveClient: High-level orchestration of discovery + retrieval
- TemporalAnalyzer: Time-series content change analysis
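To make the CDX query step more concrete, here is a minimal sketch that builds a query URL for the Internet Archive's public CDX server and parses its JSON output, where the first row is a header. The helper names (`build_cdx_url`, `parse_cdx_rows`) and the sample response are hypothetical and are not taken from cbintel's CDXClient.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint (an assumption; cbintel may use another).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, from_ts=None, to_ts=None, limit=None):
    """Build a CDX query URL that asks for JSON output."""
    params = {"url": url, "output": "json"}
    if from_ts:
        params["from"] = from_ts
    if to_ts:
        params["to"] = to_ts
    if limit:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(payload):
    """Turn the JSON array-of-arrays response into a list of dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Hand-written sample in the shape the CDX server returns (illustrative only).
sample = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20200101000000", "http://example.com/", "text/html", "200", "AAAA", "1234"],
])
snapshots = parse_cdx_rows(sample)
```

Each parsed row carries a `timestamp` and `original` URL, which is enough to request the archived copy in the retrieval step.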
3. cbintel.vectl - Vector Embeddings & Search
Semantic similarity search using text embeddings and vector storage.
Documents
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Chunk │────▶│ Embed │────▶│ Store │
│ Text │ │ (Ollama) │ │ (vectl) │
└─────────────┘ └─────────────┘ └──────┬──────┘
│
Query │
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Embed │────▶│ Search │────▶│ Results │
│ Query │ │ (K-means) │ │ (Ranked) │
└─────────────┘ └─────────────┘ └─────────────┘
Components:
- EmbeddingService: Generate 768D vectors via Ollama (nomic-embed-text)
- VectorStore: K-means clustered storage (vectl C++ or NumPy fallback)
- SemanticSearch: Text-to-text similarity search
- ChunkingService: Split documents into overlapping chunks
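Overlapping chunking can be sketched in a few lines. The function below is an illustration, not ChunkingService itself: the name `chunk_words` and the 64-word overlap are assumptions, and only the 512-word chunk size comes from this document.

```python
def chunk_words(text, chunk_size=512, overlap=64):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some words twice.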
4. cbintel.screenshots - Browser Automation
Screenshot capture, PDF generation, and DOM extraction using Playwright.
URL
│
▼
┌─────────────┐
│ Playwright │
│ Browser │
└──────┬──────┘
│
├────────────────┬────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Screenshot │ │ PDF │ │ DOM │
│ Capture │ │ Generation │ │ Extraction │
└─────────────┘ └─────────────┘ └─────────────┘
Components:
- ScreenshotService: Full-page and element screenshots
- PDFService: PDF generation with configurable format/margins
- DOMService: Element extraction with bounding boxes
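The three outputs map onto Playwright's synchronous Python API roughly as follows. This is a sketch under assumptions: `pdf_options` and `capture` are hypothetical helpers, the file names and the 10 mm margin are placeholders, and cbintel's own services may be structured quite differently (for example, around the async API).

```python
def pdf_options(path, page_format="A4", margin_mm=10):
    """Build keyword arguments for Playwright's page.pdf()."""
    side = f"{margin_mm}mm"
    return {
        "path": path,
        "format": page_format,
        "margin": {"top": side, "right": side, "bottom": side, "left": side},
    }


def capture(url, selector="body", png_path="page.png", pdf_path="page.pdf"):
    """Take a full-page screenshot, print a PDF, and measure one element."""
    # Imported lazily so pdf_options() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # page.pdf() is a Chromium-only feature
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=png_path, full_page=True)
        page.pdf(**pdf_options(pdf_path))
        # bounding_box() returns a dict with x, y, width, height, or None
        box = page.locator(selector).first.bounding_box()
        browser.close()
    return box
```

Calling `capture("https://example.com")` would write both files to the working directory and return the bounding box of the page body.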
5. cbintel.cluster - VPN Cluster Management
Geographic VPN routing via 16 OpenWRT workers with HAProxy load balancing.
┌─────────────────────────────────────────────────────────────────┐
│ Host Server │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ FastAPI Cluster API (port 9002) │ │
│ │ /api/v1/banks /api/v1/workers /api/v1/devices │ │
│ └──────────────────────────┬──────────────────────────────┘ │
└──────────────────────────────┼──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Master Router (17.0.0.1) │
│ ┌─────────────┐ ┌─────────────────────────────────────┐ │
│ │ HAProxy │ │ LuCI RPC │ │
│ │ 8890-8999 │ │ Device Control Interface │ │
│ └──────┬──────┘ └─────────────────────────────────────┘ │
└─────────┼───────────────────────────────────────────────────────┘
│
│ Load Balance
▼
┌─────────────────────────────────────────────────────────────────┐
│ OpenWRT Workers (16x) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Worker 1 │ │Worker 2 │ │Worker 3 │ ... │Worker 16│ │
│ │17.0.0.10│ │17.0.0.11│ │17.0.0.12│ │17.0.0.25│ │
│ │ │ │ │ │ │ │ │ │
│ │ OpenVPN │ │ OpenVPN │ │ OpenVPN │ │ OpenVPN │ │
│ │TinyProxy│ │TinyProxy│ │TinyProxy│ │TinyProxy│ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
└───────┼──────────┼──────────┼────────────────────┼──────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ ProtonVPN Exit Nodes │
│ (~12,900 profiles across 127 countries) │
└─────────────────────────────────────────────────────────────────┘
Components:
- DeviceRegistry: Comprehensive device tracking with WireGuard info
- DeviceService: Ping, speedtest, execute, reboot operations
- BankService: Geographic VPN pool management
- WorkerService: VPN and proxy control per worker
- StateManager: Persistent JSON-based state
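Persistent JSON state is easy to corrupt if the process dies mid-write, so one common approach is to write a temporary file and swap it into place atomically. The sketch below illustrates that idea only; the `JsonState` class and the bank record layout are invented for the example and do not describe StateManager's actual interface or schema.

```python
import json
import os
import tempfile


class JsonState:
    """Tiny persistent key/value store backed by one JSON file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.data = json.load(fh)

    def save(self):
        """Write to a temp file, then atomically swap it into place."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(self.data, fh, indent=2, sort_keys=True)
            os.replace(tmp_path, self.path)  # atomic on the same filesystem
        except BaseException:
            os.unlink(tmp_path)
            raise


# Demo in a throwaway directory; the record shape is made up.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bank-state.json")
    state = JsonState(path)
    state.data["bank-us"] = {"country": "US", "workers": [1, 2]}
    state.save()
    reloaded = JsonState(path).data
    leftovers = os.listdir(tmp)
```

Creating the temporary file in the same directory as the target keeps both on one filesystem, which is what lets `os.replace` swap them atomically.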
6. cbintel.jobs - Async Job Processing API
Unified job submission and processing for all cbintel modules.
Client Request
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Jobs API (port 9003) │
│ POST /api/v1/jobs/{crawl,lazarus,vectl,screenshots} │
│ GET /api/v1/jobs/{job_id} (poll status) │
│ DELETE /api/v1/jobs/{job_id} (cancel) │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Job Queue (Redis/In-Memory) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└───────┼─────────────┼─────────────┼─────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CrawlWorker │ │LazarusWorker│ │ VectlWorker │ │ Screenshot │
│ │ │ │ │ │ │ Worker │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ files.nominate.ai (Result Storage) │
│ cbintel-jobs bucket for job outputs │
└─────────────────────────────────────────────────────────────────┘
Components:
- JobQueue: Redis-backed task queue with in-memory fallback
- BaseWorker: Abstract worker with progress callbacks
- CrawlWorker: Wraps CrawlPipeline for research queries
- LazarusWorker: Wraps ArchiveClient for historical retrieval
- VectlWorker: Wraps EmbeddingService for vector operations
- ScreenshotWorker: Wraps ScreenshotService for captures
- FilesClient: Uploads results to files.nominate.ai
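The worker pattern described above (an abstract base class that reports progress through a callback, fed by a queue) might look something like the following. Everything in this sketch is hypothetical: the method names, the callback signature, the `EchoWorker` stand-in, and the use of `queue.Queue` in place of the Redis-backed queue are assumptions made for illustration.

```python
import queue
from abc import ABC, abstractmethod


class BaseWorker(ABC):
    """Abstract worker that reports progress through a callback."""

    def __init__(self, on_progress=None):
        # Default to a no-op so subclasses can always call on_progress.
        self.on_progress = on_progress or (lambda job_id, pct, msg: None)

    @abstractmethod
    def run(self, job_id, params):
        """Do the work and return a result payload."""


class EchoWorker(BaseWorker):
    """Stand-in for CrawlWorker and friends; it only echoes its input."""

    def run(self, job_id, params):
        self.on_progress(job_id, 0, "started")
        result = {"echo": params}
        self.on_progress(job_id, 100, "done")
        return result


def drain(jobs, workers):
    """Process queued (job_id, kind, params) tuples until the queue is empty."""
    results = {}
    while True:
        try:
            job_id, kind, params = jobs.get_nowait()
        except queue.Empty:
            return results
        results[job_id] = workers[kind].run(job_id, params)


events = []
jobs = queue.Queue()  # in-memory stand-in for the Redis-backed queue
jobs.put(("job-1", "echo", {"query": "example"}))
worker = EchoWorker(on_progress=lambda job_id, pct, msg: events.append((job_id, pct, msg)))
results = drain(jobs, {"echo": worker})
```

Keeping progress reporting in the base class means the API layer can expose one polling format regardless of which module runs the job.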
Job Lifecycle:
Data Flow
Typical Crawl Pipeline
1. User submits query via cbintel-crawl CLI
│
▼
2. Search engine discovers initial URLs
│
▼
3. URLs fetched, HTML parsed to markdown
│
▼
4. AI evaluates content relevance (0-10 score)
│
▼
5. High-scoring pages analyzed for child URLs
│
▼
6. Child batch created with discovered URLs
│
▼
7. Repeat steps 3-6 until depth limit or satisfaction
│
▼
8. Final synthesis generates report
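The loop in steps 3-7 can be sketched as a breadth-first walk over batches. This is an illustration under assumptions, not CrawlPipeline itself: `fetch` and `score` are injected stand-ins for the real fetcher and AI evaluator, and the `min_score` threshold of 7 and the `Page` type are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    text: str
    links: list = field(default_factory=list)


def crawl(seed_urls, fetch, score, max_depth=2, min_score=7):
    """Breadth-first batches: fetch, score, and follow links from good pages."""
    seen, kept = set(seed_urls), []
    batch = list(seed_urls)
    for _depth in range(max_depth + 1):
        child_batch = []
        for url in batch:
            page = fetch(url)                # step 3: fetch and parse
            if score(page) < min_score:      # step 4: 0-10 relevance score
                continue
            kept.append(page)
            for link in page.links:          # step 5: harvest child URLs
                if link not in seen:
                    seen.add(link)
                    child_batch.append(link)
        if not child_batch:
            break                            # nothing left to explore
        batch = child_batch                  # step 6: next batch
    return kept                              # input to the synthesis step


# A four-page fake web; "c" scores low, so its links are never followed.
web = {
    "a": Page("a", "relevant", ["b", "c"]),
    "b": Page("b", "relevant", ["a", "d"]),
    "c": Page("c", "noise", ["e"]),
    "d": Page("d", "relevant", []),
}
kept_urls = [
    page.url
    for page in crawl(["a"], web.__getitem__, lambda p: 9 if p.text == "relevant" else 2)
]
```

The `seen` set prevents the loop from revisiting pages that link back to each other, and the depth limit bounds the cost even when every page scores well.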
Archive Research Pipeline
1. User queries domain via cbintel-lazarus
│
▼
2. gau discovers historical URLs from wayback
│
▼
3. CDX API queried for snapshots of each URL
│
▼
4. Content retrieved from Internet Archive
│
▼
5. Temporal analysis detects content changes
│
▼
6. Timeline report generated
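One simple way to detect content changes across snapshots (step 5) is to hash each capture and record the timestamps where the hash differs from the previous one. The sketch below shows that approach; it is an assumption about how such analysis could work, not a description of TemporalAnalyzer, and the sample snapshots are made up.

```python
import hashlib


def content_changes(snapshots):
    """Return timestamps at which the content differs from the previous capture.

    `snapshots` is an iterable of (timestamp, text) pairs. Timestamps use the
    14-digit Wayback format, so sorting them as strings is chronological.
    The first capture always counts as a change (its first appearance).
    """
    changes, previous = [], None
    for timestamp, text in sorted(snapshots):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest != previous:
            changes.append(timestamp)
        previous = digest
    return changes


timeline = content_changes([
    ("20210301000000", "Pricing: $20"),
    ("20200101000000", "Pricing: $10"),
    ("20200601000000", "Pricing: $10"),
])
```

Exact hashing flags any byte-level difference, so a real analyzer would likely normalize the text first (for example, stripping timestamps or injected archive banners) to avoid noisy results.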
Semantic Search Pipeline
1. Documents indexed via cbintel-vectl index
│
▼
2. Text chunked into 512-word segments
│
▼
3. Ollama generates embeddings (768D vectors)
│
▼
4. Vectors stored with K-means clustering
│
▼
5. User queries via cbintel-vectl search
│
▼
6. Query embedded and matched to clusters
│
▼
7. Cosine similarity ranks results
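Steps 6 and 7 amount to a two-stage lookup: choose the nearest cluster, then rank only that cluster's vectors by cosine similarity. The pure-Python sketch below uses toy 2-D vectors in place of 768-D embeddings; the function names and data layout are hypothetical, and matching centroids by cosine rather than Euclidean distance is a simplifying assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, centroids, clusters, top_k=3):
    """Pick the closest centroid, then rank that cluster's vectors by cosine."""
    best = max(range(len(centroids)), key=lambda i: cosine(query_vec, centroids[i]))
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in clusters[best]]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]


# Toy 2-D data standing in for 768-D embeddings.
centroids = [(1.0, 0.0), (0.0, 1.0)]
clusters = [
    [("doc-a", (0.9, 0.1)), ("doc-b", (0.7, 0.3))],
    [("doc-c", (0.1, 0.9))],
]
hits = search((1.0, 0.05), centroids, clusters, top_k=2)
```

Searching a single cluster trades a little recall for speed: a relevant vector that k-means assigned to a neighboring cluster will be missed unless more than one cluster is probed.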
Configuration
Environment Variables
# AI/LLM
ANTHROPIC_API_KEY=sk-ant-...
OLLAMA_BASE_URL=http://127.0.0.1:11434
OLLAMA_EMBED_MODEL=nomic-embed-text
# Cluster API
OPENWRT_USERNAME=root
OPENWRT_PASSWORD=<password>
MASTER_IP=17.0.0.1
CLUSTER_API_PORT=9002
# Storage
BANK_STATE_FILE=/var/lib/vpn-banks/bank-state.json
DEVICE_REGISTRY_FILE=/var/lib/vpn-banks/device-registry.json
# VPN Profiles
PROFILES_BASE=/path/to/profiles/intl-ovpn
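A small loader can gather these variables into one typed object at startup. The sketch below is hypothetical: the `Settings` class is not part of cbintel, and treating the example values above as defaults is an assumption about the intended behavior.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    ollama_base_url: str
    ollama_embed_model: str
    master_ip: str
    cluster_api_port: int
    anthropic_api_key: str


def load_settings(env=None):
    """Read settings from the environment, falling back to the example values."""
    env = os.environ if env is None else env
    return Settings(
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://127.0.0.1:11434"),
        ollama_embed_model=env.get("OLLAMA_EMBED_MODEL", "nomic-embed-text"),
        master_ip=env.get("MASTER_IP", "17.0.0.1"),
        cluster_api_port=int(env.get("CLUSTER_API_PORT", "9002")),
        anthropic_api_key=env.get("ANTHROPIC_API_KEY", ""),  # never default a secret
    )
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real process environment.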
Deployment
Development
# Clone and install
git clone <repo>
cd cbintel
pip install -e ".[dev]"
# Install Playwright browsers
playwright install
# Run tests
pytest
Production
# Install package
pip install .
# Start services manually
cbintel-cluster # VPN Cluster API on port 9002
cbintel-jobs # Jobs API on port 9003
# Or via systemd (recommended)
sudo systemctl enable cbcluster cbjobs
sudo systemctl start cbcluster cbjobs
Systemd Services
| Service | Description | Port | URL |
|---|---|---|---|
| cbcluster.service | VPN Cluster Management API | 32203 | https://intel.nominate.ai |
| cbjobs.service | Async Job Processing API | 9003 | https://jobs.nominate.ai |
Service files location: /etc/systemd/system/
Nginx configs location: /etc/nginx/sites-nominate/
# Check service status
sudo systemctl status cbcluster cbjobs
# View logs
sudo journalctl -u cbjobs -f
sudo journalctl -u cbcluster -f
# Restart after code changes
sudo systemctl restart cbjobs
File Organization
cbintel/
├── src/cbintel/
│ ├── ai/ # AI client wrappers
│ ├── net/ # Network operations
│ ├── io/ # File/process I/O
│ ├── crawl/ # Crawl pipeline
│ ├── lazarus/ # Historical archives
│ ├── vectl/ # Vector search
│ ├── screenshots/ # Browser automation
│ ├── cluster/ # VPN cluster API
│ └── jobs/ # Async job queue with workers
├── docs/ # Documentation
├── extern/ # External project symlinks
└── tests/ # Test suite
Security Considerations
- API Keys: Store in environment variables, never in code
- VPN Profiles: Contain credentials, restrict file permissions
- Cluster API: Currently no authentication (TODO for production)
- Command Execution: Device execute endpoint runs as root
- Proxy Traffic: All cluster traffic routes through VPN tunnels
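For the VPN profile point, a short audit script can flag and tighten files that other users can access. This is a minimal sketch for POSIX systems; the function names and the demo file are hypothetical, and the right mode for a given deployment may differ from the 0600 used here.

```python
import os
import stat
import tempfile


def loose_permissions(root):
    """Yield files under `root` that group or other users can access."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if mode & (stat.S_IRWXG | stat.S_IRWXO):
                yield path


def lock_down(root):
    """Restrict every flagged file to owner read/write (0600)."""
    fixed = []
    for path in loose_permissions(root):
        os.chmod(path, 0o600)
        fixed.append(path)
    return fixed


# Demo against a throwaway directory with one made-up profile file.
with tempfile.TemporaryDirectory() as root:
    profile = os.path.join(root, "example.ovpn")
    with open(profile, "w", encoding="utf-8") as fh:
        fh.write("auth-user-pass\n")
    os.chmod(profile, 0o644)  # world-readable on purpose
    fixed = lock_down(root)
    mode_after = stat.S_IMODE(os.stat(profile).st_mode)
    remaining = list(loose_permissions(root))
```

Pointing `lock_down` at the directory named by `PROFILES_BASE` would apply the same check to the real profiles.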