Job Types¶

cbintel supports seven job types, each handled by a specialized worker.

Overview¶

Job Type	Worker	Input	Output
`crawl`	CrawlWorker	Query, config	URLs, chunks, synthesis
`lazarus`	LazarusWorker	Domain/URL, date range	Snapshots, content
`vectl`	VectlWorker	Text or query	Embeddings or matches
`screenshot`	ScreenshotWorker	URLs	Images
`transcript`	TranscriptWorker	Video ID	Transcript text
`browser`	BrowserWorker	URL, actions	Automation results
`graph`	GraphWorker	Graph YAML, params	Graph execution result

crawl¶

AI-powered web crawling with iterative batch processing.

Request Schema¶

{
  "query": "AI regulation trends",
  "max_urls": 50,
  "max_depth": 3,
  "geo": "us:ca",
  "ai_model": "claude-3-5-sonnet-20241022",
  "min_score": 6.0,
  "search_provider": "duckduckgo"
}

Field	Type	Required	Default	Description
`query`	string	Yes	-	Research query
`max_urls`	int	No	50	Maximum URLs to process
`max_depth`	int	No	3	Maximum batch depth
`geo`	string	No	null	Geographic routing, e.g. `us:ca`
`ai_model`	string	No	claude-3-5-sonnet	AI model for evaluation
`min_score`	float	No	6.0	Minimum relevance score
`search_provider`	string	No	duckduckgo	Search engine
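The defaults in the table above can be captured in a small request model. This is a minimal sketch, not part of cbintel: the `CrawlRequest` class and its `to_payload` method are hypothetical names, and the empty-query check is an assumed client-side convenience.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CrawlRequest:
    """Hypothetical crawl payload model; defaults mirror the field table."""

    query: str
    max_urls: int = 50
    max_depth: int = 3
    geo: Optional[str] = None
    ai_model: str = "claude-3-5-sonnet"
    min_score: float = 6.0
    search_provider: str = "duckduckgo"

    def to_payload(self) -> dict:
        # `query` is the only required field, so reject blank values early.
        if not self.query.strip():
            raise ValueError("query is required")
        # Omit fields left as None (only `geo` can be) instead of sending null.
        return {k: v for k, v in asdict(self).items() if v is not None}


payload = CrawlRequest(query="AI regulation trends", geo="us:ca", max_urls=30).to_payload()
```

Fields that are not overridden keep the documented defaults, so `payload` carries `max_depth` 3 and `min_score` 6.0 alongside the explicit values.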

Response Schema¶

{
  "job_id": "job_abc123",
  "status": "COMPLETED",
  "result": {
    "total_urls": 42,
    "urls_processed": 42,
    "chunks_generated": 156,
    "embeddings_stored": true,
    "synthesis": "AI regulation is evolving rapidly...",
    "report_url": "https://files.nominate.ai/..."
  }
}

Example¶

curl -X POST https://intel.nominate.ai/api/v1/jobs/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the latest AI safety regulations?",
    "geo": "us:ca",
    "max_urls": 30
  }'
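For readers who prefer Python over curl, the same request can be assembled with the standard library. This sketch only builds the request object and does not send it; the helper name `build_crawl_request` is an illustration, not a cbintel API.

```python
import json
import urllib.request

CRAWL_ENDPOINT = "https://intel.nominate.ai/api/v1/jobs/crawl"


def build_crawl_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        CRAWL_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_crawl_request(
    {
        "query": "What are the latest AI safety regulations?",
        "geo": "us:ca",
        "max_urls": 30,
    }
)
# To submit it, pass `request` to urllib.request.urlopen (needs network access).
```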

lazarus¶

Historical archive retrieval from Internet Archive and Common Crawl.

Request Schema¶

{
  "domain": "example.com",
  "url": "https://example.com/page",
  "from_date": "2020-01-01",
  "to_date": "2024-01-01",
  "sources": ["wayback", "commoncrawl"],
  "limit": 100
}

Field	Type	Required	Default	Description
`domain`	string	No*	-	Domain to discover URLs
`url`	string	No*	-	Specific URL to retrieve
`from_date`	date	No	null	Start date (YYYY-MM-DD)
`to_date`	date	No	null	End date (YYYY-MM-DD)
`sources`	string[]	No	["wayback"]	Archive sources
`limit`	int	No	100	Max results

*Either `domain` or `url` is required.
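The domain-or-url rule can be checked on the client before a job is submitted. The following is a rough sketch: `build_lazarus_payload` is a hypothetical helper, and the date-order check is an assumption about what callers would want, not documented server behavior.

```python
from datetime import date
from typing import Optional, Sequence


def build_lazarus_payload(
    domain: Optional[str] = None,
    url: Optional[str] = None,
    from_date: Optional[date] = None,
    to_date: Optional[date] = None,
    sources: Sequence[str] = ("wayback",),
    limit: int = 100,
) -> dict:
    """Assemble a lazarus request body, enforcing the domain-or-url rule."""
    if not domain and not url:
        raise ValueError("either domain or url is required")
    # Assumption: an inverted date range is a caller mistake worth catching early.
    if from_date and to_date and from_date > to_date:
        raise ValueError("from_date must not be after to_date")

    payload = {"sources": list(sources), "limit": limit}
    optional = {
        "domain": domain,
        "url": url,
        "from_date": from_date.isoformat() if from_date else None,
        "to_date": to_date.isoformat() if to_date else None,
    }
    # Include only the optional fields the caller actually supplied.
    payload.update({k: v for k, v in optional.items() if v})
    return payload


payload = build_lazarus_payload(
    domain="example.com", from_date=date(2020, 1, 1), to_date=date(2024, 1, 1)
)
```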

Response Schema¶

{
  "job_id": "job_def456",
  "status": "COMPLETED",
  "result": {
    "urls_discovered": 1523,
    "snapshots_retrieved": 87,
    "date_range": {
      "earliest": "2015-03-21",
      "latest": "2024-01-10"
    },
    "output_url": "https://files.nominate.ai/..."
  }
}

vectl¶

Vector embedding generation and semantic search.

Embed Request¶

{
  "operation": "embed",
  "texts": ["First document", "Second document"],
  "model": "nomic-embed-text",
  "store": "my-index"
}

Search Request¶

{
  "operation": "search",
  "query": "machine learning algorithms",
  "store": "my-index",
  "top_k": 10
}

Field	Type	Required	Default	Description
`operation`	string	Yes	-	"embed" or "search"
`texts`	string[]	For embed	-	Texts to embed
`query`	string	For search	-	Search query
`store`	string	No	default	Vector store name
`top_k`	int	No	10	Number of results
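Because `texts` and `query` are each required for only one operation, a payload builder can branch on `operation`. This is an illustrative sketch; `build_vectl_payload` is a hypothetical name, and the `model` field from the embed example is left out because its default is not listed in the table.

```python
from typing import Optional, Sequence


def build_vectl_payload(
    operation: str,
    *,
    texts: Optional[Sequence[str]] = None,
    query: Optional[str] = None,
    store: str = "default",
    top_k: int = 10,
) -> dict:
    """Build an embed or search body with the fields that operation needs."""
    if operation == "embed":
        if not texts:
            raise ValueError("texts is required for embed")
        return {"operation": "embed", "texts": list(texts), "store": store}
    if operation == "search":
        if not query:
            raise ValueError("query is required for search")
        return {"operation": "search", "query": query, "store": store, "top_k": top_k}
    raise ValueError(f"unknown operation: {operation!r}")


embed_body = build_vectl_payload("embed", texts=["First document"], store="my-index")
search_body = build_vectl_payload("search", query="machine learning algorithms")
```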

Response Schema¶

{
  "job_id": "job_ghi789",
  "status": "COMPLETED",
  "result": {
    "operation": "search",
    "matches": [
      {"id": "doc1", "score": 0.92, "text": "..."},
      {"id": "doc2", "score": 0.87, "text": "..."}
    ]
  }
}
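A search response like the one above is often post-filtered on the client. The sketch below assumes that a higher `score` means a closer match, which the schema suggests but does not state; `matches_above` is a hypothetical helper.

```python
def matches_above(response: dict, min_score: float) -> list:
    """Return matches scoring at least `min_score`, best first."""
    matches = response.get("result", {}).get("matches", [])
    kept = [m for m in matches if m["score"] >= min_score]
    return sorted(kept, key=lambda m: m["score"], reverse=True)


sample = {
    "job_id": "job_ghi789",
    "status": "COMPLETED",
    "result": {
        "operation": "search",
        "matches": [
            {"id": "doc2", "score": 0.87, "text": "..."},
            {"id": "doc1", "score": 0.92, "text": "..."},
        ],
    },
}
strong = matches_above(sample, 0.9)
```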

screenshot¶

Browser screenshot capture.

Request Schema¶

{
  "urls": ["https://example.com", "https://example.org"],
  "full_page": true,
  "format": "png",
  "viewport_width": 1920,
  "viewport_height": 1080,
  "geo": "us:ca"
}

Field	Type	Required	Default	Description
`urls`	string[]	Yes	-	URLs to capture
`full_page`	bool	No	true	Full page capture
`format`	string	No	"png"	Image format
`viewport_width`	int	No	1920	Viewport width
`viewport_height`	int	No	1080	Viewport height
`geo`	string	No	null	Geographic routing, e.g. `us:ca`

Response Schema¶

{
  "job_id": "job_jkl012",
  "status": "COMPLETED",
  "result": {
    "urls_processed": 2,
    "format": "png",
    "screenshots": [
      {
        "url": "https://example.com",
        "file_url": "https://files.nominate.ai/...",
        "width": 1920,
        "height": 3500
      }
    ]
  }
}
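When several URLs are submitted, it can help to check which ones came back with a screenshot. This sketch assumes each entry's `url` matches the requested URL exactly (no redirect or trailing-slash normalization); `missing_screenshots` is a hypothetical helper.

```python
def missing_screenshots(requested_urls: list, response: dict) -> list:
    """List requested URLs that have no entry in the response's screenshots."""
    captured = {shot["url"] for shot in response["result"]["screenshots"]}
    return [url for url in requested_urls if url not in captured]


response = {
    "result": {
        "screenshots": [
            {"url": "https://example.com", "file_url": "https://files.nominate.ai/..."}
        ]
    }
}
missing = missing_screenshots(["https://example.com", "https://example.org"], response)
```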

transcript¶

YouTube video transcript extraction.

Request Schema¶

{
  "video_id": "dQw4w9WgXcQ",
  "language": "en",
  "include_timestamps": true
}

Field	Type	Required	Default	Description
`video_id`	string	Yes	-	YouTube video ID
`language`	string	No	"en"	Transcript language
`include_timestamps`	bool	No	true	Include timestamps
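Since `video_id` expects a bare ID rather than a full link, callers holding a YouTube URL need to extract the ID first. A minimal sketch using only the standard library follows; the helper name and the list of recognized hosts are assumptions and do not cover every YouTube URL form.

```python
from urllib.parse import parse_qs, urlparse

WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}
SHORT_HOSTS = {"youtu.be", "www.youtu.be"}


def extract_video_id(url_or_id: str) -> str:
    """Return the video ID from a watch or short URL, or the input if already bare."""
    parsed = urlparse(url_or_id)
    if not parsed.scheme:
        # No scheme, so treat the input as an ID that needs no parsing.
        return url_or_id
    host = parsed.netloc.lower()
    video_id = ""
    if host in SHORT_HOSTS:
        video_id = parsed.path.strip("/")
    elif host in WATCH_HOSTS:
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    if not video_id:
        raise ValueError(f"no video ID found in {url_or_id!r}")
    return video_id


payload = {"video_id": extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ")}
```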

Response Schema¶

{
  "job_id": "job_mno345",
  "status": "COMPLETED",
  "result": {
    "video_id": "dQw4w9WgXcQ",
    "title": "Video Title",
    "duration_seconds": 212,
    "transcript": [
      {"start": 0.0, "text": "Never gonna give you up"},
      {"start": 3.5, "text": "Never gonna let you down"}
    ],
    "full_text": "Never gonna give you up..."
  }
}
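The `transcript` array pairs each line with a `start` offset, which this sketch treats as seconds. That unit is an assumption based on the example values. The hypothetical helper below renders segments as timestamped lines.

```python
def format_transcript(segments: list) -> str:
    """Render transcript segments as '[MM:SS] text' lines."""
    lines = []
    for segment in segments:
        # Truncate fractional seconds, then split into minutes and seconds.
        minutes, seconds = divmod(int(segment["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {segment['text']}")
    return "\n".join(lines)


text = format_transcript(
    [
        {"start": 0.0, "text": "Never gonna give you up"},
        {"start": 3.5, "text": "Never gonna let you down"},
    ]
)
```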

browser¶

Ferret browser automation.

Request Schema¶

{
  "url": "https://example.com",
  "actions": [
    {"type": "fill", "selector": "input[name='q']", "value": "test"},
    {"type": "click", "selector": "button[type='submit']"},
    {"type": "wait_for_element", "selector": ".results"},
    {"type": "extract_text", "selector": ".results"}
  ],
  "timeout": 30000
}

Field	Type	Required	Default	Description
`url`	string	Yes	-	Starting URL
`actions`	object[]	Yes	-	Action sequence
`timeout`	int	No	30000	Timeout in milliseconds
`geo`	string	No	null	Geographic routing, e.g. `us:ca`

Action Types¶

Type	Parameters	Description
`navigate`	`url`	Navigate to URL
`click`	`selector`	Click element
`fill`	`selector`, `value`	Fill input
`select`	`selector`, `value`	Select option
`wait`	`ms`	Wait for a fixed duration in milliseconds
`wait_for_element`	`selector`	Wait for element
`extract_text`	`selector`	Extract text
`screenshot`	-	Take screenshot
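The action table above maps directly onto a lookup of required parameters, which makes it easy to validate an action sequence before submitting it. This is a client-side sketch only; `validate_actions` is a hypothetical helper and the server may apply different or stricter checks.

```python
# Required parameters per action type, copied from the action table.
REQUIRED_PARAMS = {
    "navigate": {"url"},
    "click": {"selector"},
    "fill": {"selector", "value"},
    "select": {"selector", "value"},
    "wait": {"ms"},
    "wait_for_element": {"selector"},
    "extract_text": {"selector"},
    "screenshot": set(),
}


def validate_actions(actions: list) -> list:
    """Return a list of human-readable problems; empty means the sequence looks valid."""
    errors = []
    for index, action in enumerate(actions):
        kind = action.get("type")
        if kind not in REQUIRED_PARAMS:
            errors.append(f"action {index}: unknown type {kind!r}")
            continue
        missing = REQUIRED_PARAMS[kind] - set(action)
        if missing:
            errors.append(f"action {index} ({kind}): missing {sorted(missing)}")
    return errors


problems = validate_actions(
    [
        {"type": "fill", "selector": "input[name='q']", "value": "test"},
        {"type": "click", "selector": "button[type='submit']"},
        {"type": "wait_for_element", "selector": ".results"},
        {"type": "extract_text", "selector": ".results"},
    ]
)
```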

Response Schema¶

{
  "job_id": "job_pqr678",
  "status": "COMPLETED",
  "result": {
    "success": true,
    "actions_executed": 4,
    "extracted_data": {
      "results": "Search results text..."
    },
    "final_url": "https://example.com/results"
  }
}

graph¶

Research graph execution.

Request Schema¶

{
  "graph": "name: research_pipeline\nstages:\n  - name: discover\n    ...",
  "template": "basic_research",
  "params": {
    "query": "AI regulation",
    "max_urls": 50
  },
  "workspace_id": "ws_xyz789"
}

Field	Type	Required	Default	Description
`graph`	string	No*	-	Inline YAML graph
`template`	string	No*	-	Template name
`params`	object	No	{}	Graph parameters
`workspace_id`	string	No	null	Workspace for artifacts

*Either `graph` or `template` is required.
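Because `graph` carries YAML inside a JSON string, it is easier to write the YAML as a multi-line string and let the JSON encoder escape the newlines. In this sketch `build_graph_payload` is a hypothetical helper, and the YAML is only the visible prefix of the example above, not a complete or validated graph.

```python
import json
from typing import Optional

# Only the portion of the example graph shown in the request schema.
GRAPH_YAML = """\
name: research_pipeline
stages:
  - name: discover
"""


def build_graph_payload(
    *,
    graph: Optional[str] = None,
    template: Optional[str] = None,
    params: Optional[dict] = None,
    workspace_id: Optional[str] = None,
) -> dict:
    """Assemble a graph request body, enforcing the graph-or-template rule."""
    if not graph and not template:
        raise ValueError("either graph or template is required")
    payload = {"params": dict(params or {})}
    for key, value in (("graph", graph), ("template", template), ("workspace_id", workspace_id)):
        if value:
            payload[key] = value
    return payload


# json.dumps turns each newline in the YAML into an escaped \n sequence.
body = json.dumps(build_graph_payload(graph=GRAPH_YAML, params={"query": "AI regulation"}))
```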

Response Schema¶

{
  "job_id": "job_stu901",
  "status": "COMPLETED",
  "result": {
    "graph_name": "research_pipeline",
    "stages_completed": 5,
    "stages_total": 5,
    "duration_seconds": 245,
    "outputs": {
      "urls": [...],
      "synthesis": "Research findings..."
    },
    "artifacts_url": "https://files.nominate.ai/..."
  }
}

Submitting Jobs¶

REST API¶

# Crawl
curl -X POST https://intel.nominate.ai/api/v1/jobs/crawl \
  -H "Content-Type: application/json" -d '...'

# Lazarus
curl -X POST https://intel.nominate.ai/api/v1/jobs/lazarus \
  -H "Content-Type: application/json" -d '...'

# Vectl
curl -X POST https://intel.nominate.ai/api/v1/jobs/vectl \
  -H "Content-Type: application/json" -d '...'

# Screenshot
curl -X POST https://intel.nominate.ai/api/v1/jobs/screenshot \
  -H "Content-Type: application/json" -d '...'

# Transcript
curl -X POST https://intel.nominate.ai/api/v1/jobs/transcript \
  -H "Content-Type: application/json" -d '...'

# Browser
curl -X POST https://intel.nominate.ai/api/v1/jobs/browser \
  -H "Content-Type: application/json" -d '...'

# Graph
curl -X POST https://intel.nominate.ai/api/v1/jobs/graph \
  -H "Content-Type: application/json" -d '...'
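All seven endpoints share one URL pattern, with the job type as the final path segment. A small sketch of that mapping follows; `job_endpoint` is a hypothetical helper, and the type check is a client-side convenience.

```python
BASE_URL = "https://intel.nominate.ai/api/v1/jobs"
JOB_TYPES = ("crawl", "lazarus", "vectl", "screenshot", "transcript", "browser", "graph")


def job_endpoint(job_type: str) -> str:
    """Return the submission URL for a job type, rejecting unknown types."""
    if job_type not in JOB_TYPES:
        raise ValueError(f"unknown job type: {job_type!r}")
    return f"{BASE_URL}/{job_type}"


endpoints = {job_type: job_endpoint(job_type) for job_type in JOB_TYPES}
```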

Python Client¶

import asyncio

from cbintel.client import JobsClient


async def main():
    client = JobsClient()

    # Pass the job type first and the request payload second
    job = await client.submit("crawl", {"query": "..."})
    job = await client.submit("graph", {"template": "...", "params": {...}})


asyncio.run(main())
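Because `submit` is awaited, several jobs can in principle be submitted concurrently with `asyncio.gather`. The sketch below uses a stub in place of the real client so it runs offline: `StubJobsClient` and its return value are hypothetical, and whether `JobsClient` supports concurrent submissions is an assumption not confirmed here.

```python
import asyncio


class StubJobsClient:
    """Hypothetical stand-in for JobsClient; echoes what was submitted."""

    async def submit(self, job_type: str, payload: dict) -> dict:
        await asyncio.sleep(0)  # yield to the event loop like a real network call
        return {"job_type": job_type, "payload": payload}


async def submit_all(client, requests: list) -> list:
    """Submit every (job_type, payload) pair concurrently, preserving order."""
    return await asyncio.gather(
        *(client.submit(job_type, payload) for job_type, payload in requests)
    )


jobs = asyncio.run(
    submit_all(
        StubJobsClient(),
        [
            ("crawl", {"query": "AI regulation trends"}),
            ("transcript", {"video_id": "dQw4w9WgXcQ"}),
        ],
    )
)
```

`asyncio.gather` returns results in the order the awaitables were passed, so `jobs[0]` corresponds to the crawl submission regardless of which call finishes first.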