Document Processing Service (cbunstruct)

Status: Active
URL: https://unstruct.nominate.ai
Port: 32210

Overview

cbunstruct is a fork of the Unstructured API, deployed for Nominate-AI. It is a FastAPI-based REST API that wraps the unstructured library for document partitioning: it accepts a range of document formats (PDF, DOCX, images, emails, and others) and extracts structured elements from them.

Quick Start

# Install all dependencies
make install

# Run the API locally (port 8000)
make run-web-app

# Run tests
make test

# Lint and format
make check
make tidy

API Usage

Main Endpoint

curl -X POST https://unstruct.nominate.ai/general/v0/general \
  -H "Content-Type: multipart/form-data" \
  -F "files=@document.pdf" \
  -F "strategy=fast"
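
The same request can be sketched in Python using only the standard library. This is a minimal sketch, not code from the cbunstruct repository: the helper names (build_multipart, partition, count_element_types) are hypothetical, and it assumes the endpoint responds with a JSON list of elements that each carry a "type" key.

```python
import json
import mimetypes
import urllib.request
import uuid
from collections import Counter
from pathlib import Path

# Endpoint taken from the curl example above.
API_URL = "https://unstruct.nominate.ai/general/v0/general"


def build_multipart(fields, filename, file_bytes):
    """Encode form fields plus one file (field name "files") as multipart/form-data."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n".encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def partition(path, strategy="fast", url=API_URL, timeout=120):
    """POST one document to the endpoint and return the decoded JSON response."""
    path = Path(path)
    body, content_type = build_multipart(
        {"strategy": strategy}, path.name, path.read_bytes()
    )
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def count_element_types(elements):
    """Count elements per "type" value (assumes each element is a dict with that key)."""
    return Counter(element.get("type", "Unknown") for element in elements)
```

A call such as count_element_types(partition("document.pdf")) would then give a quick overview of what was extracted, provided the response has the assumed shape.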

Processing Strategies

Strategy    Description
fast        Default; no OCR, extracts embedded text
hi_res      ML-based layout detection and table extraction
ocr_only    Full Tesseract OCR
auto        Chooses per page based on content
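
When the strategy is set from client code, it can help to validate the name before sending the request. The sketch below restates the four strategies listed above; choose_strategy is a hypothetical client-side heuristic for illustration only and is not part of cbunstruct or the unstructured library.

```python
# Strategy names and descriptions as listed in the table above.
STRATEGIES = {
    "fast": "no OCR, extracts embedded text",
    "hi_res": "ML-based layout detection, table extraction",
    "ocr_only": "full Tesseract OCR",
    "auto": "chooses per page based on content",
}


def validate_strategy(name):
    """Return the strategy name unchanged, or raise if it is not one of the four."""
    if name not in STRATEGIES:
        allowed = ", ".join(sorted(STRATEGIES))
        raise ValueError(f"unknown strategy {name!r}; expected one of: {allowed}")
    return name


def choose_strategy(has_embedded_text, needs_tables=False):
    """Hypothetical heuristic: pick a strategy from two facts about the document."""
    if needs_tables:
        return "hi_res"  # table extraction is listed under hi_res
    if not has_embedded_text:
        return "ocr_only"  # scanned pages have no text layer to read
    return "fast"  # embedded text can be extracted without OCR
```

If the document's characteristics are not known in advance, passing "auto" leaves the per-page choice to the service.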

Architecture

Core Package: prepline_general/api/

File                     Purpose
app.py                   FastAPI entry point, CORS, exception handlers
general.py               Main /general/v0/general endpoint
models/form_params.py    GeneralFormParams Pydantic model
filetypes.py             File type validation
utils.py                 SmartValueParser for form parameters

Key Dependencies

  • unstructured[all-docs] - Core document partitioning
  • unstructured-inference - ML models for hi_res strategy
  • FastAPI/uvicorn - Web framework
  • pypdf - PDF reading and splitting

ML Models

Models are downloaded automatically on first use to ~/.cache/huggingface/hub/:

Model                Purpose
YOLOX                Default hi_res layout detection (~25 MB ONNX)
Table Transformer    Table structure recognition
Detectron2 ONNX      Alternative layout model

Note: The Chipper model is only available via Unstructured's hosted API.
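
To check whether models have already been downloaded, for example before running hi_res requests on a fresh host, the cache directory can be inspected. This is a small sketch: list_cached_models is a hypothetical helper, and it assumes the Hugging Face Hub cache layout in which each model lives in a directory named models--<org>--<name>.

```python
from pathlib import Path

# Cache location named in the text above.
DEFAULT_CACHE = Path.home() / ".cache" / "huggingface" / "hub"


def list_cached_models(cache_dir=DEFAULT_CACHE):
    """Return the model repositories found in the cache as "org/name" strings."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []  # nothing has been downloaded yet
    names = []
    for entry in sorted(cache_dir.iterdir()):
        # Assumed layout: one directory per model, prefixed with "models--".
        if entry.is_dir() and entry.name.startswith("models--"):
            names.append(entry.name[len("models--"):].replace("--", "/"))
    return names
```

An empty result means the first request that needs a model will also pay the download cost.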

Deployment

Systemd Service

# Service management
sudo systemctl status cbunstruct
sudo systemctl restart cbunstruct
sudo journalctl -u cbunstruct -f

API Testing

# Run API test suite
python scripts/api_tests.py

# Skip slow hi_res tests
python scripts/api_tests.py --skip-hires

# Test against production
python scripts/api_tests.py --url https://unstruct.nominate.ai

Configuration

Variable                               Description
UNSTRUCTURED_API_KEY                   Enable API key validation
UNSTRUCTURED_PARALLEL_MODE_ENABLED     Enable PDF parallel processing
UNSTRUCTURED_PARALLEL_MODE_THREADS     Thread count (default: 3)
UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB    Min free memory (default: 2048)
MAX_LIFETIME_SECONDS                   Server auto-shutdown timer
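
The variables and defaults above can be summarized in code. This is an illustrative sketch rather than the service's actual configuration loader: the Settings class and load_settings function are hypothetical, and treating the string "true" as the enabling value for parallel mode is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:
    api_key: Optional[str]               # None means no API key validation
    parallel_mode_enabled: bool
    parallel_mode_threads: int           # documented default: 3
    memory_free_minimum_mb: int          # documented default: 2048
    max_lifetime_seconds: Optional[int]  # None means no auto-shutdown timer


def load_settings(env=None):
    """Read the documented variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    lifetime = env.get("MAX_LIFETIME_SECONDS")
    return Settings(
        api_key=env.get("UNSTRUCTURED_API_KEY") or None,
        # Assumption: the flag is enabled by the literal value "true".
        parallel_mode_enabled=env.get(
            "UNSTRUCTURED_PARALLEL_MODE_ENABLED", "false"
        ).strip().lower() == "true",
        parallel_mode_threads=int(env.get("UNSTRUCTURED_PARALLEL_MODE_THREADS", "3")),
        memory_free_minimum_mb=int(env.get("UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB", "2048")),
        max_lifetime_seconds=int(lifetime) if lifetime else None,
    )
```

Passing a plain dictionary instead of os.environ makes the defaults easy to check without touching the real environment.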

Supported Formats

  • PDF (with or without OCR)
  • Microsoft Office (DOCX, XLSX, PPTX)
  • Images (PNG, JPG, TIFF)
  • Email (EML, MSG)
  • HTML, Markdown, RST
  • Plain text
  • And more...