Document Processing Service (cbunstruct)
Status: Active | URL: https://unstruct.nominate.ai | Port: 32210
Overview
cbunstruct is a fork of the Unstructured API, deployed for Nominate-AI. It is a FastAPI-based REST API that wraps the unstructured library for document partitioning: it accepts a range of document formats (PDF, DOCX, images, emails, etc.) and extracts structured elements from them.
Quick Start
# Install all dependencies
make install
# Run the API locally (port 8000)
make run-web-app
# Run tests
make test
# Lint and format
make check
make tidy
API Usage
Main Endpoint
curl -X POST https://unstruct.nominate.ai/general/v0/general \
-H "Content-Type: multipart/form-data" \
-F "files=@document.pdf" \
-F "strategy=fast"
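The same request can be sketched in Python using only the standard library. This is a minimal sketch, not the project's own client: `build_multipart` and `partition` are hypothetical helper names, and it assumes the fork keeps the upstream behaviour of returning a JSON list of elements.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

API_URL = "https://unstruct.nominate.ai/general/v0/general"


def build_multipart(fields, filename, content):
    """Encode form fields plus one upload under the `files` key.

    Returns (body, content_type_header). Hypothetical helper for illustration.
    """
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="files"; filename="{filename}"'.encode(),
        f"Content-Type: {file_type}".encode(),
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def partition(path, strategy="fast", url=API_URL):
    """POST one file to the endpoint and return the decoded JSON response."""
    file_path = Path(path)
    body, content_type = build_multipart(
        {"strategy": strategy}, file_path.name, file_path.read_bytes()
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body is JSON, as in upstream Unstructured API.
        return json.load(response)
```

Calling `partition("document.pdf")` would then mirror the curl command above. If API key validation is enabled, the request would also need whatever authentication header the deployment expects, which this sketch does not set.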
Processing Strategies
| Strategy | Description |
|---|---|
| fast | Default, no OCR, extracts embedded text |
| hi_res | ML-based layout detection, table extraction |
| ocr_only | Full Tesseract OCR |
| auto | Chooses per page based on content |
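As a rough illustration of how a caller might choose between the strategies listed above, the sketch below maps what is known about a document to a strategy value. `Strategy` and `choose_strategy` are hypothetical names, and the decision rules are one reasonable reading of the descriptions, not logic taken from the service.

```python
from enum import Enum
from typing import Optional


class Strategy(str, Enum):
    FAST = "fast"
    HI_RES = "hi_res"
    OCR_ONLY = "ocr_only"
    AUTO = "auto"


def choose_strategy(has_text_layer: Optional[bool], needs_tables: bool = False) -> Strategy:
    """Pick a strategy from caller-side knowledge (hypothetical helper)."""
    if needs_tables:
        # Table extraction is listed only under hi_res.
        return Strategy.HI_RES
    if has_text_layer is None:
        # Content unknown: let the service decide.
        return Strategy.AUTO
    # Embedded text can be read without OCR; scanned pages cannot.
    return Strategy.FAST if has_text_layer else Strategy.OCR_ONLY
```

The resulting `.value` is the string to send in the `strategy` form field.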
Architecture
Core Package: prepline_general/api/
| File | Purpose |
|---|---|
| app.py | FastAPI entry point, CORS, exception handlers |
| general.py | Main /general/v0/general endpoint |
| models/form_params.py | GeneralFormParams Pydantic model |
| filetypes.py | File type validation |
| utils.py | SmartValueParser for form parameters |
Key Dependencies
- unstructured[all-docs] - Core document partitioning
- unstructured-inference - ML models for hi_res strategy
- FastAPI/uvicorn - Web framework
- pypdf - PDF reading and splitting
ML Models
Models are downloaded automatically on first use to ~/.cache/huggingface/hub/:
| Model | Purpose |
|---|---|
| YOLOX | Default hi_res layout detection (~25MB ONNX) |
| Table Transformer | Table structure recognition |
| Detectron2 ONNX | Alternative layout model |
Note: The Chipper model is only available via Unstructured's hosted API.
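To see which models have already been downloaded, the cache directory can be inspected directly. The sketch below is an illustration only: `cached_repos` is a hypothetical helper, and it relies on the Hugging Face hub convention of one sub-directory per repository, with symlinks pointing at the stored blobs.

```python
from pathlib import Path

HF_CACHE = Path.home() / ".cache" / "huggingface" / "hub"


def cached_repos(cache_dir: Path = HF_CACHE) -> dict:
    """Map each cached repository directory name to its size in bytes."""
    if not cache_dir.is_dir():
        return {}
    sizes = {}
    for repo in sorted(cache_dir.iterdir()):
        if not repo.is_dir():
            continue
        # Skip symlinks so blobs referenced from snapshots are counted once.
        sizes[repo.name] = sum(
            f.stat().st_size
            for f in repo.rglob("*")
            if f.is_file() and not f.is_symlink()
        )
    return sizes
```

An empty result on a fresh host is expected until the first request that needs a model has been served.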
Deployment
Systemd Service
# Service management
sudo systemctl status cbunstruct
sudo systemctl restart cbunstruct
sudo journalctl -u cbunstruct -f
API Testing
# Run API test suite
python scripts/api_tests.py
# Skip slow hi_res tests
python scripts/api_tests.py --skip-hires
# Test against production
python scripts/api_tests.py --url https://unstruct.nominate.ai
Configuration
| Variable | Description |
|---|---|
| UNSTRUCTURED_API_KEY | Enable API key validation |
| UNSTRUCTURED_PARALLEL_MODE_ENABLED | Enable PDF parallel processing |
| UNSTRUCTURED_PARALLEL_MODE_THREADS | Thread count (default: 3) |
| UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB | Min free memory (default: 2048) |
| MAX_LIFETIME_SECONDS | Server auto-shutdown timer |
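The variables above could be read into a typed settings object along the following lines. This is a sketch, not the service's actual loader: `Settings` and `load_settings` are hypothetical names, the defaults 3 and 2048 come from the table, and treating the literal string "true" as the enabling value for parallel mode is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: Optional[str]
    parallel_mode: bool
    parallel_threads: int
    memory_free_minimum_mb: int
    max_lifetime_seconds: Optional[int]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    lifetime = env.get("MAX_LIFETIME_SECONDS")
    return Settings(
        # An unset or empty key means key validation stays off.
        api_key=env.get("UNSTRUCTURED_API_KEY") or None,
        # Assumption: the flag is switched on by the string "true".
        parallel_mode=env.get("UNSTRUCTURED_PARALLEL_MODE_ENABLED", "false").lower() == "true",
        parallel_threads=int(env.get("UNSTRUCTURED_PARALLEL_MODE_THREADS", "3")),
        memory_free_minimum_mb=int(env.get("UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB", "2048")),
        # No timer is configured unless the variable is set.
        max_lifetime_seconds=int(lifetime) if lifetime else None,
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.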
Supported Formats
- PDF (with or without OCR)
- Microsoft Office (DOCX, XLSX, PPTX)
- Images (PNG, JPG, TIFF)
- Email (EML, MSG)
- HTML, Markdown, RST
- Plain text
- And more...
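A client might pre-check file names against the formats listed above before uploading. The sketch below is a hypothetical client-side helper, unrelated to the service's own filetypes.py. The extension set covers only the formats named in the list, and the mapping from format names to extensions (for example `.jpeg`, `.md`, `.txt`) is an assumption.

```python
from pathlib import Path

# Only the formats named in the list above; the service accepts more.
LISTED_EXTENSIONS = {
    ".pdf",
    ".docx", ".xlsx", ".pptx",
    ".png", ".jpg", ".jpeg", ".tiff",
    ".eml", ".msg",
    ".html", ".md", ".rst",
    ".txt",
}


def is_listed_format(filename: str) -> bool:
    """Return True when the file extension matches a listed format."""
    return Path(filename).suffix.lower() in LISTED_EXTENSIONS
```

A `False` result does not mean the service will reject the file, only that the format is not among those named here.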