Type: Software
Status: Alpha
Tech Stack: Python, FastAPI, SQLAlchemy (async), PostgreSQL 16, Redis 7, Celery, Docling (OCR), Ollama/vLLM/OpenAI, Alembic, JWT, React, Docker Compose
Problem Statement
Organizations process large volumes of heterogeneous documents every day (invoices, contracts, forms, reports), and staff must read, classify, and enter them into downstream systems by hand. OCR alone is not enough: the extracted text must also be semantically understood, validated, and stored in structured form. Existing DMS solutions do not offer integrated AI classification with multiple interchangeable AI providers and type-specific extraction pipelines.
Description
A system for automated processing and structuring of diverse documents. Documents are digitized via OCR (Docling microservice), classified by AI analysis, and routed through a specialized extraction pipeline for their document type, with validation, plausibility checks, and structured storage. The system supports multiple AI providers (local Ollama, vLLM, OpenAI) with configurable switching, JWT authentication, role-based access control, Celery-based background processing, XML templates, MCP server integration, and an admin API. It also provides GDPR-relevant audit logging and Docker-based network isolation (backend/public/admin).
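The classify-then-dispatch flow described above can be illustrated with a minimal sketch. It is not taken from the repository: all names (Classifier, KeywordClassifier, PROVIDERS, PIPELINES, extract_invoice, process_document) are hypothetical, the classifier is an offline keyword stand-in for a real AI provider, and the OCR step is assumed to have already produced plain text.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Classifier(Protocol):
    """Interface every (hypothetical) AI provider adapter would implement."""

    def classify(self, text: str) -> str: ...


class KeywordClassifier:
    """Offline stand-in for a real provider such as Ollama, vLLM, or OpenAI."""

    def classify(self, text: str) -> str:
        lowered = text.lower()
        if "invoice" in lowered:
            return "invoice"
        if "contract" in lowered:
            return "contract"
        return "unknown"


# Provider registry: switching providers means changing one configuration key.
PROVIDERS: dict[str, Callable[[], Classifier]] = {"keyword": KeywordClassifier}


@dataclass
class ExtractionResult:
    doc_type: str
    fields: dict[str, str] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)


def extract_invoice(text: str) -> ExtractionResult:
    """Type-specific pipeline: extract the total, then check plausibility."""
    result = ExtractionResult("invoice")
    match = re.search(r"total[:\s]+(\d+(?:\.\d{2})?)", text, re.IGNORECASE)
    if match is None:
        result.errors.append("missing total")
    elif float(match.group(1)) <= 0:
        result.errors.append("implausible total")
    else:
        result.fields["total"] = match.group(1)
    return result


# Pipeline registry: one extraction function per document type (assumed layout).
PIPELINES: dict[str, Callable[[str], ExtractionResult]] = {"invoice": extract_invoice}


def process_document(ocr_text: str, provider: str = "keyword") -> ExtractionResult:
    """Classify OCR text, then dispatch to the pipeline for that type."""
    doc_type = PROVIDERS[provider]().classify(ocr_text)
    pipeline = PIPELINES.get(doc_type)
    if pipeline is None:
        return ExtractionResult(doc_type, errors=[f"no pipeline for type '{doc_type}'"])
    return pipeline(ocr_text)


if __name__ == "__main__":
    print(process_document("Invoice 42\nTotal: 119.00"))
```

Keeping providers and pipelines in separate registries means a new AI backend or a new document type can be added without touching the dispatch logic. In the real system this step would presumably run inside a Celery task and persist results through SQLAlchemy, but that wiring is not shown here.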
Use Case
Scan documents and have AI automatically classify them and extract the important data.
Link: https://github.com/rawk7000/DocStar (private repo)