Foliant is a layout-aware PDF document-AI library for .NET — and it runs entirely on your machine. No Python sidecar, no cloud APIs, and no documents ever leaving the host.
It extracts structured content — Markdown, JSON, or typed objects — from PDFs the way commercial document-intelligence services do: layout detection, per-region OCR, table-structure recognition, and reading-order assembly, all via ONNX Runtime.
The gap it fills
Python has a rich ecosystem for document understanding (surya, marker, docling, unstructured). .NET has excellent text extraction (PdfPig, iText) but essentially nothing open-source for layout-aware understanding — region classification, reading-order inference, table-structure extraction.
The tempting shortcut is to hand whole pages to a general vision-language model. The problem: VLMs produce plausible-looking output that silently fabricates details — checkbox states, names, field values — which is disqualifying wherever the answers actually matter. Foliant instead takes the decomposition approach the commercial services use, with open models and one hard rule: never guess.
Five lines to structured output
using Foliant.Pipeline;
// Models download once into a local, SHA-256-verified cache, then run offline.
using var processor = await FoliantProcessor.CreateDefaultAsync();
var result = await processor.ProcessAsync(File.ReadAllBytes("document.pdf"));
Console.WriteLine(result.Markdown); // layout-aware Markdown
string json = result.ToJson(indented: true); // regions, tables, bounds, confidence
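The comment above mentions a SHA-256-verified model cache. As a rough illustration of what such a check involves, here is a minimal sketch using only the .NET base class library (.NET 7 or later). `MatchesSha256` is a hypothetical helper, not part of Foliant's API, and how Foliant actually verifies its cache may differ.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper (not a Foliant API): returns true when the file's
// SHA-256 digest equals the expected hex string, ignoring letter case.
static bool MatchesSha256(string path, string expectedHex)
{
    using FileStream stream = File.OpenRead(path);
    byte[] digest = SHA256.HashData(stream); // stream overload, .NET 7+
    return Convert.ToHexString(digest)
        .Equals(expectedHex, StringComparison.OrdinalIgnoreCase);
}

// Usage: write the three bytes "abc" to a temp file and compare against
// the well-known SHA-256 test vector for that input.
string tmp = Path.GetTempFileName();
File.WriteAllText(tmp, "abc");
bool ok = MatchesSha256(
    tmp, "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad");
Console.WriteLine(ok ? "digest matches" : "digest mismatch");
File.Delete(tmp);
```

A cache built this way would refuse to load any model file whose digest does not match the pinned value, which is what lets later runs stay offline without trusting the disk blindly.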
How a page flows through the pipeline
Every stage is an interface in Foliant.Core, so any backend can be swapped without forking:
PDF page
  -> render               (PDFium)
  -> layout detection     DocLayout-YOLO (what is where)
  -> text, per region     embedded layer (fast path) or PaddleOCR v5
  -> table structure      TableTransformer + ruling-line analysis
  -> reading order        XY-Cut++
  -> Markdown / JSON / typed DocumentResult
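To give a feel for the reading-order stage, the sketch below implements a simplified version of the classic recursive XY-cut that XY-Cut++ builds on. It is not Foliant's implementation: the `Box` alias, `XyCut`, and `SplitAtGaps` are hypothetical names, the sketch cuts at every whitespace gap rather than weighing gap sizes, and it assumes C# 12 for the tuple alias.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
// A region box, origin at the top-left of the page (C# 12 tuple alias).
using Box = (double X0, double Y0, double X1, double Y1);

// Recursively split the page at whitespace gaps: columns first, then
// horizontal bands, and read the resulting pieces in order.
static List<Box> XyCut(List<Box> boxes)
{
    if (boxes.Count <= 1) return boxes.ToList();

    // Vertical cuts: left-to-right columns separated by an empty strip.
    var groups = SplitAtGaps(boxes, b => b.X0, b => b.X1);
    // Otherwise horizontal cuts: top-to-bottom bands.
    if (groups.Count == 1) groups = SplitAtGaps(boxes, b => b.Y0, b => b.Y1);
    // No gap on either axis: fall back to a plain top-left sort.
    if (groups.Count == 1)
        return boxes.OrderBy(b => b.Y0).ThenBy(b => b.X0).ToList();

    return groups.SelectMany(g => XyCut(g)).ToList();
}

// Sort boxes along one axis and start a new group wherever a box begins
// past the farthest extent seen so far (an empty strip in the projection).
static List<List<Box>> SplitAtGaps(
    List<Box> boxes, Func<Box, double> lo, Func<Box, double> hi)
{
    var groups = new List<List<Box>>();
    double reach = double.NegativeInfinity;
    foreach (Box b in boxes.OrderBy(lo))
    {
        if (groups.Count == 0 || lo(b) > reach) groups.Add(new List<Box>());
        groups[^1].Add(b);
        reach = Math.Max(reach, hi(b));
    }
    return groups;
}

// Usage: a full-width title above two columns, supplied in scrambled order.
var page = new List<Box>
{
    (320, 100, 560, 350), // right column, first paragraph
    (40, 320, 280, 500),  // left column, second paragraph
    (40, 20, 560, 80),    // full-width title
    (40, 100, 280, 300),  // left column, first paragraph
    (320, 370, 560, 500), // right column, second paragraph
};
foreach (Box b in XyCut(page)) Console.WriteLine(b);
```

For this input the boxes come back as title, left column top to bottom, then right column top to bottom. Taking every gap is the main simplification: when paragraph breaks happen to line up across columns, a horizontal cut can win where a column cut should, which is the kind of case refined variants are designed to handle.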
Built so silent errors can’t hide
Foliant is designed for documents where a wrong answer is worse than no answer, so it treats verifiability as a feature:
- Lossless by construction. A per-page coverage invariant guarantees that every extracted line lands in the output or is explicitly reported as page furniture. Across a verification set of 2,303 documents (65,665 pages), text loss is zero.
- Self-scoring. Pages with an embedded text layer are graded against it — the PDF is its own answer key. In forced-OCR mode on the 474-page federal-RFP reference corpus: 99.7% average word recall, 100% of pages at or above 95%, and zero fabricated form values.
- Deterministic. Same input, same output, every run. No temperature, no sampling.
- Private by default. No network calls at processing time — cache the models once and run air-gapped.
Rigor means publishing where the method stops working. Dynamic XFA forms and optimizer-corrupted text layers are detected and flagged — never silently filled with placeholder text.
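The self-scoring idea above, grading OCR output against the embedded text layer, comes down to a word-recall measurement. The following is a minimal sketch of one way to compute it, assuming a simple multiset match over lowercase word tokens. `WordRecall` and `Words` are hypothetical names, and the tokenization and matching rules Foliant uses for its published figures may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Multiset word recall: the fraction of reference words that also occur
// in the candidate text, where each candidate word can be used only once.
static double WordRecall(string reference, string candidate)
{
    // Count how often each word occurs in the candidate (e.g. OCR output).
    var available = new Dictionary<string, int>();
    foreach (string w in Words(candidate))
        available[w] = available.GetValueOrDefault(w) + 1;

    int total = 0, found = 0;
    foreach (string w in Words(reference))
    {
        total++;
        if (available.GetValueOrDefault(w) > 0)
        {
            available[w]--; // consume one occurrence
            found++;
        }
    }
    // An empty reference has nothing to miss, so treat it as full recall.
    return total == 0 ? 1.0 : (double)found / total;
}

// Lowercase the text and pull out runs of word characters.
static IEnumerable<string> Words(string text) =>
    Regex.Matches(text.ToLowerInvariant(), @"\w+").Select(m => m.Value);

// Usage: the embedded layer is the answer key, the OCR text is graded.
string embedded = "Offers are due by noon on Friday";
string ocr = "Offers are due by n00n on Friday";
Console.WriteLine($"word recall: {WordRecall(embedded, ocr):P1}");
```

Here six of the seven reference words are recovered, so the misread "n00n" shows up as a recall below 100% rather than passing unnoticed. Recall alone does not catch invented text, which is why a fabrication check is a separate measurement.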
Try it
dotnet add package Foliant.Pipeline
Or run the console sample against any PDF (requires the .NET 10 SDK):
git clone https://github.com/Nerttiyana-Technologies/Foliant.git
cd Foliant
dotnet run -c Release --project samples/Foliant.Sample.Console -- path/to/document.pdf
# -> sample-out/document.md + sample-out/document.json
Born-digital pages take the fast path at about 0.4 s/page on an Apple-silicon CPU. Full-OCR pages run at around 4 s/page, with throughput scaling linearly with core count, and no GPU is required.
Foliant is Apache-2.0, and its 1.0 public API is frozen under Semantic Versioning. Explore the code and the full 18-corpus evidence ledger on GitHub.
