Foliant: Layout-aware PDF document AI for .NET

Foliant is a layout-aware PDF document-AI library for .NET — and it runs entirely on your machine. No Python sidecar, no cloud APIs, and no documents ever leaving the host.

It extracts structured content — Markdown, JSON, or typed objects — from PDFs the way commercial document-intelligence services do: layout detection, per-region OCR, table-structure recognition, and reading-order assembly, all via ONNX Runtime.

The gap it fills

Python has a rich ecosystem for document understanding (surya, marker, docling, unstructured). .NET has excellent text extraction (PdfPig, iText) but essentially nothing open-source for layout-aware understanding — region classification, reading-order inference, table-structure extraction.

The tempting shortcut is to hand whole pages to a general vision-language model. The problem: VLMs produce plausible-looking output that silently fabricates details — checkbox states, names, field values — which is disqualifying wherever the answers actually matter. Foliant instead takes the decomposition approach the commercial services use, with open models and one hard rule: never guess.

Five lines to structured output

using Foliant.Pipeline;

// Models download once into a local, SHA-256-verified cache, then run offline.
using var processor = await FoliantProcessor.CreateDefaultAsync();
var result = await processor.ProcessAsync(File.ReadAllBytes("document.pdf"));

Console.WriteLine(result.Markdown);          // layout-aware Markdown
string json = result.ToJson(indented: true); // regions, tables, bounds, confidence
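
The comment about a SHA-256-verified cache can be made concrete with a short sketch. This is not Foliant's implementation: the ModelCache class, the IsIntact method, and the file name are hypothetical names chosen for illustration. It only shows the general idea of checking a cached file against a pinned digest before trusting it, using the standard System.Security.Cryptography API.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Demo: write a tiny stand-in "model" file and verify it against a known digest.
// The pinned value is the standard SHA-256 test vector for the ASCII bytes "abc".
string path = Path.Combine(Path.GetTempPath(), "demo-model.bin");
File.WriteAllBytes(path, Encoding.ASCII.GetBytes("abc"));

const string pinned = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";

Console.WriteLine(ModelCache.IsIntact(path, pinned));              // True
Console.WriteLine(ModelCache.IsIntact(path, new string('0', 64))); // False

// Hypothetical helper, not part of Foliant's public API.
public static class ModelCache
{
    // Returns true only if the file exists and its SHA-256 digest matches the pinned hex string.
    public static bool IsIntact(string path, string expectedSha256Hex)
    {
        if (!File.Exists(path)) return false;

        // Hash from a stream so large model files are not loaded into memory at once.
        using var stream = File.OpenRead(path);
        byte[] digest = SHA256.HashData(stream); // stream overload requires .NET 7 or later

        string actualHex = Convert.ToHexString(digest); // upper-case hex
        return string.Equals(actualHex, expectedSha256Hex, StringComparison.OrdinalIgnoreCase);
    }
}
```

A cache built this way would re-download or refuse to load any file whose digest does not match, which is what makes offline reuse safe.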

How a page flows through the pipeline

Every stage is an interface in Foliant.Core, so any backend can be swapped without forking:

PDF page
  -> render (PDFium)
  -> layout detection      DocLayout-YOLO      (what is where)
  -> text, per region      embedded layer (fast path) or PaddleOCR v5
  -> table structure       TableTransformer + ruling-line analysis
  -> reading order         XY-Cut++
  -> Markdown / JSON / typed DocumentResult
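
To give a feel for the reading-order stage, here is a minimal sketch of the classic recursive XY-cut idea: split the page at whitespace gaps, first into horizontal bands, then into columns, and recurse. It is deliberately the plain textbook variant, not XY-Cut++ and not Foliant's code. The Region record and XyCut class are hypothetical names, and coordinates are assumed to have their origin at the top-left of the page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Demo: a full-width title above two columns whose paragraph gaps do not line up.
var regions = new List<Region>
{
    new("right-2", 320, 330, 580, 700),
    new("left-1",   20, 100, 280, 400),
    new("title",    20,  20, 580,  60),
    new("right-1", 320, 100, 580, 300),
    new("left-2",   20, 430, 280, 700),
};

Console.WriteLine(string.Join(" -> ", XyCut.Order(regions).Select(r => r.Name)));
// title -> left-1 -> left-2 -> right-1 -> right-2

// Hypothetical region type: a named bounding box.
public record Region(string Name, double Left, double Top, double Right, double Bottom);

public static class XyCut
{
    public static List<Region> Order(IReadOnlyList<Region> regions)
    {
        if (regions.Count <= 1) return regions.ToList();

        // Prefer horizontal cuts (bands, top to bottom); otherwise try vertical cuts (columns, left to right).
        var groups = Split(regions, r => r.Top, r => r.Bottom);
        if (groups.Count == 1) groups = Split(regions, r => r.Left, r => r.Right);

        // No whitespace gap on either axis: fall back to a top-then-left sort.
        if (groups.Count == 1)
            return regions.OrderBy(r => r.Top).ThenBy(r => r.Left).ToList();

        // Each group is strictly smaller than the input, so the recursion terminates.
        return groups.SelectMany(g => Order(g)).ToList();
    }

    // Sweeps along one axis and starts a new group wherever a box begins past the
    // furthest extent seen so far, i.e. wherever a gap crosses the whole group.
    private static List<List<Region>> Split(
        IReadOnlyList<Region> regions, Func<Region, double> start, Func<Region, double> end)
    {
        var groups = new List<List<Region>>();
        double reach = double.NegativeInfinity;
        foreach (var r in regions.OrderBy(start))
        {
            if (groups.Count == 0 || start(r) > reach) groups.Add(new List<Region>());
            groups[^1].Add(r);
            reach = Math.Max(reach, end(r));
        }
        return groups;
    }
}
```

The plain algorithm interleaves columns when paragraph gaps happen to align across them, and it cannot handle L-shaped layouts. Refinements such as XY-Cut++ exist to address weaknesses of this kind.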

Built so silent errors can’t hide

Foliant is designed for documents where a wrong answer is worse than no answer, so it treats verifiability as a feature:

  • Lossless by construction. A per-page coverage invariant guarantees every extracted line lands in the output (or is explicitly reported as page furniture). Across 2,303 documents / 65,665 pages of verification, text loss is zero.
  • Self-scoring. Pages with an embedded text layer are graded against it — the PDF is its own answer key. In forced-OCR mode on the 474-page federal-RFP reference corpus: 99.7% average word recall, 100% of pages at or above 95%, and zero fabricated form values.
  • Deterministic. Same input, same output, every run. No temperature, no sampling.
  • Private by default. No network calls at processing time — cache the models once and run air-gapped.
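
The self-scoring idea can be illustrated with a small word-recall calculation: treat the embedded text layer as the reference and count how many of its words the OCR output recovered. This is a sketch under stated assumptions, not Foliant's scoring code. The Scoring class, the tokenization rule (lower-cased runs of letters and digits), and the sample strings are all illustrative choices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Demo: the OCR text misreads one digit ("3" became "8").
string embedded = "Offers are due by 5 PM on March 3. Offers must be signed.";
string ocr      = "Offers are due by 5 PM on March 8. Offers must be signed.";

double recall = Scoring.WordRecall(embedded, ocr);
Console.WriteLine($"word recall: {recall:0.000}");

// Hypothetical helper, not part of Foliant's public API.
public static class Scoring
{
    // A "word" here is any run of Unicode letters or digits.
    private static readonly Regex Word = new(@"[\p{L}\p{N}]+", RegexOptions.Compiled);

    // Builds a bag of words: each distinct lower-cased word mapped to its count.
    private static Dictionary<string, int> Bag(string text) =>
        Word.Matches(text.ToLowerInvariant())
            .Select(m => m.Value)
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());

    // Fraction of reference words (counted with multiplicity) that also occur in the candidate.
    public static double WordRecall(string reference, string candidate)
    {
        var want = Bag(reference);
        var have = Bag(candidate);

        int total = want.Values.Sum();
        if (total == 0) return 1.0; // nothing to recover, so nothing was lost

        int matched = want.Sum(kv =>
            have.TryGetValue(kv.Key, out int n) ? Math.Min(kv.Value, n) : 0);
        return (double)matched / total;
    }
}
```

In the demo, 12 of the 13 reference words are recovered, so the score is 12/13. Recall only measures what was lost; a fabricated word such as the misread "8" would need a separate check in the other direction, from the OCR output back to the reference.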

Rigor means publishing where the method stops working. Dynamic XFA forms and optimizer-corrupted text layers are detected and flagged — never silently filled with placeholder text.

Try it

dotnet add package Foliant.Pipeline

Or run the console sample against any PDF (requires the .NET 10 SDK):

git clone https://github.com/Nerttiyana-Technologies/Foliant.git
cd Foliant
dotnet run -c Release --project samples/Foliant.Sample.Console -- path/to/document.pdf
# -> sample-out/document.md + sample-out/document.json

Born-digital pages take the fast path at about 0.4 s/page on an Apple-silicon CPU; full-OCR pages run around 4 s/page, with throughput scaling linearly with core count. No GPU is required.

Foliant is Apache-2.0, and its 1.0 public API is frozen under Semantic Versioning. Explore the code and the full 18-corpus evidence ledger on GitHub.