Documents API

Turn any document into AI-ready content

Extract text, tables, and images from PDFs, Office files, and images through one API. Advanced OCR turns scanned documents into clean Markdown, JSON, or HTML, ready for RAG pipelines, digitization, and invoice processing.

Create an account for a 14-day free trial, or read the docs.

Why GreenPT

Document AI without the data tradeoff

Most document APIs ask you to send sensitive files to infrastructure you do not control. GreenPT processes your documents on private, EU-hosted, renewable-powered infrastructure, and never trains on your data.

EU-hosted and GDPR-aligned, so sensitive documents stay in Europe.
Private by design: your files are never used to train models.
Wide format coverage in a single API, from PDFs to spreadsheets to images.
Built-in OCR reads scans and image-based documents.
Fast and accurate table modes, so you can trade speed against precision.
Structured DoclingDocument JSON that drops straight into RAG pipelines.

Capabilities

One API for every document

Wide format support

Process PDFs, Word, PowerPoint, Excel, CSV, HTML, and common image types through a single endpoint, with no per-format plumbing.

OCR for scans and images

Advanced OCR reads scanned documents and images with embedded text, turning pixels into clean, searchable content.

Table extraction

Detect and reconstruct table structure from reports and spreadsheets, with a fast mode for speed or an accurate mode for precision.

Multiple output formats

Get results as Markdown, JSON, HTML, HTML split by page, plain text, or DocTags. Pick one or request several at once.

Image extraction

Pull embedded images out alongside the text so figures and diagrams are not lost in conversion.

Structured JSON for RAG

Receive a structured DoclingDocument schema with texts, tables, pictures, and pages, ready to chunk and embed for retrieval.
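
As a rough illustration of the chunk-and-embed step, the Python sketch below groups text items into one chunk per page. The sample dictionary is hand-made, not real API output, and the nested field names (`text`, `prov`, `page_no`) are assumptions modeled on the open-source Docling schema, so check them against an actual response before relying on them.

```python
from collections import defaultdict

# Hand-made sample, NOT real API output. The nested field names
# ("text", "prov", "page_no") are assumptions modeled on the
# open-source DoclingDocument schema.
sample_doc = {
    "texts": [
        {"text": "Invoice 2024-001", "prov": [{"page_no": 1}]},
        {"text": "Total due: 120 EUR", "prov": [{"page_no": 1}]},
        {"text": "Payment terms: 30 days", "prov": [{"page_no": 2}]},
    ],
    "tables": [],
    "pictures": [],
    "pages": {"1": {}, "2": {}},
}


def chunk_by_page(doc: dict) -> list[dict]:
    """Join the text items of each page into one chunk per page."""
    grouped = defaultdict(list)
    for item in doc.get("texts", []):
        text = item.get("text", "").strip()
        if not text:
            continue  # skip empty items
        prov = item.get("prov") or [{}]
        page = prov[0].get("page_no", 0)  # 0 means the page is unknown
        grouped[page].append(text)
    return [
        {"page": page, "text": "\n".join(parts)}
        for page, parts in sorted(grouped.items())
    ]


chunks = chunk_by_page(sample_doc)
for chunk in chunks:
    print(chunk["page"], repr(chunk["text"]))
```

Page-level chunks are only one possible granularity. Splitting by heading or by a token budget may suit a given retrieval setup better.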

Formats

Many inputs in, clean structure out

Send the documents you already have. Get back the format your application needs, whether that is Markdown for an LLM or structured JSON for a pipeline.

Input formats

Documents

.pdf
.docx
.pptx
.xlsx
.csv
.md
.html

Images

.png
.jpg
.tiff
.bmp
.webp

Special

.vtt
.xml
.json

Output formats

Markdown
JSON
HTML
HTML by page
Plain text
DocTags

Use cases

Built for document-heavy workflows

RAG pipelines

Convert source documents into clean, structured text so your retrieval and embedding steps start from quality input.

Invoice processing

Automate extraction of data from invoices and receipts, including the tables that hold line items and totals.

Document digitization

Turn scanned archives and legacy PDFs into searchable, machine-readable text for indexing and reuse.

Data extraction

Lift tables out of financial reports and spreadsheets into structured formats your systems can consume.

Academic research

Process research papers with their formulas, citations, and figures intact for analysis or summarization.

Accessibility

Make image-based documents accessible by extracting their text, so screen readers and search can reach the content.

Documents API, in short

Which file formats can I send?

PDFs, Microsoft Office files (Word, PowerPoint, Excel), CSV, Markdown, HTML, and common image types including PNG, JPEG, TIFF, BMP, and WebP. Special formats like VTT, XML, and JSON are supported too.

What output formats do I get back?

Markdown (the default), structured JSON in the DoclingDocument schema, HTML, HTML split by page, plain text, or DocTags. You can request one format or several in a single call.

Can it read scanned documents?

Yes. Built-in OCR converts scanned documents and images with embedded text into clean, structured output. You can also force OCR to replace existing text when a source PDF has unreliable text layers.

Does it extract tables?

Yes. Table structure detection is on by default, with a fast mode for speed and an accurate mode for complex layouts. Embedded images can be extracted alongside the text.
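
The choices described in the answers above (one or several output formats per call, forced OCR, a fast or accurate table mode) could be gathered into a request payload along these lines. This is only a sketch: every key and value name below (`output_formats`, `table_mode`, `force_ocr`, `html_by_page`, and so on) is hypothetical and chosen for illustration, and the defaults other than Markdown are arbitrary. The real parameter names are in the API docs.

```python
# All option names and values here are HYPOTHETICAL placeholders,
# not the documented parameters of the Documents API.
OUTPUT_FORMATS = {"markdown", "json", "html", "html_by_page", "text", "doctags"}
TABLE_MODES = {"fast", "accurate"}


def build_options(formats=("markdown",), table_mode="accurate", force_ocr=False):
    """Validate the chosen options and return them as a plain dict."""
    unknown = set(formats) - OUTPUT_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")
    if table_mode not in TABLE_MODES:
        raise ValueError(f"table_mode must be one of {sorted(TABLE_MODES)}")
    return {
        "output_formats": list(formats),  # one format or several per call
        "table_mode": table_mode,         # speed versus precision
        "force_ocr": bool(force_ocr),     # re-read an unreliable text layer
    }


# Example: Markdown for an LLM plus JSON for a pipeline, from a scanned PDF.
options = build_options(formats=("markdown", "json"), force_ocr=True)
print(options)
```

Validating locally before sending a request keeps typos in option values from turning into a failed or silently defaulted conversion.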

Is my data private, and where is it processed?

The Documents API runs on GreenPT’s private, EU-hosted, renewable-powered infrastructure. Your files are processed to fulfil your request and are not used to train models.


Start building

Make every document AI-ready.

Send your first file in minutes. Convert PDFs, scans, and Office documents into clean, structured content on private, EU-hosted infrastructure.

Read the API docs for cURL, JavaScript, and Python examples, or talk to us.

100% Renewable
EU Hosted
GDPR-aligned
