Documents API

Turn any document into AI-ready content

Extract text, tables, and images from PDFs, Office files, and images through one API. Advanced OCR turns scanned documents into clean Markdown, JSON, or HTML, ready for RAG pipelines, digitization, and invoice processing.

Why GreenPT

Document AI without the data tradeoff

Most document APIs ask you to send sensitive files to infrastructure you do not control. GreenPT processes your documents on private, EU-hosted, renewable-powered infrastructure, and never trains on your data.

  • EU-hosted and GDPR-aligned, so sensitive documents stay in Europe.

  • Private by design: your files are never used to train models.

  • Wide format coverage in a single API, from PDFs to spreadsheets to images.

  • Built-in OCR reads scans and image-based documents.

  • Fast and accurate table modes, so you can trade speed against precision.

  • Structured DoclingDocument JSON that drops straight into RAG pipelines.

Capabilities

One API for every document

  • Wide format support

    Process PDFs, Word, PowerPoint, Excel, CSV, HTML, and common image types through a single endpoint, with no per-format plumbing.

  • OCR for scans and images

    Advanced OCR reads scanned documents and images with embedded text, turning pixels into clean, searchable content.

  • Table extraction

    Detect and reconstruct table structure from reports and spreadsheets, with a fast mode for speed or an accurate mode for precision.

  • Multiple output formats

    Get results as Markdown, JSON, HTML, HTML split by page, plain text, or DocTags. Pick one or request several at once.

  • Image extraction

    Pull embedded images out alongside the text so figures and diagrams are not lost in conversion.

  • Structured JSON for RAG

    Receive structured JSON in the DoclingDocument schema, with texts, tables, pictures, and pages, ready to chunk and embed for retrieval.
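
As a rough illustration of the "chunk and embed" step, the sketch below groups text items from a DoclingDocument-style JSON object into size-limited chunks. It assumes the response has been parsed into a dict whose "texts" entries each carry a "text" string. The function name `chunk_texts`, the `max_chars` budget, and the sample data are made up for this example, so check the field names against the JSON your own calls return.

```python
from typing import Any


def chunk_texts(doc: dict[str, Any], max_chars: int = 500) -> list[str]:
    """Group text items from a DoclingDocument-style dict into chunks.

    Assumption: each entry in doc["texts"] has a "text" string.
    """
    chunks: list[str] = []
    current = ""
    for item in doc.get("texts", []):
        text = (item.get("text") or "").strip()
        if not text:
            continue  # skip empty items
        # Start a new chunk when adding this item would exceed the budget.
        if current and len(current) + 1 + len(text) > max_chars:
            chunks.append(current)
            current = text
        else:
            current = f"{current}\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample shaped like the schema described above.
sample = {
    "texts": [
        {"label": "title", "text": "Quarterly report"},
        {"label": "paragraph", "text": "Revenue grew in the third quarter."},
        {"label": "paragraph", "text": ""},
        {"label": "paragraph", "text": "Costs stayed flat."},
    ],
    "tables": [],
    "pictures": [],
    "pages": {},
}

for chunk in chunk_texts(sample, max_chars=60):
    print(repr(chunk))
```

A single item longer than `max_chars` becomes its own oversized chunk in this sketch. A production pipeline would likely split on sentences or tokens instead of characters.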

Formats

Many inputs in, clean structure out

Send the documents you already have. Get back the format your application needs, whether that is Markdown for an LLM or structured JSON for a pipeline.

Input formats

Documents

  • .pdf
  • .docx
  • .pptx
  • .xlsx
  • .csv
  • .md
  • .html

Images

  • .png
  • .jpg
  • .tiff
  • .bmp
  • .webp

Special

  • .vtt
  • .xml
  • .json
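
A client can check a file against the lists above before uploading it. The following is a minimal sketch that uses only the extensions shown here. The helper name `input_category` is invented for the example, and it treats variants that are not listed (such as .jpeg or .tif) as unlisted, since this page does not say whether the API accepts them.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the input format lists above.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".md", ".html"},
    "images": {".png", ".jpg", ".tiff", ".bmp", ".webp"},
    "special": {".vtt", ".xml", ".json"},
}


def input_category(filename: str) -> Optional[str]:
    """Return the list a file's extension belongs to, or None if unlisted."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


for name in ["report.PDF", "scan.tiff", "captions.vtt", "archive.zip"]:
    print(name, "->", input_category(name))
```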

Output formats

  • Markdown
  • JSON
  • HTML
  • HTML by page
  • Plain text
  • DocTags

Use cases

Built for document-heavy workflows

  • RAG pipelines

    Convert source documents into clean, structured text so your retrieval and embedding steps start from quality input.

  • Invoice processing

    Automate extraction of data from invoices and receipts, including the tables that hold line items and totals.

  • Document digitization

    Turn scanned archives and legacy PDFs into searchable, machine-readable text for indexing and reuse.

  • Data extraction

    Lift tables out of financial reports and spreadsheets into structured formats your systems can consume.

  • Academic research

    Process research papers with their formulas, citations, and figures intact for analysis or summarization.

  • Accessibility

    Make image-based documents accessible by extracting their text, so screen readers and search can reach the content.

Documents API, in short

Which file formats can I send?

PDFs, Microsoft Office files (Word, PowerPoint, Excel), CSV, Markdown, HTML, and common image types including PNG, JPEG, TIFF, BMP, and WebP. Special formats like VTT, XML, and JSON are supported too.

What output formats do I get back?

Markdown (the default), structured JSON in the DoclingDocument schema, HTML, HTML split by page, plain text, or DocTags. You can request one format or several in a single call.

Can it read scanned documents?

Yes. Built-in OCR converts scanned documents and images with embedded text into clean, structured output. You can also force OCR to replace the existing text when a source PDF has an unreliable text layer.

Does it extract tables?

Yes. Table structure detection is on by default, with a fast mode for speed and an accurate mode for complex layouts. Embedded images can be extracted alongside the text.
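
The choices described in the answers above (several output formats per call, forced OCR, and a fast or accurate table mode) could be gathered into a single options object before a request is sent. The sketch below shows one way to validate them on the client. Every name in it (`build_options`, `output_formats`, `force_ocr`, `table_mode`, and the format labels) is a placeholder invented for this example, not the documented GreenPT parameters, so take the real field names from the API docs.

```python
from typing import Any, Optional

# Placeholder labels for this sketch; not the documented API values.
OUTPUT_FORMATS = {"markdown", "json", "html", "html_by_page", "text", "doctags"}
TABLE_MODES = {"fast", "accurate"}


def build_options(
    output_formats: Optional[list[str]] = None,
    force_ocr: bool = False,
    table_mode: Optional[str] = None,
) -> dict[str, Any]:
    """Build a hypothetical options dict, omitting anything left unset
    so that the service's own defaults (for example Markdown output) apply."""
    options: dict[str, Any] = {}
    if output_formats:
        unknown = sorted(set(output_formats) - OUTPUT_FORMATS)
        if unknown:
            raise ValueError(f"unsupported output formats: {unknown}")
        # Drop duplicates while keeping the caller's order.
        options["output_formats"] = list(dict.fromkeys(output_formats))
    if force_ocr:
        options["force_ocr"] = True
    if table_mode is not None:
        if table_mode not in TABLE_MODES:
            raise ValueError(f"table_mode must be one of {sorted(TABLE_MODES)}")
        options["table_mode"] = table_mode
    return options


print(build_options(["markdown", "json"], force_ocr=True, table_mode="accurate"))
```

Validating on the client catches a mistyped format or mode before a file is uploaded.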

Is my data private, and where is it processed?

The Documents API runs on GreenPT’s private, EU-hosted, renewable-powered infrastructure. Your files are processed to fulfil your request and are not used to train models.

Read the API docs →

Start building

Make every document AI-ready.

Send your first file in minutes. Convert PDFs, scans, and Office documents into clean, structured content on private, EU-hosted infrastructure.

  • 100% Renewable
  • EU Hosted
  • GDPR-aligned