PDF Text Extractor - URL to clean text, per page

Give it public PDF URLs and get back clean text and document metadata. Output is one text block per page or per document. It runs in batches and is also callable as a synchronous API, so AI agents and automations can extract PDFs on demand. Digital PDFs need no OCR, and there is no upload step and no key.

What it does

Fetches each PDF URL with redirects followed and a 60-second timeout
Extracts text page by page with line reconstruction - not one giant word soup
Reads the document's own metadata (title, author, producer, dates) as published in the file
Returns one structured record per document, with per-page text blocks if you want them
Records a failed URL as an error record instead of killing the batch
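
The error-record behavior can be illustrated with a minimal Python sketch. This is not the actor's source code: `extract_batch`, `fake_extract`, and the `error` field name are hypothetical, chosen only to show how one bad URL can be isolated from the rest of a batch.

```python
from typing import Callable


def extract_batch(urls: list[str], extract_one: Callable[[str], dict]) -> list[dict]:
    """Run extract_one on every URL; a failure becomes an error record."""
    records = []
    for url in urls:
        try:
            records.append(extract_one(url))
        except Exception as exc:  # isolate the failure to this one URL
            # Assumed error-record shape; the real field names may differ.
            records.append({"url": url, "error": f"{type(exc).__name__}: {exc}"})
    return records


def fake_extract(url: str) -> dict:
    """Hypothetical stand-in for the real fetch-and-parse step."""
    if not url.lower().endswith(".pdf"):
        raise ValueError("not a PDF")
    return {"url": url, "pageCount": 1, "pages": [{"page": 1, "text": "hello"}]}


results = extract_batch(
    ["https://example.com/a.pdf", "https://example.com/missing"], fake_extract
)
```

Here `results` holds two records: a normal one for the first URL and an error record for the second, so the caller still gets one record per input.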

What you get per document

{
  "url": "https://arxiv.org/pdf/1706.03762",
  "pageCount": 15,
  "pagesExtracted": 15,
  "truncated": false,
  "metadata": { "producer": "pdfTeX", "creationDate": "..." },
  "pages": [
    { "page": 1, "text": "Attention Is All You Need\n..." }
  ]
}
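
A record like the one above could be consumed along these lines. This is a sketch under assumptions: the helper names are hypothetical, the sample values are made up for illustration, and treating `truncated` plus a `pagesExtracted`/`pageCount` mismatch as "incomplete" is a guess at the field semantics, not documented behavior.

```python
def full_text(record: dict) -> str:
    """Join per-page text in page order, separated by blank lines."""
    pages = sorted(record.get("pages", []), key=lambda p: p["page"])
    return "\n\n".join(p["text"] for p in pages)


def is_complete(record: dict) -> bool:
    """Assumed check: not truncated and every page was extracted."""
    not_truncated = not record.get("truncated", False)
    return not_truncated and record.get("pagesExtracted") == record.get("pageCount")


# Illustrative record with made-up values, using the field names shown above.
sample = {
    "url": "https://example.com/report.pdf",
    "pageCount": 3,
    "pagesExtracted": 2,
    "truncated": True,
    "metadata": {},
    "pages": [
        {"page": 2, "text": "Second page"},
        {"page": 1, "text": "First page"},
    ],
}

text = full_text(sample)
complete = is_complete(sample)
```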

Use cases

RAG and AI pipelines - turn report URLs into page-aligned chunks for embedding
Agents - call the standby endpoint as a tool: "read this PDF and answer"
Document monitoring - pair with a schedule to extract recurring reports, filings, price lists
Research - batch-extract paper PDFs into searchable text
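
For the RAG use case, page-aligned chunking might look like the sketch below. The function name, the `offset` field, and the fixed-size character split are all assumptions made for illustration; a real pipeline would more likely split on sentences or tokens.

```python
def page_chunks(record: dict, max_chars: int = 1000) -> list[dict]:
    """Split each page's text into chunks that never cross a page boundary."""
    chunks = []
    for page in record.get("pages", []):
        text = page["text"]
        # Fixed-size character windows; pages with empty text yield no chunks.
        for start in range(0, len(text), max_chars):
            chunks.append(
                {
                    "url": record["url"],
                    "page": page["page"],
                    "offset": start,
                    "text": text[start:start + max_chars],
                }
            )
    return chunks


# Illustrative record with made-up text.
record = {
    "url": "https://example.com/report.pdf",
    "pages": [
        {"page": 1, "text": "abcdefghij"},
        {"page": 2, "text": "xyz"},
    ],
}

chunks = page_chunks(record, max_chars=4)
```

Each chunk keeps its source URL and page number, so an answer built from retrieved chunks can cite the page it came from.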

Pricing

Event	Price (USD)
Run start	$0.0005
Per page extracted	$0.002
API call (standby)	$0.02

Comparable actors charge $0.022 to $0.04 per page. A 100-page report here costs about $0.20 (100 pages at $0.002 each, plus the $0.0005 run start). The live price on the Apify page is authoritative.
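
The arithmetic can be checked with a small estimator. This sketch uses the prices from the table above; how the events combine (for example, whether a standby call also incurs a run start) is an assumption left to the caller, and the function name is hypothetical.

```python
from decimal import Decimal

# Prices copied from the table above; the live Apify page is authoritative.
RUN_START = Decimal("0.0005")
PER_PAGE = Decimal("0.002")
STANDBY_CALL = Decimal("0.02")


def estimate_cost(pages: int, run_starts: int = 1, standby_calls: int = 0) -> Decimal:
    """Sum the listed per-event prices for the given event counts."""
    return RUN_START * run_starts + PER_PAGE * pages + STANDBY_CALL * standby_calls


# One batch run over a 100-page report.
batch_cost = estimate_cost(100)

# The same report through one standby API call, assuming no run-start charge.
standby_cost = estimate_cost(100, run_starts=0, standby_calls=1)
```

Under these assumptions the batch run comes to $0.2005 and the standby call to $0.22. `Decimal` is used instead of floats so that sub-cent prices add up exactly.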

Runs on Apify - free account, pay per page, batch or synchronous standby endpoint for agents.
