PDF Text Extractor - URL to clean text, per page
Give it public PDF URLs, get back clean text and document metadata. One block per page or per document, batch-capable, and callable as a synchronous API so AI agents and automations can extract PDFs on demand. No OCR needed for digital PDFs, no upload step, no key.
What it does
- Fetches each PDF URL with redirects followed and a 60-second timeout
- Extracts text page by page with line reconstruction - not one giant word soup
- Reads the document's own metadata (title, author, producer, dates) as published in the file
- One structured record per document, with per-page text blocks if you want them
- Failed URLs produce an error record instead of killing the batch
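The last point, failures isolated per URL, can be illustrated with a minimal sketch. This is not the service's actual implementation: `extract_batch`, `fake_extract`, and the `{"url", "error"}` record shape are hypothetical names chosen for illustration.

```python
from typing import Callable


def extract_batch(urls: list[str], extract_one: Callable[[str], dict]) -> list[dict]:
    """Run extract_one on each URL; a failure becomes an error record."""
    records = []
    for url in urls:
        try:
            records.append(extract_one(url))
        except Exception as exc:
            # Isolate the failure to this URL so the rest of the batch continues.
            # The error record shape here is an assumption, not the documented one.
            records.append({"url": url, "error": str(exc)})
    return records


def fake_extract(url: str) -> dict:
    # Hypothetical stand-in for the real fetch-and-extract step.
    if not url.lower().endswith(".pdf"):
        raise ValueError("not a PDF")
    return {"url": url, "pageCount": 1, "pages": [{"page": 1, "text": "hello"}]}


results = extract_batch(
    ["https://example.com/a.pdf", "https://example.com/missing"], fake_extract
)
```

Here the second URL raises inside `fake_extract`, yet `results` still holds two entries: one normal record and one error record.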
What you get per document
{
  "url": "https://arxiv.org/pdf/1706.03762",
  "pageCount": 15,
  "pagesExtracted": 15,
  "truncated": false,
  "metadata": { "producer": "pdfTeX", "creationDate": "..." },
  "pages": [
    { "page": 1, "text": "Attention Is All You Need\n..." }
  ]
}
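A short sketch of how a consumer might read such a record. It assumes `truncated` and a `pagesExtracted` value below `pageCount` both signal partial output, which the sample suggests but does not spell out. The helper names `is_complete` and `full_text` are hypothetical.

```python
record = {
    "url": "https://arxiv.org/pdf/1706.03762",
    "pageCount": 15,
    "pagesExtracted": 15,
    "truncated": False,
    "metadata": {"producer": "pdfTeX", "creationDate": "..."},
    # Only the first page is shown, as in the sample above.
    "pages": [{"page": 1, "text": "Attention Is All You Need\n..."}],
}


def is_complete(rec: dict) -> bool:
    """True when every page was extracted and nothing was cut off (assumed semantics)."""
    return not rec["truncated"] and rec["pagesExtracted"] == rec["pageCount"]


def full_text(rec: dict) -> str:
    """Join the per-page text blocks in page order into one document string."""
    pages = sorted(rec.get("pages", []), key=lambda p: p["page"])
    return "\n\n".join(p["text"] for p in pages)
```

Checking `is_complete(record)` before indexing is one way to avoid silently storing a partially extracted document.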
Use cases
- RAG and AI pipelines - turn report URLs into page-aligned chunks for embedding
- Agents - call the standby endpoint as a tool: "read this PDF and answer"
- Document monitoring - pair with a schedule to extract recurring reports, filings, price lists
- Research - batch-extract paper PDFs into searchable text
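For the RAG use case, page-aligned chunking could look roughly like the sketch below. It assumes records shaped like the sample under "What you get per document". `page_chunks` and the fixed `max_chars` split are illustrative choices, not part of the service.

```python
def page_chunks(rec: dict, max_chars: int = 1000) -> list[dict]:
    """Split each page's text into chunks that never cross a page boundary."""
    chunks = []
    for page in rec.get("pages", []):
        text = page["text"].strip()
        # Fixed-width slicing keeps the example simple; a real pipeline
        # might split on sentences or paragraphs instead.
        for start in range(0, len(text), max_chars):
            chunks.append({
                "url": rec["url"],
                "page": page["page"],
                "text": text[start:start + max_chars],
            })
    return chunks


sample = {
    "url": "https://example.com/report.pdf",  # hypothetical URL
    "pages": [
        {"page": 1, "text": "a" * 2500},
        {"page": 2, "text": "short page"},
    ],
}
chunks = page_chunks(sample)
```

Each chunk keeps its `url` and `page`, so an answer generated from an embedding hit can cite the exact page it came from.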
Pricing
| Event | Price (USD) |
|---|---|
| Run start | $0.0005 |
| Per page extracted | $0.002 |
| API call (standby) | $0.02 |
Comparable actors charge $0.022-$0.04 per page. A 100-page report here costs about $0.20 (100 pages × $0.002, plus the $0.0005 run start). The live price on the Apify page is authoritative.
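The pricing table can be turned into a rough estimator. This sketch assumes the three events simply add up. How the standby call charge combines with the run-start charge is not stated here, so both are left as parameters. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Prices copied from the table above; the live Apify page is authoritative.
RUN_START = Decimal("0.0005")
PER_PAGE = Decimal("0.002")
STANDBY_CALL = Decimal("0.02")


def estimate_cost(pages: int, standby_calls: int = 0, runs: int = 1) -> Decimal:
    """Estimated USD cost, assuming the listed events are charged additively."""
    return runs * RUN_START + pages * PER_PAGE + standby_calls * STANDBY_CALL


batch_cost = estimate_cost(100)                      # one batch run, 100 pages
standby_cost = estimate_cost(100, standby_calls=1)   # same, plus one standby call
```

`Decimal` is used instead of floats so that sub-cent prices add up exactly.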
Runs on Apify - free account, pay per page, batch or synchronous standby endpoint for agents.
Related: Structured Data Extractor · SEC EDGAR Filing Monitor