Sample: Document Data Extraction from Supplier PDFs
This worked example shows how the document data extraction service turned a pile of mixed supplier documents into one traceable table for a small manufacturer. It covers what was sent, what came back, and what the human reviewer corrected.
The documents
The manufacturer sent nine documents in three shapes:
- Five product spec PDFs, with the useful numbers buried in tables.
- Two scanned datasheets, lower quality and partly faint.
- Two purchase confirmations.
They wanted five fields per part — part number, description, unit price, lead time, and minimum order quantity — pulled into one sheet they could actually compare.
What came back
The service returned a single table, one row per part, with each cell carrying its source document and page. That traceability matters: when a price looks wrong later, the manufacturer can jump straight to the document it came from instead of re-reading nine files.
Where a value was genuinely unclear — three reads from the faint datasheets — it was flagged for confirmation, not filled with a confident-looking guess.
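One way to picture that table is as a row per part in which every cell keeps its own source. The sketch below is illustrative only: the names `Cell`, `FIELDS`, `needs_confirmation`, and `flagged_fields` are assumptions, not the service's actual output format, and the file names and values are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# The five fields requested per part (names are hypothetical).
FIELDS = ("part_number", "description", "unit_price", "lead_time", "min_order_qty")


@dataclass(frozen=True)
class Cell:
    """One extracted value plus where it came from."""
    value: Optional[str]              # None when the read was unclear
    source_doc: str                   # document the value was read from
    page: int                         # page within that document
    needs_confirmation: bool = False  # True for flagged, unconfirmed reads


def flagged_fields(row: dict[str, Cell]) -> list[str]:
    """Return the names of the fields in a row that still need confirmation."""
    return [name for name in FIELDS if row[name].needs_confirmation]


# Placeholder row, not real extracted data.
example_row = {
    "part_number": Cell("EX-001", "example_spec.pdf", 2),
    "description": Cell("Example bracket", "example_spec.pdf", 2),
    "unit_price": Cell(None, "example_scan.pdf", 1, needs_confirmation=True),
    "lead_time": Cell("14 days", "example_spec.pdf", 3),
    "min_order_qty": Cell("100", "example_spec.pdf", 3),
}

print(flagged_fields(example_row))
```

Keeping the source on the cell rather than on the row means a part whose fields come from different documents stays fully traceable.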
What the reviewer corrected
A human reviewer checked the extraction against the source pages:
- A scanned price had its decimal point in the wrong place; the reviewer fixed it against the original.
- Two part numbers had merged into one across a table cell boundary during reading; the reviewer split them back apart.
- Three faint datasheet values stayed uncertain, so each was flagged by its exact field rather than guessed.
The rule is simple: a wrong number that looks right is worse than a flagged blank. Review enforces that.
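As a rough sketch of that rule, a value is kept only when it has been confirmed against the source page, and anything else becomes a flagged blank. The names `Reviewed` and `apply_review_rule` are hypothetical and chosen for illustration; this is not a description of how the review is actually implemented.

```python
from typing import NamedTuple, Optional


class Reviewed(NamedTuple):
    field: str             # which field the value belongs to
    value: Optional[str]   # None means "blank, to be confirmed"
    flagged: bool          # True when the value still needs confirmation


def apply_review_rule(field: str, extracted: str, confirmed_against_source: bool) -> Reviewed:
    """Keep a value only if it was confirmed; otherwise return a flagged blank."""
    if confirmed_against_source:
        return Reviewed(field, extracted, False)
    # An unconfirmed read is never passed through as if it were reliable.
    return Reviewed(field, None, True)


# Placeholder inputs, not real extracted data.
kept = apply_review_rule("lead_time", "14 days", confirmed_against_source=True)
blank = apply_review_rule("unit_price", "1.25", confirmed_against_source=False)
```

Here the unconfirmed price is dropped rather than carried forward, so the sheet shows a marked gap instead of a plausible-looking number.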
The deliverable
The manufacturer got one comparable parts table, every figure traceable to its source, and three clearly marked values to confirm. No reading nine PDFs side by side, no silent errors hiding in a tidy-looking sheet.
Document sets vary in volume and quality, so how deeply each value is checked is agreed at intake.