Document Parsing & Data Cleansing

Parse structured data from PDFs and use AI to fix extraction errors. The documents followed predictable formats with minor variations, but even small errors compounded quickly downstream.

A client needed to extract structured data from a set of PDF documents and load it into their operational database. The PDF formats were consistent, with only a handful of variants, but the data entry process was manual, slow, and error-prone.

Challenge

Rule-based PDF parsing could handle the basic extraction, but consistently missed edge cases: formatting inconsistencies, OCR artifacts, and structural variations that were rare individually but frequent in aggregate. These errors went undetected until they caused problems in downstream reporting. The client needed a system that could catch what rule-based parsing could not, without requiring a large infrastructure investment.

What We Built

We used Python-based PDF parsing as a first pass, then an LLM to clean and correct the extracted data. Because the document formats were constrained, the AI layer stayed narrow, focused on error correction rather than open-ended interpretation. Validation compared results against a curated golden dataset and routed mismatches to human review.
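The flow described above (deterministic parse, narrow LLM correction, golden-dataset check) can be sketched roughly like this. All names are illustrative, and the LLM call is stubbed with a simple OCR-artifact fix; the real system's parser, prompt, and fields are not shown in the source:

```python
from dataclasses import dataclass

@dataclass
class Record:
    invoice_no: str
    total: str

def rule_based_parse(text: str) -> Record:
    # First pass: deterministic parsing of the constrained format.
    # (Stand-in for the real Python PDF-parsing layer.)
    fields = dict(line.split(": ", 1) for line in text.splitlines() if ": " in line)
    return Record(invoice_no=fields.get("Invoice", ""), total=fields.get("Total", ""))

def llm_correct(record: Record) -> Record:
    # Hypothetical LLM step: a narrow prompt asking only for error
    # correction (e.g. OCR confusing 'O' with '0'), never open-ended
    # interpretation. Stubbed here with one such fix.
    return Record(record.invoice_no.replace("O", "0"), record.total)

def validate(record: Record, golden: dict[str, Record]) -> bool:
    # Compare against the curated golden dataset; a mismatch means
    # the document is routed to human review.
    expected = golden.get(record.invoice_no)
    return expected is not None and expected == record
```

The key design point is that each stage is independently testable: the parser and the correction step can both be checked against the same golden records.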

The infrastructure was minimal: the Python layer runs on the same server as the client's FileMaker system, keeping costs low. Half of the 150 hours of effort went into building and validating the golden dataset; the other half covered extraction logic, prompts, and AI unit tests.
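An AI unit test in this setting can be as simple as replaying the golden dataset through the pipeline and reporting any record that no longer matches. A minimal sketch, with illustrative names rather than the client's actual code:

```python
def run_golden_suite(pipeline, golden_cases):
    """Replay hand-curated (raw_text, expected) pairs through the pipeline.

    Any mismatch is returned as a failure for human inspection, so a
    prompt or parser change cannot silently regress extraction quality.
    """
    failures = []
    for raw, expected in golden_cases:
        got = pipeline(raw)
        if got != expected:
            failures.append({"input": raw, "expected": expected, "got": got})
    return failures
```

Run on every change to prompts or parsing logic, an empty failure list becomes the gate for deploying an update.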

What Changed

Data extraction errors dropped significantly, and the manual review burden shrank to the small percentage of documents the system flags as low-confidence. Even on a project this contained, the golden dataset was the foundation that made everything else trustworthy: "simple" AI extraction projects still require rigorous validation data.
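The low-confidence flagging can be sketched as a simple threshold split. The source does not say how confidence is scored or what cutoff is used, so both are assumptions here:

```python
def route_by_confidence(scored_records, threshold=0.9):
    # Records at or above the threshold flow straight to the database;
    # the rest are flagged for manual review. The 0.9 cutoff is
    # illustrative, not the client's actual setting.
    auto_load, needs_review = [], []
    for record, confidence in scored_records:
        if confidence >= threshold:
            auto_load.append(record)
        else:
            needs_review.append(record)
    return auto_load, needs_review
```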