Atlas Verified

How We Built a Document OCR Pipeline That Actually Works for International Trade

Atlas Verified Team · October 15, 2025 · 6 min read

If you've ever tried to extract structured data from a bill of lading using a general-purpose OCR service, you already know the punchline: it doesn't work. Not reliably, anyway. And in international trade compliance, "not reliably" is the same as "not at all."

At Atlas Verified, we process thousands of trade documents — bills of lading, organic certificates, phytosanitary certificates, certificates of origin, commercial invoices, packing lists, and dozens more. Early on, we evaluated every major OCR platform on the market. They all failed the same test: give them a document they haven't been specifically trained on, and the output ranges from incomplete to dangerously wrong.

This post explains why we built a custom document processing pipeline, what makes trade documents uniquely difficult, and the engineering principles that guide our approach.

The Document Diversity Problem

Most OCR services are optimized for a narrow set of document types. Google Document AI excels at invoices and receipts. AWS Textract handles structured forms well. Azure Form Recognizer can be trained on custom templates. But international trade generates a staggering variety of paperwork.

In our system alone, we handle over 35 distinct document types. A bill of lading looks nothing like an organic certificate. A phytosanitary certificate from the USDA shares almost no structural similarity with one issued by the EU. An arrival notice from one shipping line may be a structured PDF with clean tables; from another, it's a scanned fax with handwritten annotations.

The critical insight is that each document type has its own schema — a specific set of fields that matter for compliance verification. A bill of lading needs shipper, consignee, notify party, container numbers, vessel name, ports of loading and discharge, and commodity descriptions. An organic certificate needs the certifying agent, operation name, NOP ID, certified products, and effective dates.
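To make "each document type has its own schema" concrete, here is a minimal sketch of what two such schemas might look like as typed records. The field names follow the lists above, but the class names and structure are our own illustration, not Atlas Verified's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class BillOfLading:
    """Fields a compliance check needs from a bill of lading."""
    shipper: str
    consignee: str
    notify_party: str
    vessel_name: str
    port_of_loading: str
    port_of_discharge: str
    container_numbers: list[str] = field(default_factory=list)
    commodity_descriptions: list[str] = field(default_factory=list)

@dataclass
class OrganicCertificate:
    """Fields a compliance check needs from an organic certificate."""
    certifying_agent: str
    operation_name: str
    nop_id: str
    certified_products: list[str] = field(default_factory=list)
    effective_date: str = ""
```

The point of a typed schema is that a missing or malformed field fails loudly at extraction time, rather than surfacing later as a silent gap in a verification report.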

Generic OCR tools don't understand these schemas. They extract text. They might identify tables. But they don't know that the string "MEDU4107760" is a container number, or that "NOP ID: 7880315519" is a critical identifier that needs to be verified against the USDA Organic Integrity Database. Without schema awareness, OCR is just expensive text conversion.

Why "Good Enough" Is Dangerous

In most software applications, a 95% accuracy rate is excellent. In trade compliance, it's a liability.

Consider a container number. The standard format (ISO 6346) is four letters followed by seven digits, with the last digit being a check digit. If OCR misreads a single character — a "0" becomes an "O", an "8" becomes a "B" — the container number is invalid. Downstream systems that attempt to track that container will return no results, and the verification chain breaks silently.
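The ISO 6346 check digit makes this class of error detectable in code. The sketch below implements the published algorithm — letters map to values 10 through 38 skipping multiples of 11, each of the first ten characters is weighted by a doubling power of two, and the sum mod 11 mod 10 must equal the final digit. The function names are ours.

```python
import re
import string

# ISO 6346 letter values: 10-38, skipping multiples of 11.
LETTER_VALUES = {}
_v = 10
for _c in string.ascii_uppercase:
    if _v % 11 == 0:
        _v += 1
    LETTER_VALUES[_c] = _v
    _v += 1

def iso6346_check_digit(container: str) -> int:
    """Compute the check digit over the first 10 characters."""
    total = 0
    for i, ch in enumerate(container[:10]):
        value = LETTER_VALUES[ch] if ch.isalpha() else int(ch)
        total += value * (2 ** i)  # position weight doubles each step
    return total % 11 % 10

def is_valid_container(container: str) -> bool:
    """Four letters + seven digits, and the check digit must match."""
    if not re.fullmatch(r"[A-Z]{4}\d{7}", container):
        return False
    return iso6346_check_digit(container) == int(container[10])
```

A misread character then trips one of the two checks: a "0" that becomes an "O" fails the format check outright, and a single substituted digit almost always breaks the check digit.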

The same principle applies to NOP IDs, HS codes, IMO numbers, and certification dates. A single character error can change the meaning entirely. An HS code of 0901.11 (coffee, not roasted, not decaffeinated) versus 0901.12 (coffee, not roasted, decaffeinated) triggers different tariff rates, different regulatory requirements, and different risk profiles.
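One way to make such near-miss codes harder to confuse is to parse them into their hierarchy instead of treating them as opaque strings. This small helper is our own illustration, not a real tariff library:

```python
import re
from typing import NamedTuple

class HSCode(NamedTuple):
    chapter: str     # digits 1-2, e.g. "09" = coffee, tea, mate, spices
    heading: str     # digits 3-4
    subheading: str  # digits 5-6

def parse_hs_code(code: str) -> HSCode:
    """Split a six-digit HS code like '0901.11' into its hierarchy."""
    digits = re.sub(r"\D", "", code)
    if len(digits) != 6:
        raise ValueError(f"expected a 6-digit HS code, got {code!r}")
    return HSCode(digits[:2], digits[2:4], digits[4:6])
```

With the code decomposed, a downstream rule engine can see that 0901.11 and 0901.12 agree on chapter and heading but diverge at the subheading — exactly the one-character difference that changes the regulatory treatment.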

Our Architecture

Our document processing pipeline has three stages, each designed to maximize accuracy for the specific challenges of trade documentation.

Stage 1: Document Classification — Before extracting any data, we first identify what type of document we're looking at. This classification determines which extraction schema to apply, what fields to look for, and how to validate the results.

Stage 2: Schema-Aware Extraction — With the document classified, we apply a type-specific extraction model that knows exactly what fields to look for and where they typically appear in that document type. This is fundamentally different from generic OCR, which extracts everything and leaves the interpretation to the user.
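Stage 2 can be pictured as a registry that maps each document type to the fields its schema demands. The regexes below are simplified placeholders — real extraction is model-driven and layout-aware — but they show the schema-aware idea: the pipeline only looks for fields the document type is supposed to contain.

```python
import re

# Each document type maps to its required fields and a (simplified)
# pattern for finding each one. Patterns here are illustrative only.
EXTRACTION_SCHEMAS = {
    "bill_of_lading": {
        "container_numbers": r"\b[A-Z]{4}\d{7}\b",
        "vessel_name": r"Vessel:\s*(\S.*)",
    },
    "organic_certificate": {
        "nop_id": r"NOP ID:\s*(\d+)",
    },
}

def extract_fields(doc_type: str, text: str) -> dict[str, list[str]]:
    """Apply the type-specific schema; unknown types yield nothing."""
    schema = EXTRACTION_SCHEMAS.get(doc_type, {})
    return {
        field_name: re.findall(pattern, text)
        for field_name, pattern in schema.items()
    }
```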

Stage 3: Cross-Validation — Extracted data is automatically cross-referenced against external data sources. Container numbers are validated against shipping line APIs. NOP IDs are checked against the USDA Organic Integrity Database. Port codes are verified against the World Port Index. This stage catches both OCR errors and fraudulent data.
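Stage 3 can be sketched as a chain of validators per field, mixing cheap local checks with external lookups. The lookup functions below are stubs standing in for real API calls; the validator names and registry shape are our own illustration:

```python
def container_format_ok(container: str) -> bool:
    """Local structural check: four letters then seven digits."""
    return (len(container) == 11
            and container[:4].isalpha()
            and container[4:].isdigit())

def lookup_container(container: str) -> bool:
    """Stub standing in for a shipping line API call."""
    return True  # pretend the carrier confirmed the container

def lookup_nop_id(nop_id: str) -> bool:
    """Stub standing in for a USDA Organic Integrity Database query."""
    return True  # pretend the certification is current

VALIDATORS = {
    "container_numbers": [container_format_ok, lookup_container],
    "nop_id": [str.isdigit, lookup_nop_id],
}

def cross_validate(fields: dict[str, list[str]]) -> dict[str, bool]:
    """A field passes only if every check passes for every value."""
    results = {}
    for name, values in fields.items():
        checks = VALIDATORS.get(name, [])
        results[name] = all(check(v) for v in values for check in checks)
    return results
```

Running the cheap local checks first means an obvious OCR error never costs an external API call, and a value that passes structurally but fails the external lookup is a strong fraud signal rather than a transcription mistake.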

Lessons Learned

After processing tens of thousands of trade documents, several principles have become clear:

1. Domain-specific beats general-purpose. Every attempt to use a general OCR service required so much post-processing that we spent more engineering time fixing its output than we would have spent building a custom solution.

2. Classification is the foundation. Getting the document type wrong cascades into every subsequent step. We invest heavily in classification accuracy because everything downstream depends on it.

3. Validation is more important than extraction. Extracting a container number is useful. Confirming that it's a real container on a real vessel at the correct port is what actually matters for compliance.

4. Structured output enables automation. The goal is not to digitize documents for human reading. It's to produce machine-readable structured data that feeds directly into automated verification workflows.

Building a reliable document processing pipeline for international trade is genuinely hard. The documents are diverse, the accuracy requirements are extreme, and the consequences of errors are significant. But it's also foundational — every verification capability we build depends on being able to accurately read and understand the documents that move with goods through global supply chains.
