How to Extract Invoice Line Items and Structured Tables

Extracting line items and tables from invoices is essential for fully automating accounts payable, audits, and spend analytics. Aspose.OCR Invoice to Text for .NET lets you parse detailed, multi-row, multi-column data, even from scanned or photographed documents.

Real-World Problem

Invoices contain tables of products and services. Transcribing them manually is time-consuming and error-prone, so full automation requires robust extraction of the itemized details.

Solution Overview

Use OCR to recognize table regions, parse each row and column, and export to structured formats for ERP, BI, or further analysis.


Prerequisites

  1. Visual Studio 2019 or later
  2. .NET 6.0 or later (or .NET Framework 4.6.2+)
  3. Aspose.OCR for .NET from NuGet
  4. Sample invoice images or PDFs with line items/tables

Install the package from the Package Manager Console:

PM> Install-Package Aspose.OCR

Step-by-Step Implementation

Step 1: Prepare Invoice Image/PDF

string invoiceFile = "invoice_with_items.pdf";

Step 2: Recognize Table/Line Item Regions

using System.Collections.Generic;
using Aspose.OCR;

// Configure invoice recognition for English text
InvoiceRecognitionSettings settings = new InvoiceRecognitionSettings();
settings.Language = Language.English;

// Load the PDF and run recognition
AsposeOcr ocr = new AsposeOcr();
OcrInput input = new OcrInput(InputType.PDF);
input.Add(invoiceFile);
List<RecognitionResult> results = ocr.RecognizeInvoice(input, settings);

// Only the first result is used here
string fullText = results[0].RecognitionText;

Step 3: Parse Recognized Text into Table Rows/Columns

  • Use regex or custom logic to split each recognized line item into columns
using System;
using System.Text.RegularExpressions;

// Example: split into lines, then columns (simplified)
// Split on both '\r' and '\n' so Windows line endings leave no stray '\r' in the last column
string[] lines = fullText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
{
    if (Regex.IsMatch(line, @"\d+\s+[A-Za-z].*\s+\d+[.,]\d{2}")) // crude line item match
    {
        // Note: splitting on spaces also splits multi-word descriptions
        string[] columns = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        // Map columns: SKU, description, qty, price, total, etc.
    }
}
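
Splitting on single spaces breaks a multi-word description into several columns. The following is a minimal sketch of one alternative, assuming every line item follows the layout "SKU, description, quantity, unit price, total". That layout, the LineItem class, and the LineItemParser helper are illustrative assumptions and not part of Aspose.OCR. The idea is to anchor the numeric columns at the right-hand end of the line and treat everything between the SKU and the quantity as the description.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed row; not an Aspose.OCR type.
public class LineItem
{
    public string Sku { get; set; }
    public string Description { get; set; }
    public int Qty { get; set; }
    public decimal UnitPrice { get; set; }
    public decimal Total { get; set; }
}

public static class LineItemParser
{
    // Assumed layout: SKU, free-text description, integer qty, unit price, total.
    private static readonly Regex RowPattern = new Regex(
        @"^\s*(?<sku>\S+)\s+(?<desc>.+?)\s+(?<qty>\d{1,9})\s+(?<price>\d+[.,]\d{2})\s+(?<total>\d+[.,]\d{2})\s*$");

    // Returns false (and a null item) when the line does not fit the assumed layout.
    public static bool TryParse(string line, out LineItem item)
    {
        item = null;
        Match m = RowPattern.Match(line);
        if (!m.Success) return false;

        item = new LineItem
        {
            Sku = m.Groups["sku"].Value,
            Description = m.Groups["desc"].Value,
            Qty = int.Parse(m.Groups["qty"].Value, CultureInfo.InvariantCulture),
            UnitPrice = ParseAmount(m.Groups["price"].Value),
            Total = ParseAmount(m.Groups["total"].Value)
        };
        return true;
    }

    // Treats a comma as a decimal separator, matching the [.,] in the pattern.
    private static decimal ParseAmount(string text) =>
        decimal.Parse(text.Replace(',', '.'), CultureInfo.InvariantCulture);
}
```

Lines that fail to parse (subtotals, headers, address lines) simply return false, which makes them easy to skip or to log for review.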

Step 4: Export Line Items/Table to CSV

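using System.IO;

using (var writer = new StreamWriter("invoice_lineitems.csv"))
{
    // Header row; the column order must match the order of the parsed fields
    writer.WriteLine("SKU,Description,Qty,UnitPrice,Total");
    // Loop over the line items parsed in Step 3 and write one row per item
}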
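
Descriptions can contain commas or double quotes, which corrupt a CSV row if written verbatim. A small sketch of RFC 4180-style quoting follows; the CsvFormat helper is a hypothetical name introduced here, not part of Aspose.OCR or the .NET base library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvFormat
{
    // Quotes a field when it contains a comma, quote, or line break,
    // and doubles any embedded quotes.
    public static string Escape(string field)
    {
        if (field == null) return string.Empty;
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Joins already-parsed column values into one CSV row.
    public static string ToRow(IEnumerable<string> fields) =>
        string.Join(",", fields.Select(f => Escape(f)));
}
```

Inside the writer block, each parsed item could then be written with something like writer.WriteLine(CsvFormat.ToRow(columns)).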

Step 5: Complete Example

using Aspose.OCR;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string invoiceFile = "invoice_with_items.pdf";

        // Configure and run invoice recognition
        InvoiceRecognitionSettings settings = new InvoiceRecognitionSettings();
        settings.Language = Language.English;
        AsposeOcr ocr = new AsposeOcr();
        OcrInput input = new OcrInput(InputType.PDF);
        input.Add(invoiceFile);
        List<RecognitionResult> results = ocr.RecognizeInvoice(input, settings);
        if (results.Count == 0) return; // nothing recognized
        string fullText = results[0].RecognitionText; // first result only

        // Keep lines that look like line items and split them into columns
        var lineItems = new List<string[]>();
        string[] lines = fullText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            if (Regex.IsMatch(line, @"\d+\s+[A-Za-z].*\s+\d+[.,]\d{2}"))
                lineItems.Add(line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
        }

        // Write the rows to CSV. Splitting on spaces means a multi-word description
        // spans several columns, so rows may not line up with the header.
        using (var writer = new StreamWriter("invoice_lineitems.csv"))
        {
            writer.WriteLine("SKU,Description,Qty,UnitPrice,Total");
            foreach (var item in lineItems)
                writer.WriteLine(string.Join(",", item));
        }
    }
}

Use Cases and Applications

Spend Analytics and AP Automation

Extract itemized spend for reporting, forecasting, and approval.

Invoice Audit and Review

Streamline audit/approval with exportable, machine-readable details.

ERP/Finance System Integration

Directly load structured table data into financial or ERP software.


Common Challenges and Solutions

Challenge 1: Varied Table Formats

Solution: Tune regex and parsing logic for each supplier/template.
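
One way to organize per-supplier tuning is a lookup of patterns keyed by a supplier identifier, with a generic fallback. The sketch below is illustrative only: the supplier keys, the row layouts they imply, and the TemplatePatterns helper are all invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplatePatterns
{
    // Generic fallback: the crude line item pattern from Step 3.
    private static readonly Regex Default =
        new Regex(@"\d+\s+[A-Za-z].*\s+\d+[.,]\d{2}");

    // Illustrative entries only; real keys and patterns come from your own suppliers.
    private static readonly Dictionary<string, Regex> BySupplier =
        new Dictionary<string, Regex>(StringComparer.OrdinalIgnoreCase)
        {
            // Hypothetical supplier whose rows start with an "SKU-" prefixed code.
            ["acme"] = new Regex(@"^SKU-\d+\s+.+\s+\d+\s+\d+[.,]\d{2}\s+\d+[.,]\d{2}$"),
            // Hypothetical supplier that separates columns with pipes.
            ["globex"] = new Regex(@"^[^|]+\|[^|]+\|\s*\d+\s*\|\s*\d+[.,]\d{2}\s*\|\s*\d+[.,]\d{2}\s*$")
        };

    // Returns the supplier-specific pattern, or the fallback for unknown or null keys.
    public static Regex For(string supplierKey)
    {
        Regex rx;
        return supplierKey != null && BySupplier.TryGetValue(supplierKey, out rx) ? rx : Default;
    }
}
```

Keeping the patterns in one place makes it easier to add a supplier or adjust a template without touching the parsing loop.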

Challenge 2: OCR Recognition Errors in Columns

Solution: Use column heuristics, request higher-quality scans, or flag for manual review.
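
A simple column heuristic is an arithmetic cross-check: quantity times unit price should equal the line total, within a rounding tolerance. A mismatch often points to a misread digit. The sketch below assumes totals carry no per-line discount or tax; LineItemChecks is a hypothetical helper name.

```csharp
using System;

public static class LineItemChecks
{
    // True when qty * unitPrice matches total within the tolerance (default: one cent).
    public static bool TotalsAgree(int qty, decimal unitPrice, decimal total, decimal tolerance = 0.01m) =>
        Math.Abs(qty * unitPrice - total) <= tolerance;
}
```

Rows that fail the check can be routed to manual review rather than exported.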

Challenge 3: Merged or Missing Columns

Solution: Normalize, split, or prompt review on ambiguous cases.


Performance Considerations

  • Batch process multiple invoices in parallel
  • Log/flag parsing issues for human review
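
Both points can be combined in one batch loop. The sketch below uses Parallel.ForEach with a bounded degree of parallelism and collects failures instead of aborting the batch. The recognition step is passed in as a delegate, so the sketch makes no assumption about whether an OCR engine instance can be shared across threads; BatchRunner is a hypothetical helper name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BatchRunner
{
    // recognize: caller-supplied function that turns one file path into recognized text.
    // failures: files whose recognition threw, with the exception, for human review.
    public static IDictionary<string, string> Run(
        IEnumerable<string> files,
        Func<string, string> recognize,
        out IDictionary<string, Exception> failures,
        int maxParallel = 4)
    {
        var texts = new ConcurrentDictionary<string, string>();
        var failed = new ConcurrentDictionary<string, Exception>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(files, options, file =>
        {
            try { texts[file] = recognize(file); }
            catch (Exception ex) { failed[file] = ex; } // keep going; flag this file
        });

        failures = failed;
        return texts;
    }
}
```

The delegate could wrap the recognition code from Step 2, for example by creating its own engine instance per call.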

Best Practices

  1. Build regex library for known templates
  2. Review sample outputs to tune parsing
  3. Secure original/processed files for audit
  4. Update extraction logic as suppliers/templates change

Advanced Scenarios

Scenario 1: Map Line Items to ERP/Database Directly

Use ORM or API calls to push extracted table data.
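
As a rough sketch of the API route, parsed rows can be serialized to JSON with System.Text.Json and sent as an HTTP request body. The LineItemDto shape and the camelCase property names are assumptions; match them to whatever contract your ERP or database layer actually expects.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical payload shape; not defined by Aspose.OCR or any particular ERP.
public class LineItemDto
{
    public string Sku { get; set; }
    public string Description { get; set; }
    public int Qty { get; set; }
    public decimal UnitPrice { get; set; }
    public decimal Total { get; set; }
}

public static class ErpPayload
{
    // Serializes parsed rows to a JSON array with camelCase property names.
    public static string ToJson(IEnumerable<LineItemDto> items) =>
        JsonSerializer.Serialize(items, new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase
        });
}
```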

Scenario 2: Handle Multi-Page Tables

Extend logic to parse across PDF/image page breaks.
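
A minimal sketch of that extension: take the recognized text of each page in order, merge the lines into one sequence, and drop the column header when it repeats on continuation pages. It assumes the header row repeats verbatim (ignoring case) on every page; MultiPageTables is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

public static class MultiPageTables
{
    // pageTexts: recognized text of each page, in page order.
    // headerLine: the table's column header as it appears in the recognized text.
    public static List<string> MergeRows(IEnumerable<string> pageTexts, string headerLine)
    {
        var rows = new List<string>();
        bool headerSeen = false;
        foreach (string page in pageTexts)
        {
            foreach (string raw in page.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
            {
                string line = raw.Trim();
                if (line.Length == 0) continue;

                bool isHeader = string.Equals(line, headerLine, StringComparison.OrdinalIgnoreCase);
                if (isHeader && headerSeen) continue; // repeated header on a continuation page
                if (isHeader) headerSeen = true;
                rows.Add(line);
            }
        }
        return rows;
    }
}
```

The merged rows can then go through the same line item matching as a single-page invoice.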


Conclusion

Aspose.OCR Invoice to Text for .NET makes it possible to extract detailed invoice line items and tables, turning scans and photos into structured, actionable data.

See more structured extraction code in the Aspose.OCR for .NET API Reference.
