How to Extract Invoice Line Items and Structured Tables
Extracting line items and tables from invoices is a prerequisite for fully automating accounts payable, audits, and spend analytics. Aspose.OCR Invoice to Text for .NET lets you parse detailed, multi-row, multi-column data, even from scanned or photographed documents.
Real-World Problem
Invoices contain tables of products and services. Transcribing them manually is time-consuming and error-prone, so full automation requires robust extraction of itemized details.
Solution Overview
Use OCR to recognize table regions, parse each row and column, and export to structured formats for ERP, BI, or further analysis.
Prerequisites
- Visual Studio 2019 or later
- .NET 6.0 or later (or .NET Framework 4.6.2+)
- Aspose.OCR for .NET from NuGet
- Sample invoice images or PDFs with line items/tables
Install the package from the NuGet Package Manager Console:
PM> Install-Package Aspose.OCR
Step-by-Step Implementation
Step 1: Prepare Invoice Image/PDF
string invoiceFile = "invoice_with_items.pdf";
Step 2: Recognize Table/Line Item Regions
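using System.Collections.Generic;
using Aspose.OCR;

// Configure invoice-specific recognition for English text
InvoiceRecognitionSettings settings = new InvoiceRecognitionSettings();
settings.Language = Language.English;

AsposeOcr ocr = new AsposeOcr();
OcrInput input = new OcrInput(InputType.PDF);
input.Add(invoiceFile);

List<RecognitionResult> results = ocr.RecognizeInvoice(input, settings);
// Reads the first result only; a multi-page PDF yields additional entries in results
string fullText = results[0].RecognitionText;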
Step 3: Parse Recognized Text into Table Rows/Columns
- Use regex or custom logic to split out line items by row/column delimiters
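using System;
using System.Text.RegularExpressions;

// Example: split into lines, then columns (simplified)
// Split on both '\r' and '\n' so Windows line endings do not leave a stray '\r' in the last column
string[] lines = fullText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
{
    // Crude line item match: a number, some text, then an amount with two decimals
    if (Regex.IsMatch(line, @"\d+\s+[A-Za-z].*\s+\d+[.,]\d{2}"))
    {
        // Note: splitting on whitespace also splits multi-word descriptions
        string[] columns = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        // Map columns: SKU, description, qty, price, total, etc.
    }
}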
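Splitting on spaces breaks as soon as a description contains more than one word. One possible alternative is a single regex with named groups. The sketch below assumes the column order SKU, description, quantity, unit price, total (the same order as the CSV header used in this article); `LineItem` and `LineItemParser` are hypothetical names introduced for illustration and are not part of Aspose.OCR.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public sealed class LineItem
{
    public string Sku { get; set; }
    public string Description { get; set; }
    public int Quantity { get; set; }
    public decimal UnitPrice { get; set; }
    public decimal Total { get; set; }
}

public static class LineItemParser
{
    // Assumed column order: SKU, description, quantity, unit price, total.
    private static readonly Regex Row = new Regex(
        @"^\s*(?<sku>\S+)\s+(?<desc>.+?)\s+(?<qty>\d+)\s+(?<price>\d+[.,]\d{2})\s+(?<total>\d+[.,]\d{2})\s*$");

    // Returns false when the line does not match the assumed layout.
    public static bool TryParse(string line, out LineItem item)
    {
        item = null;
        Match m = Row.Match(line);
        int qty;
        decimal price, total;
        if (!m.Success
            || !int.TryParse(m.Groups["qty"].Value, NumberStyles.None, CultureInfo.InvariantCulture, out qty)
            || !TryAmount(m.Groups["price"].Value, out price)
            || !TryAmount(m.Groups["total"].Value, out total))
        {
            return false;
        }

        item = new LineItem
        {
            Sku = m.Groups["sku"].Value,
            Description = m.Groups["desc"].Value,
            Quantity = qty,
            UnitPrice = price,
            Total = total,
        };
        return true;
    }

    // Treats ',' as a decimal separator, mirroring the [.,] in the pattern.
    private static bool TryAmount(string text, out decimal value)
    {
        return decimal.TryParse(text.Replace(',', '.'), NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture, out value);
    }
}
```

Calling `LineItemParser.TryParse(line, out var item)` inside the loop keeps the description in one field because the quantity and the two amounts are anchored to the end of the line. Invoices with a different column order need a different pattern.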
Step 4: Export Line Items/Table to CSV
using System.IO;

using (var writer = new StreamWriter("invoice_lineitems.csv"))
{
    writer.WriteLine("SKU,Description,Qty,UnitPrice,Total");
    // Loop and write line items parsed above
}
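Values that contain a comma, such as a description or an amount written as 12,50, shift every following column if they are written to the CSV unquoted. A minimal sketch of field quoting in the style of RFC 4180 follows; the `Csv` helper class is a hypothetical name introduced here.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Csv
{
    // Quotes a field only when it contains a comma, a double quote, or a line break.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Joins already-parsed column values into one CSV row.
    public static string ToRow(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(Escape));
    }
}
```

Inside the `using` block, `writer.WriteLine(Csv.ToRow(columns))` would then replace a plain `string.Join(",", columns)`.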
Step 5: Complete Example
using Aspose.OCR;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string invoiceFile = "invoice_with_items.pdf";

        InvoiceRecognitionSettings settings = new InvoiceRecognitionSettings();
        settings.Language = Language.English;

        AsposeOcr ocr = new AsposeOcr();
        OcrInput input = new OcrInput(InputType.PDF);
        input.Add(invoiceFile);
        List<RecognitionResult> results = ocr.RecognizeInvoice(input, settings);

        // Reads the first result only; a multi-page PDF yields additional entries in results
        string fullText = results[0].RecognitionText;

        var lineItems = new List<string[]>();
        // Split on both '\r' and '\n' so Windows line endings do not leave a stray '\r' in the last column
        string[] lines = fullText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            // Crude line item match; multi-word descriptions still split into several columns
            if (Regex.IsMatch(line, @"\d+\s+[A-Za-z].*\s+\d+[.,]\d{2}"))
                lineItems.Add(line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries));
        }

        using (var writer = new StreamWriter("invoice_lineitems.csv"))
        {
            writer.WriteLine("SKU,Description,Qty,UnitPrice,Total");
            foreach (var item in lineItems)
            {
                // Quote every field so decimal commas (for example 12,50) do not create extra columns
                writer.WriteLine(string.Join(",", item.Select(f => "\"" + f.Replace("\"", "\"\"") + "\"")));
            }
        }
    }
}
Use Cases and Applications
Spend Analytics and AP Automation
Extract itemized spend for reporting, forecasting, and approval.
Invoice Audit and Review
Streamline audit/approval with exportable, machine-readable details.
ERP/Finance System Integration
Directly load structured table data into financial or ERP software.
Common Challenges and Solutions
Challenge 1: Varied Table Formats
Solution: Tune regex and parsing logic for each supplier/template.
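One way to organize per-supplier tuning is a lookup table of patterns with a generic fallback. The sketch below is illustrative only: the supplier keys `acme` and `globex` and both layouts are invented for the example, and `SupplierTemplates` is a hypothetical helper class. The fallback is the crude pattern used earlier in this article.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SupplierTemplates
{
    // Hypothetical supplier keys and layouts, for illustration only.
    private static readonly Dictionary<string, Regex> Patterns = new Dictionary<string, Regex>
    {
        // Layout: SKU, description, qty, unit price, total
        ["acme"] = new Regex(@"^(?<sku>\S+)\s+(?<desc>.+?)\s+(?<qty>\d+)\s+(?<price>\d+[.,]\d{2})\s+(?<total>\d+[.,]\d{2})$"),
        // Layout: "3 x description total" with no SKU or unit price column
        ["globex"] = new Regex(@"^(?<qty>\d+)\s*x\s+(?<desc>.+?)\s+(?<total>\d+[.,]\d{2})$"),
    };

    // Generic pattern used when no template is registered for the supplier.
    private static readonly Regex Fallback = new Regex(@"\d+\s+[A-Za-z].*\s+\d+[.,]\d{2}");

    public static Match MatchLine(string supplierKey, string line)
    {
        Regex pattern;
        if (supplierKey == null || !Patterns.TryGetValue(supplierKey, out pattern))
        {
            pattern = Fallback;
        }
        return pattern.Match(line.Trim());
    }
}
```

A caller checks `Match.Success` and reads `Groups["qty"]`, `Groups["desc"]`, and so on. Groups that a template does not define come back with `Success` set to false, so missing columns can be detected per supplier.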
Challenge 2: OCR Recognition Errors in Columns
Solution: Use column heuristics, request higher-quality scans, or flag for manual review.
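Two simple column heuristics are sketched below: repairing letters that OCR tends to confuse with digits in numeric columns, and checking that quantity times unit price matches the stated total. The character substitutions and the 0.01 tolerance are assumptions chosen for illustration, and `ColumnChecks` is a hypothetical helper class.

```csharp
using System;
using System.Globalization;

public static class ColumnChecks
{
    // Replaces letters commonly misread for digits. Apply to numeric columns only.
    public static string FixDigits(string text)
    {
        return text.Replace('O', '0').Replace('o', '0')
                   .Replace('l', '1').Replace('I', '1')
                   .Replace('S', '5');
    }

    // Parses an amount after digit repair; ',' is treated as a decimal separator.
    public static bool TryParseAmount(string text, out decimal amount)
    {
        return decimal.TryParse(FixDigits(text).Replace(',', '.'),
            NumberStyles.AllowDecimalPoint, CultureInfo.InvariantCulture, out amount);
    }

    // True when quantity x unit price equals the total within a rounding tolerance.
    public static bool TotalsAgree(int quantity, decimal unitPrice, decimal total, decimal tolerance = 0.01m)
    {
        return Math.Abs(quantity * unitPrice - total) <= tolerance;
    }
}
```

Rows where `TotalsAgree` returns false are good candidates for manual review, since the mismatch usually points to a misread digit in one of the three numeric columns.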
Challenge 3: Merged or Missing Columns
Solution: Normalize whitespace, split merged values where the boundary is unambiguous, and flag ambiguous rows for manual review.
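A conservative way to detect merged or missing columns is to count the numeric tokens at the end of each row and flag any row whose count differs from what the layout expects. This is a sketch under the assumption that numeric columns (for example quantity, unit price, total) come last; `RowStatus` and `RowNormalizer` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public enum RowStatus { Ok, NeedsReview }

public static class RowNormalizer
{
    // An integer, or an amount with one or two decimals.
    private static readonly Regex Numeric = new Regex(@"^\d+([.,]\d{1,2})?$");

    // Collects the run of numeric tokens at the end of the row; everything before it is text.
    public static RowStatus Classify(string line, int expectedNumericColumns, out string[] numericColumns)
    {
        string[] tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        int start = tokens.Length;
        while (start > 0 && Numeric.IsMatch(tokens[start - 1]))
        {
            start--;
        }
        numericColumns = tokens.Skip(start).ToArray();
        return numericColumns.Length == expectedNumericColumns ? RowStatus.Ok : RowStatus.NeedsReview;
    }
}
```

The check deliberately over-flags: a description that ends in a number also produces an unexpected count, which sends the row to review rather than risking a silent column shift.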
Performance Considerations
- Batch process multiple invoices in parallel
- Log/flag parsing issues for human review
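The two points above can be combined in a small batch runner: process files in parallel, and collect failures for later review instead of aborting the whole batch. In this sketch `processInvoice` is a hypothetical delegate that wraps the per-file OCR and parsing steps, and the default of four parallel workers is an arbitrary assumption. Whether a single OCR engine instance can be shared across threads is not covered here, so the safest assumption is to create one inside the delegate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BatchRunner
{
    // Runs processInvoice for every file and returns one message per failed file.
    public static IReadOnlyCollection<string> Run(
        IEnumerable<string> files, Action<string> processInvoice, int maxParallel = 4)
    {
        var failures = new ConcurrentBag<string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(files, options, file =>
        {
            try
            {
                processInvoice(file);
            }
            catch (Exception ex)
            {
                // Record the failure for human review and keep the batch going.
                failures.Add(file + ": " + ex.Message);
            }
        });

        return failures.ToArray();
    }
}
```

The returned collection can be written to a log or a review queue once the batch finishes.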
Best Practices
- Build a regex library for known templates
- Review sample outputs to tune parsing
- Secure original/processed files for audit
- Update extraction logic as suppliers/templates change
Advanced Scenarios
Scenario 1: Map Line Items to ERP/Database Directly
Use an ORM or API calls to push the extracted table data into the target system.
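For the API route, a minimal sketch is to serialize the parsed rows to JSON and post them with `HttpClient`. Everything specific here is an assumption: the field names `invoiceNumber` and `lineItems` do not follow any particular ERP schema, `endpoint` stands for whatever import URL your system exposes, and `ErpExporter` is a hypothetical helper class. `System.Text.Json` ships with .NET 6; on .NET Framework it requires the NuGet package of the same name.

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class ErpExporter
{
    // Builds the request body; the field names are illustrative, not a real ERP schema.
    public static string BuildPayload(string invoiceNumber, List<string[]> lineItems)
    {
        var payload = new Dictionary<string, object>
        {
            ["invoiceNumber"] = invoiceNumber,
            ["lineItems"] = lineItems,
        };
        return JsonSerializer.Serialize(payload);
    }

    // Posts the JSON body; endpoint is a placeholder for your system's import URL.
    public static async Task<bool> PushAsync(HttpClient client, string endpoint, string json)
    {
        using (var content = new StringContent(json, Encoding.UTF8, "application/json"))
        using (HttpResponseMessage response = await client.PostAsync(endpoint, content))
        {
            return response.IsSuccessStatusCode;
        }
    }
}
```

Keeping payload construction separate from the HTTP call makes the mapping easy to unit test without a live endpoint.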
Scenario 2: Handle Multi-Page Tables
Extend logic to parse across PDF/image page breaks.
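A simple starting point is to merge the text of all pages into one list of lines and drop table headers that repeat on continuation pages. The sketch below takes plain strings, so it does not depend on the OCR library; `MultiPageTable` is a hypothetical helper, and the header pattern is something you supply per template.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultiPageTable
{
    // Concatenates lines from all pages, keeping only the first occurrence of the table header.
    public static List<string> MergePages(IEnumerable<string> pageTexts, string headerPattern)
    {
        var header = new Regex(headerPattern, RegexOptions.IgnoreCase);
        var merged = new List<string>();
        bool headerSeen = false;

        foreach (string page in pageTexts)
        {
            foreach (string raw in page.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
            {
                string line = raw.Trim();
                if (line.Length == 0) continue;

                if (header.IsMatch(line))
                {
                    if (headerSeen) continue; // repeated header on a continuation page
                    headerSeen = true;
                }
                merged.Add(line);
            }
        }
        return merged;
    }
}
```

With the recognition results from the steps above, the page texts could be supplied as `results.Select(r => r.RecognitionText)` (with `using System.Linq;`), and the merged lines then go through the same row parsing as a single-page invoice.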
Conclusion
Aspose.OCR Invoice to Text for .NET makes it possible to extract detailed invoice line items and tables, enabling automation from scan or photo to structured, actionable data.
See more structured extraction code in the Aspose.OCR for .NET API Reference.