How to Extract Text from PDFs in .NET

This article demonstrates how to extract text from PDF documents using the Aspose.PDF Text Extractor for .NET. You'll learn to use the three supported extraction modes (Pure, Raw, and Plain) and to automate extraction for a single PDF or a batch of them.

Real-World Problem

Manually copying text from PDFs is inefficient and error-prone. For applications in data analysis, document migration, or archiving, automated text extraction ensures consistency, speed, and accuracy.

Solution Overview

Aspose.PDF Text Extractor for .NET provides a programmable interface for extracting text in three modes. Choose Pure when the spatial layout matters, Raw when you want unprocessed text, or Plain when you want clean content with minimal spacing.


Prerequisites

  • Visual Studio 2019 or later
  • .NET 6.0 or later
  • Aspose.PDF for .NET installed via NuGet

PM> Install-Package Aspose.PDF

Step-by-Step Implementation

Step 1: Install and Configure Aspose.PDF

With the NuGet package installed, import the namespaces used in the steps below:

using System;
using System.IO;
using Aspose.Pdf.Plugins;

Step 2: Extract Text Using Default (Raw) Mode

using (var extractor = new TextExtractor())
{
    var options = new TextExtractorOptions(); // Raw mode by default
    options.AddInput(new FileDataSource("input.pdf"));
    var resultContainer = extractor.Process(options);
    string textExtracted = resultContainer.ResultCollection[0].ToString();
    Console.WriteLine(textExtracted);
}

Step 3: Extract Text in Pure or Plain Mode

  • Pure Mode: Preserves relative positions and adds spaces for alignment.
  • Plain Mode: Strips formatting, outputs text with minimal spaces.

using (var extractor = new TextExtractor())
{
    var options = new TextExtractorOptions(TextExtractorOptions.TextFormattingMode.Pure); // Or .Plain
    options.AddInput(new FileDataSource("input.pdf"));
    var resultContainer = extractor.Process(options);
    string textExtracted = resultContainer.ResultCollection[0].ToString();
    Console.WriteLine(textExtracted);
}

Use Cases & Applications (With Code Variations)

1. Batch Extract Text from Multiple PDFs

string[] files = Directory.GetFiles(@"C:\PDFs", "*.pdf");
string outputDir = @"C:\PDFs\out";
Directory.CreateDirectory(outputDir); // File.WriteAllText fails if the folder is missing

using (var extractor = new TextExtractor())
{
    var options = new TextExtractorOptions(TextExtractorOptions.TextFormattingMode.Pure);
    foreach (var file in files)
    {
        options.AddInput(new FileDataSource(file));
    }

    var resultContainer = extractor.Process(options);

    // Assumes results come back in the same order the inputs were added
    for (int i = 0; i < resultContainer.ResultCollection.Count; i++)
    {
        string extracted = resultContainer.ResultCollection[i].ToString();
        // Save to disk, process, or analyze as needed
        string outputPath = Path.Combine(outputDir, Path.GetFileNameWithoutExtension(files[i]) + ".txt");
        File.WriteAllText(outputPath, extracted);
    }
}

2. Choose Extraction Mode Based on Use Case

  • Use Pure for table-like layouts or spatial formatting.
  • Use Plain for clean data extraction or analysis.
  • Use Raw for unprocessed text.
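
If it is not obvious which mode suits a document, one option is to run the same file through all three and compare the results. The following is a minimal sketch rather than a definitive implementation: it assumes the enum exposes a Raw member alongside Pure and Plain (the article names Raw as the default but never spells out the enum value), and the file names are placeholders.

```csharp
using System;
using System.IO;
using Aspose.Pdf.Plugins;

class CompareModes
{
    static void Main()
    {
        // Assumption: TextFormattingMode.Raw exists next to Pure and Plain.
        var modes = new[]
        {
            TextExtractorOptions.TextFormattingMode.Pure,
            TextExtractorOptions.TextFormattingMode.Plain,
            TextExtractorOptions.TextFormattingMode.Raw
        };

        foreach (var mode in modes)
        {
            // A fresh extractor per mode keeps each run independent.
            using (var extractor = new TextExtractor())
            {
                var options = new TextExtractorOptions(mode);
                options.AddInput(new FileDataSource("input.pdf"));
                var resultContainer = extractor.Process(options);
                string text = resultContainer.ResultCollection[0].ToString();

                // One output file per mode (input.Pure.txt, ...) so they can be diffed.
                File.WriteAllText($"input.{mode}.txt", text);
                Console.WriteLine($"{mode}: {text.Length} characters");
            }
        }
    }
}
```

Opening the three output files in a diff tool usually shows quickly whether the extra spacing from Pure helps or hinders downstream parsing.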

3. Post-process Extracted Text

After extraction, apply regex, text cleaning, or send results to other services (search, ML pipelines, etc.).
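
As an illustration of the cleaning step, the sketch below normalizes whitespace with regular expressions from the standard library. TextCleanup and Clean are hypothetical names, not part of Aspose.PDF, and the specific rules are assumptions about what typical extracted text needs.

```csharp
using System;
using System.Text.RegularExpressions;

static class TextCleanup
{
    // Hypothetical helper: tidies whitespace in extracted text.
    public static string Clean(string text)
    {
        // Unify Windows and classic Mac line endings to '\n'.
        string s = text.Replace("\r\n", "\n").Replace('\r', '\n');
        // Collapse runs of spaces and tabs into a single space.
        s = Regex.Replace(s, @"[ \t]+", " ");
        // Remove a stray space on either side of a line break.
        s = Regex.Replace(s, @" ?\n ?", "\n");
        // Rejoin words split by a hyphen at a line end: "extrac-\ntion" -> "extraction".
        s = Regex.Replace(s, @"(\p{L})-\n(\p{L})", "$1$2");
        // Reduce three or more consecutive newlines to one blank line.
        s = Regex.Replace(s, @"\n{3,}", "\n\n");
        return s.Trim();
    }
}
```

Pass the extracted string through TextCleanup.Clean before saving or indexing it. The hyphen rule also joins genuine compounds that happen to break across lines, so drop it if your documents rely on them.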

4. Integrate Extraction with Data Pipelines

Automate extraction as part of a broader ETL, reporting, or document management workflow using standard .NET practices.


Common Challenges and Solutions

Challenge: Inconsistent output due to complex PDF structure
Solution: Try different extraction modes (Pure, Plain, Raw) and compare results. Post-process if needed.

Challenge: Batch extraction speed
Solution: Use a single TextExtractor instance and process multiple files in one run for best performance.

Challenge: Special characters or encoding issues
Solution: Use Plain mode for minimal formatting, then apply custom string processing as needed.
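
One possible form of that custom string processing is Unicode normalization, which uses only the .NET base class library. The sketch below is an assumption about what commonly helps with PDF text (ligatures, non-breaking spaces, soft hyphens, stray control characters). TextNormalizer and NormalizeExtracted are hypothetical names.

```csharp
using System;
using System.Text;

static class TextNormalizer
{
    // Hypothetical helper: folds compatibility characters and strips invisible ones.
    public static string NormalizeExtracted(string text)
    {
        // NFKC maps compatibility forms to plain equivalents,
        // e.g. the ligature U+FB01 becomes "fi" and U+00A0 becomes a regular space.
        string s = text.Normalize(NormalizationForm.FormKC);

        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            // Drop soft hyphens, which are invisible line-break hints.
            if (c == '\u00AD') continue;
            // Drop control characters except tab and line breaks.
            if (char.IsControl(c) && c != '\n' && c != '\r' && c != '\t') continue;
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

NFKC is lossy by design: it also rewrites characters such as superscript digits, so it suits search and analysis better than archival copies.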


Performance and Best Practices

  • Test all three extraction modes to determine optimal results for your document type
  • Save original PDFs before batch operations
  • Handle output filenames and organization in batch jobs
  • Integrate error handling and logging for robustness
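
The last two points can be sketched as a small batch runner that isolates failures per file. This is an illustrative pattern, not part of Aspose.PDF: BatchRunner and Run are hypothetical names, and the extract delegate stands in for whatever turns a PDF path into text, such as a wrapper around the TextExtractor calls from the steps above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class BatchRunner
{
    // Hypothetical helper: runs 'extract' on each path, writes <name>.txt into outputDir,
    // logs the outcome, and returns the paths that failed.
    public static List<string> Run(IEnumerable<string> pdfPaths, string outputDir,
                                   Func<string, string> extract, TextWriter log)
    {
        Directory.CreateDirectory(outputDir); // no-op if the folder already exists
        var failed = new List<string>();

        foreach (string path in pdfPaths)
        {
            try
            {
                string text = extract(path);
                string outPath = Path.Combine(outputDir,
                    Path.GetFileNameWithoutExtension(path) + ".txt");
                File.WriteAllText(outPath, text);
                log.WriteLine($"OK     {path} -> {outPath}");
            }
            catch (Exception ex)
            {
                // Record the failure and continue with the remaining files.
                failed.Add(path);
                log.WriteLine($"FAILED {path}: {ex.Message}");
            }
        }
        return failed;
    }
}
```

Passing Console.Out as the log writer is enough for a console job. Extracting one file per call gives up the speed of a single combined run in exchange for knowing exactly which file failed. Inputs that share a file name in different folders would overwrite each other's output here.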

Complete Implementation Example

using Aspose.Pdf.Plugins;
using System;
using System.IO;

class Program
{
    static void Main()
    {
        using (var extractor = new TextExtractor())
        {
            var options = new TextExtractorOptions(TextExtractorOptions.TextFormattingMode.Plain);
            options.AddInput(new FileDataSource(@"C:\PDFs\input.pdf"));
            var resultContainer = extractor.Process(options);
            string textExtracted = resultContainer.ResultCollection[0].ToString();
            File.WriteAllText(@"C:\PDFs\output.txt", textExtracted);
        }
    }
}

Conclusion

Aspose.PDF Text Extractor for .NET extracts text in three modes (Pure, Raw, and Plain), which makes it suitable for data processing, archiving, and analysis. Choose the mode that best matches your documents and automate extraction within your .NET applications.
