Text Extractor Plugin for Aspose.PDF

The Aspose.PDF Text Extractor Plugin for .NET lets developers extract text from PDF files in one of three modes: Pure (formatted), Raw (as-is), or Plain (cleaned). It suits document conversion, data mining, accessibility improvements, and similar tasks.

Aspose.PDF Text Extractor Plugin Key Features

  1. Multiple Extraction Modes: Extract text as Pure (formatted), Raw (as-is), or Plain (cleaned), depending on the output you need.

  2. Batch PDF Processing: Add multiple PDFs to a single extraction run for streamlined workflows.

  3. Simple .NET Integration: A straightforward API that can be added to any C# or .NET project.


Getting Started with Aspose.PDF Text Extractor Plugin

  1. Install Aspose.PDF for .NET: Add the package via NuGet, or download the assemblies and reference them in your .NET solution.

  2. Configure Your License: Apply your license to enable unrestricted processing and support.

  3. Configure Extraction Options: Create a TextExtractor and a TextExtractorOptions instance, and set the extraction mode (Pure, Raw, or Plain).

  4. Process and Retrieve Text: Run the extraction and read the output from the result container's ResultCollection.
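
Step 2 can be sketched as follows. This is a minimal example, not the only way to license the library: it assumes the license file is named Aspose.PDF.lic and sits next to the executable (a hypothetical name and location), and it uses the License class from the core Aspose.Pdf namespace.

```csharp
using System;
using System.IO;

// Hypothetical location of the license file; adjust to your deployment.
string licensePath = Path.Combine(AppContext.BaseDirectory, "Aspose.PDF.lic");

if (File.Exists(licensePath))
{
    // Apply the license once at startup, before any extraction runs.
    var license = new Aspose.Pdf.License();
    license.SetLicense(licensePath);
    Console.WriteLine("License applied.");
}
else
{
    // No license file was found, so nothing is applied here.
    Console.WriteLine($"License file not found at {licensePath}.");
}
```

Checking for the file first keeps a missing license from surfacing as an exception in the middle of an extraction run.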


Example: Extract Text from a PDF (C#)

using System;
using Aspose.Pdf.Plugins;

// Create the plugin and choose the Pure (formatted) extraction mode.
var extractor = new TextExtractor();
var options = new TextExtractorOptions(TextExtractorOptions.TextFormattingMode.Pure);

// Register the input PDF.
options.AddInput(new FileDataSource(@"C:\Samples\sample.pdf"));

// Run the extraction and read the text from the first result.
var resultContainer = extractor.Process(options);
string extractedText = resultContainer.ResultCollection[0].ToString();
Console.WriteLine(extractedText);

Example: Batch Extract Text from Multiple PDFs

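using System;
using Aspose.Pdf.Plugins;

string[] pdfFiles = { "sample1.pdf", "sample2.pdf" };

// Create the plugin and choose the Raw (as-is) extraction mode.
var extractor = new TextExtractor();
var options = new TextExtractorOptions(TextExtractorOptions.TextFormattingMode.Raw);

// Register every input PDF on the same options object.
foreach (var file in pdfFiles)
{
    options.AddInput(new FileDataSource(file));
}

// Run the extraction once, then print each entry in the result collection.
var resultContainer = extractor.Process(options);
for (int i = 0; i < resultContainer.ResultCollection.Count; i++)
{
    string text = resultContainer.ResultCollection[i].ToString();
    Console.WriteLine(text);
}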

Use Cases & Extensions

  • PDF to TXT Conversion: Automate conversion of PDFs to plain text for indexing, search, or archival.
  • Data Mining: Extract table data, invoices, or forms for further processing or analytics.
  • Accessibility: Prepare readable content for screen readers or alternate formats.
  • Batch Processing: Process sets of PDFs with the extraction mode suited to each downstream workflow (e.g., OCR pre-processing, entity recognition).

For advanced extraction, such as handling encrypted PDFs or customizing text output, refer to the official API Reference.
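
The PDF to TXT use case might look like the sketch below. It reuses only the calls shown in the examples above and adds standard .NET file I/O. The input path is hypothetical, and Plain mode is an assumption about what suits indexing; pick whichever mode fits your output.

```csharp
using System;
using System.IO;
using Aspose.Pdf.Plugins;

// Hypothetical input path; the output is written next to it with a .txt extension.
string pdfPath = @"C:\Samples\sample.pdf";
string txtPath = Path.ChangeExtension(pdfPath, ".txt");

// Configure the extraction for a single input in Plain mode.
var extractor = new TextExtractor();
var options = new TextExtractorOptions(TextExtractorOptions.TextFormattingMode.Plain);
options.AddInput(new FileDataSource(pdfPath));

// Extract the text and take the first result.
var resultContainer = extractor.Process(options);
string text = resultContainer.ResultCollection[0].ToString();

// Save the text as a plain .txt file.
File.WriteAllText(txtPath, text);
Console.WriteLine($"Wrote {text.Length} characters to {txtPath}");
```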


Best Practices

  • Always select the extraction mode that matches your output needs: Pure for formatted text, Raw for as-is text, or Plain for cleaned text.
  • For large document sets, batch process to maximize throughput and minimize manual effort.
  • Test extraction results with real-world PDFs to ensure data accuracy.
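
One way to act on these points is to run a representative PDF through all three modes and compare the output before committing to one. The sketch below assumes a hypothetical sample path and uses only the calls from the examples above. The 200-character preview length is an arbitrary choice.

```csharp
using System;
using Aspose.Pdf.Plugins;

// Hypothetical test document; substitute a real-world PDF from your own data.
string pdfPath = @"C:\Samples\sample.pdf";

var modes = new[]
{
    TextExtractorOptions.TextFormattingMode.Pure,
    TextExtractorOptions.TextFormattingMode.Raw,
    TextExtractorOptions.TextFormattingMode.Plain,
};

foreach (var mode in modes)
{
    // Use a fresh extractor and options object for each mode.
    var extractor = new TextExtractor();
    var options = new TextExtractorOptions(mode);
    options.AddInput(new FileDataSource(pdfPath));

    string text = extractor.Process(options).ResultCollection[0].ToString();

    // Print the mode name, the output size, and a short preview for comparison.
    string preview = text.Length > 200 ? text.Substring(0, 200) : text;
    Console.WriteLine($"--- {mode}: {text.Length} characters ---");
    Console.WriteLine(preview);
}
```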
