How to Convert Scanned PDFs into Searchable Text Documents in .NET

Scanned PDFs are hard to work with because each page is only an image of text: the content cannot be searched, selected, or copied. Running OCR over those pages produces text that can be searched and edited. With Aspose.OCR for .NET, you can convert scanned PDFs into fully searchable documents while preserving the original page images.

Why Convert Scanned PDFs into Searchable Text Documents?

  1. Accessibility:
    • Make scanned content searchable, so information can be found without reading through the whole document.
  2. Content Editing:
    • Once converted into text, the content can be edited, updated, or re-used in other formats.
  3. Efficiency:
    • Save time by automating the process of converting scanned PDFs into fully accessible text documents.

Prerequisites: Setting Up for Scanned PDF Text Extraction

Before extracting text from scanned PDFs, follow these steps to ensure everything is set up:

  1. Install Aspose.OCR for .NET:
    • Add Aspose.OCR to your project using NuGet:
      dotnet add package Aspose.OCR
  2. Obtain Your License:
    • Set up your metered license using SetMeteredKey() to unlock the full functionality of Aspose.OCR.
  3. Prepare Your Scanned PDF:
    • Ensure that the scanned PDFs are of good quality for better recognition accuracy.

Step-by-Step Guide: Converting Scanned PDFs to Searchable Text

Step 1: Set Up Your License

Start by configuring your Aspose.OCR license to unlock all features.

using System;
using System.Collections.Generic;
using Aspose.OCR;

Metered license = new Metered();
license.SetMeteredKey("<your public key>", "<your private key>");
Console.WriteLine("License configured successfully.");

Step 2: Load the Scanned PDF into the OCR Input Object

Next, load the scanned PDF into the OcrInput object to begin the OCR process.

OcrInput input = new OcrInput(Aspose.OCR.InputType.PDF);
input.Add("scanned_document.pdf", 0, 3);  // Start at page index 0 and process the first 3 pages
Console.WriteLine("Scanned PDF loaded successfully.");

Step 3: Configure the OCR Engine for Recognition

Create the OCR engine and a RecognitionSettings object. This example sets only the recognition language.

Aspose.OCR.AsposeOcr recognitionEngine = new Aspose.OCR.AsposeOcr();
RecognitionSettings settings = new RecognitionSettings();
settings.Language = Aspose.OCR.Language.Latin;  // Set the OCR language
Console.WriteLine("OCR engine configured.");

Step 4: Extract and Output the Recognized Text

Now, extract the text from the scanned PDF using the OCR engine.

List<Aspose.OCR.RecognitionResult> results = recognitionEngine.Recognize(input, settings);
Console.WriteLine("Text successfully extracted from the scanned PDF.");

// Print the recognized text, one result per processed page
foreach (Aspose.OCR.RecognitionResult result in results)
{
    Console.WriteLine(result.RecognitionText);
}

// Save each page's text to its own file (results[0] alone would cover only the first page)
for (int i = 0; i < results.Count; i++)
{
    results[i].Save($"recognized_text_page{i + 1}.txt", Aspose.OCR.SaveFormat.Text);
}
Console.WriteLine("Text saved to recognized_text_page<N>.txt files.");
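
A plain text file is not itself a searchable PDF. The sketch below shows one way the recognition results might be written out as a single PDF instead. It assumes that your installed Aspose.OCR version provides the static AsposeOcr.SaveMultipageDocument method and the SaveFormat.Pdf value; check both against the API reference for your version. The output file name searchable_document.pdf is only an illustration.

```csharp
using System.Collections.Generic;
using Aspose.OCR;

// Recognize the first three pages, as in the steps above.
var input = new OcrInput(InputType.PDF);
input.Add("scanned_document.pdf", 0, 3);

var engine = new AsposeOcr();
var settings = new RecognitionSettings { Language = Language.Latin };
List<RecognitionResult> results = engine.Recognize(input, settings);

// Assumption: SaveMultipageDocument and SaveFormat.Pdf exist in your version.
// The call is expected to combine all page results into one PDF document.
AsposeOcr.SaveMultipageDocument("searchable_document.pdf", SaveFormat.Pdf, results);
```

If the call succeeds, the resulting file can be opened in any PDF viewer to confirm that the text is searchable.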

Step 5: Verify the Output

Open the saved output and compare it with the scanned pages. If you saved the results as a PDF, use the viewer's search function to confirm that words on the page can be found and selected.

Common Issues and Fixes

1. Low OCR Accuracy

  • Solution: Ensure that the scanned PDF is of high quality (at least 300 DPI) to improve the recognition results.

2. Unsupported Fonts

  • Solution: Set the OCR language to match the document. A mismatched language setting is a common cause of wrongly recognized characters, especially for non-Latin scripts.

3. Slow Performance for Large PDFs

  • Solution: For large PDFs, process the document in smaller page ranges (a few pages per input.Add call) to reduce memory usage.
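
The chunking advice can be sketched as a small helper that splits a page count into (start page, page count) batches. This is a minimal sketch rather than part of Aspose.OCR: PageBatches is a hypothetical helper, the 23-page total and batch size of 10 are made-up values, and the zero-based start index is assumed from the input.Add("scanned_document.pdf", 0, 3) call in Step 2.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: split totalPages into (Start, Count) batches.
// Start is zero-based, matching input.Add(path, 0, 3) in Step 2.
static List<(int Start, int Count)> PageBatches(int totalPages, int batchSize)
{
    if (totalPages < 0) throw new ArgumentOutOfRangeException(nameof(totalPages));
    if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

    var batches = new List<(int Start, int Count)>();
    for (int start = 0; start < totalPages; start += batchSize)
    {
        // The last batch may be shorter than batchSize.
        batches.Add((start, Math.Min(batchSize, totalPages - start)));
    }
    return batches;
}

// Made-up example: a 23-page document processed 10 pages at a time.
foreach (var (firstPage, pageCount) in PageBatches(23, 10))
{
    Console.WriteLine($"start page {firstPage}, {pageCount} pages");
    // For each batch, one could create a fresh OcrInput, call
    // input.Add("scanned_document.pdf", firstPage, pageCount),
    // recognize it, and save the results before moving on.
}
```

Processing one batch at a time means only that batch's page images and results need to be held in memory at once.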