How to Extract Text, Images, and Metadata from Word Documents in .NET
Extracting text, images, and metadata from Word documents is a common first step in document analysis and processing. With Aspose.Words for .NET, developers can programmatically retrieve document content and properties for use cases such as indexing, archiving, or content transformation.
Prerequisites
- Install the .NET SDK.
- Add the Aspose.Words NuGet package:
dotnet add package Aspose.Words
- Prepare a Word document (document.docx) with text, images, and metadata.
Step-by-Step Guide to Extract Content from Word Files
1. Load the Word Document
using System;
using Aspose.Words;
class Program
{
static void Main()
{
// Step 1: Load the Word document
string filePath = "document.docx";
Document doc = new Document(filePath);
// Steps 2, 3, and 4 will be added below
}
}
Explanation: This code loads the specified Word document into memory for further processing.
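Loading can fail if the file is missing, encrypted, or not a Word format at all. The following is a minimal sketch of a pre-check, assuming the same document.docx sample file; the class name LoadCheck is only illustrative. It uses FileFormatUtil.DetectFileFormat to inspect the file before constructing the Document.

```csharp
using System;
using System.IO;
using Aspose.Words;

class LoadCheck // illustrative name, not part of the original program
{
    static void Main()
    {
        string filePath = "document.docx"; // assumed sample file

        if (!File.Exists(filePath))
        {
            Console.WriteLine($"File not found: {Path.GetFullPath(filePath)}");
            return;
        }

        // Inspect the file format before building the full document model
        FileFormatInfo info = FileFormatUtil.DetectFileFormat(filePath);
        Console.WriteLine($"Detected format: {info.LoadFormat}, encrypted: {info.IsEncrypted}");

        if (info.LoadFormat == LoadFormat.Unknown || info.IsEncrypted)
        {
            Console.WriteLine("Skipping: unrecognized or password-protected file.");
            return;
        }

        Document doc = new Document(filePath);
        Console.WriteLine($"Loaded {doc.Sections.Count} section(s).");
    }
}
```

Whether to skip or prompt for a password on encrypted files is a design choice; this sketch simply skips them.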
2. Extract Text from the Document
using System;
using Aspose.Words;
class Program
{
static void Main()
{
string filePath = "document.docx";
Document doc = new Document(filePath);
// Step 2: Extract Text
string text = doc.GetText();
Console.WriteLine("Extracted Text: " + text);
// Steps 3 and 4 will be added below
}
}
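Explanation: This code extracts the text of the whole document and prints it to the console. Note that GetText() returns the raw text, including control characters such as paragraph and page breaks, so the output may need trimming before further use.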
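If the raw output of GetText() is not what you need, two variations may help. This is a sketch, assuming the same document.docx; the class name TextExtraction is illustrative. It trims the raw text, converts the document to plain text with ToString(SaveFormat.Text), and walks the paragraphs one at a time.

```csharp
using System;
using Aspose.Words;

class TextExtraction // illustrative name
{
    static void Main()
    {
        Document doc = new Document("document.docx"); // assumed sample file

        // GetText() keeps control characters; Trim() strips leading and trailing whitespace
        string raw = doc.GetText().Trim();
        Console.WriteLine($"Raw text length: {raw.Length}");

        // ToString(SaveFormat.Text) exports the document as plain text
        string plain = doc.ToString(SaveFormat.Text).Trim();
        Console.WriteLine($"Plain text length: {plain.Length}");

        // Paragraph-by-paragraph extraction, skipping empty paragraphs
        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            string line = para.ToString(SaveFormat.Text).Trim();
            if (line.Length > 0)
                Console.WriteLine(line);
        }
    }
}
```

The per-paragraph loop is useful for indexing, where each paragraph can be stored or searched as a separate unit.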
3. Extract Metadata from the Document
using System;
using Aspose.Words;
class Program
{
static void Main()
{
string filePath = "document.docx";
Document doc = new Document(filePath);
string text = doc.GetText();
Console.WriteLine("Extracted Text: " + text);
// Step 3: Extract Metadata
Console.WriteLine("Title: " + doc.BuiltInDocumentProperties.Title);
Console.WriteLine("Author: " + doc.BuiltInDocumentProperties.Author);
Console.WriteLine("Created Date: " + doc.BuiltInDocumentProperties.CreatedTime);
// Step 4 will be added below
}
}
Explanation: This code extracts and prints the title, author, and creation date metadata from the Word document.
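Beyond the three properties above, a document may carry other built-in and custom properties. The sketch below, assuming the same document.docx and an illustrative class name MetadataDump, enumerates both collections and substitutes a placeholder when the title is empty.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Properties;

class MetadataDump // illustrative name
{
    static void Main()
    {
        Document doc = new Document("document.docx"); // assumed sample file

        // Built-in properties can be empty, so fall back to a placeholder
        string title = doc.BuiltInDocumentProperties.Title;
        Console.WriteLine("Title: " + (string.IsNullOrEmpty(title) ? "(not set)" : title));

        // Enumerate every built-in property
        foreach (DocumentProperty prop in doc.BuiltInDocumentProperties)
            Console.WriteLine($"{prop.Name}: {prop.Value}");

        // Custom properties are stored in a separate collection
        foreach (DocumentProperty prop in doc.CustomDocumentProperties)
            Console.WriteLine($"{prop.Name} ({prop.Type}): {prop.Value}");
    }
}
```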
4. Extract Images from the Document
using System;
using Aspose.Words;
using Aspose.Words.Drawing; // Shape is defined in this namespace
class Program
{
static void Main()
{
string filePath = "document.docx";
Document doc = new Document(filePath);
string text = doc.GetText();
Console.WriteLine("Extracted Text: " + text);
Console.WriteLine("Title: " + doc.BuiltInDocumentProperties.Title);
Console.WriteLine("Author: " + doc.BuiltInDocumentProperties.Author);
Console.WriteLine("Created Date: " + doc.BuiltInDocumentProperties.CreatedTime);
// Step 4: Extract Images
int imageCount = 0;
foreach (var shape in doc.GetChildNodes(NodeType.Shape, true))
{
if (shape is Shape { HasImage: true } imageShape)
{
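// Save() writes the stored image bytes, so match the extension to the image type
string extension = FileFormatUtil.ImageTypeToExtension(imageShape.ImageData.ImageType);
string imageFilePath = $"Image_{++imageCount}{extension}";
imageShape.ImageData.Save(imageFilePath);
Console.WriteLine($"Saved Image: {imageFilePath}");
}
}
Console.WriteLine("Content extraction completed.");
}
}
Explanation: This code saves every image in the document to the current working directory (the project directory when you use dotnet run). ImageData.Save writes each image in its stored format, so the file extension is derived from the image type rather than fixed to .png.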
5. Test the Solution
- Ensure document.docx is in the project directory.
- Run the program and verify:
- Extracted text in the console output.
- Metadata details printed.
- Extracted images saved in the project folder.
How to Deploy and Run on Major Platforms
Windows
- Install the .NET runtime and deploy the application.
- Test the application by running it via the command line.
Linux
- Install the .NET runtime.
- Run the application from the terminal, either directly or as part of a scheduled or server-side job.
macOS
- Install the .NET runtime, then run the application from the terminal or deploy it to a cloud service.
Common Issues and Fixes
- Images Not Extracted:
- Ensure the document contains embedded images and not externally linked ones.
- Metadata Missing:
- Verify that the document has metadata properties like Title or Author set.
- Large File Processing:
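- Aspose.Words loads the whole document into memory, so keep the extra allocations small: work through the content in parts (for example, section by section) instead of building one large string with GetText().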
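The checks in the list above can be sketched as follows, assuming the same document.docx; the class name IssueChecks is illustrative. The code reports images that are linked but not stored in the document, and reads text one section at a time. The section loop does not reduce the memory needed to load the document; it only avoids holding the full text in a single string.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Drawing;

class IssueChecks // illustrative name
{
    static void Main()
    {
        Document doc = new Document("document.docx"); // assumed sample file

        // Report images that are linked only, since their bytes are not in the file
        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            if (shape.HasImage && shape.ImageData.IsLinkOnly)
                Console.WriteLine($"Linked image (not embedded): {shape.ImageData.SourceFullName}");
        }

        // Read the text section by section instead of as one large string
        int index = 0;
        foreach (Section section in doc.Sections)
        {
            string sectionText = section.Body?.ToString(SaveFormat.Text) ?? string.Empty;
            Console.WriteLine($"Section {++index}: {sectionText.Length} characters");
        }
    }
}
```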
With this guide, you can programmatically extract valuable content from Word documents using Aspose.Words for .NET.