Extract Media from Word Documents

# How to Extract Text, Images, and Metadata from Word Documents in .NET

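Extracting text, images, and metadata from Word documents is essential for document analysis and processing. With Aspose.Words for .NET, developers can programmatically retrieve a document's content and properties for uses such as indexing, archiving, or content conversion.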

## Prerequisites

1. Install the .NET SDK.
2. Add the Aspose.Words NuGet package: dotnet add package Aspose.Words
3. Prepare a Word document (document.docx) that contains text, images, and metadata.

## Step-by-Step Guide to Extracting Content from Word Files

### 1. Load the Word Document

using System;
using Aspose.Words;

class Program
{
    static void Main()
    {
        // Step 1: Load the Word document
        string filePath = "document.docx";
        Document doc = new Document(filePath);

        // Steps 2, 3, and 4 will be added below
    }
}

Explanation: This code loads the specified Word document into memory for further processing.
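Loading can fail when the file is missing, password-protected, or not a Word format at all. The following is a minimal sketch, assuming the same document.docx path, that inspects the file with FileFormatUtil.DetectFileFormat before constructing the Document. The TryLoad helper and the LoadExample class are hypothetical names used only for this illustration.

```csharp
using System;
using System.IO;
using Aspose.Words;

class LoadExample
{
    // Hypothetical helper for illustration: returns null instead of throwing
    // when the file is missing, encrypted, or in an unrecognized format.
    static Document? TryLoad(string filePath)
    {
        if (!File.Exists(filePath))
        {
            Console.WriteLine($"File not found: {filePath}");
            return null;
        }

        // Inspect the file header without loading the whole document.
        FileFormatInfo info = FileFormatUtil.DetectFileFormat(filePath);
        if (info.LoadFormat == LoadFormat.Unknown)
        {
            Console.WriteLine("Unrecognized document format.");
            return null;
        }
        if (info.IsEncrypted)
        {
            Console.WriteLine("The document is password-protected.");
            return null;
        }

        return new Document(filePath);
    }

    static void Main()
    {
        Document? doc = TryLoad("document.docx");
        if (doc != null)
            Console.WriteLine($"Loaded {doc.Sections.Count} section(s).");
    }
}
```

Checking up front keeps the later extraction steps free of error handling, at the cost of reading the file header twice.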

### 2. Extract Text from the Document

using System;
using Aspose.Words;

class Program
{
    static void Main()
    {
        string filePath = "document.docx";
        Document doc = new Document(filePath);

        // Step 2: Extract Text
        string text = doc.GetText();
        Console.WriteLine("Extracted Text: " + text);

        // Steps 3 and 4 will be added below
    }
}

Explanation: This code extracts all text content from the loaded Word document and prints it to the console.
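GetText() returns the document text together with control characters such as paragraph and field markers, which may be unwanted when the text is fed to an index. As a rough sketch under the same document.docx assumption, the text can instead be exported with ToString(SaveFormat.Text), either for the whole document or paragraph by paragraph. The TextExample class name is illustrative only.

```csharp
using System;
using Aspose.Words;

class TextExample
{
    static void Main()
    {
        Document doc = new Document("document.docx");

        // Plain-text export of the whole document.
        string plainText = doc.ToString(SaveFormat.Text);
        Console.WriteLine(plainText);

        // Walk the paragraphs individually and skip the empty ones.
        int index = 0;
        foreach (Paragraph paragraph in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            string line = paragraph.ToString(SaveFormat.Text).Trim();
            if (line.Length > 0)
                Console.WriteLine($"[{++index}] {line}");
        }
    }
}
```

The per-paragraph loop is useful when each paragraph needs to be stored or indexed as a separate record.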

### 3. Extract Metadata from the Document

using System;
using Aspose.Words;

class Program
{
    static void Main()
    {
        string filePath = "document.docx";
        Document doc = new Document(filePath);

        string text = doc.GetText();
        Console.WriteLine("Extracted Text: " + text);

        // Step 3: Extract Metadata
        Console.WriteLine("Title: " + doc.BuiltInDocumentProperties.Title);
        Console.WriteLine("Author: " + doc.BuiltInDocumentProperties.Author);
        Console.WriteLine("Created Date: " + doc.BuiltInDocumentProperties.CreatedTime);

        // Step 4 will be added below
    }
}

Explanation: This code extracts and prints the title, author, and creation date metadata from the Word document.
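The three properties above are only a sample of what a document may carry. The sketch below, again assuming document.docx, enumerates every built-in property that has a value and then any custom properties. The MetadataExample class name is an assumption made for this example.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Properties;

class MetadataExample
{
    static void Main()
    {
        Document doc = new Document("document.docx");

        Console.WriteLine("Built-in properties:");
        foreach (DocumentProperty property in doc.BuiltInDocumentProperties)
        {
            // Skip properties without a value to keep the output short.
            string value = property.Value?.ToString() ?? string.Empty;
            if (value.Length > 0)
                Console.WriteLine($"  {property.Name}: {value}");
        }

        Console.WriteLine("Custom properties:");
        foreach (DocumentProperty property in doc.CustomDocumentProperties)
            Console.WriteLine($"  {property.Name} ({property.Type}): {property.Value}");
    }
}
```

Properties whose value is an array or binary data will print their type name here, so a real indexer would likely format those separately.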

### 4. Extract Images from the Document

using System;
using Aspose.Words;
using Aspose.Words.Drawing; // Shape lives in this namespace

class Program
{
    static void Main()
    {
        string filePath = "document.docx";
        Document doc = new Document(filePath);

        string text = doc.GetText();
        Console.WriteLine("Extracted Text: " + text);

        Console.WriteLine("Title: " + doc.BuiltInDocumentProperties.Title);
        Console.WriteLine("Author: " + doc.BuiltInDocumentProperties.Author);
        Console.WriteLine("Created Date: " + doc.BuiltInDocumentProperties.CreatedTime);

        // Step 4: Extract Images
        int imageCount = 0;
        foreach (var shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            if (shape is Shape { HasImage: true } imageShape)
            {
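                // ImageData.Save writes the stored bytes unchanged, so name
                // the file after the actual image format instead of assuming PNG.
                string extension = FileFormatUtil.ImageTypeToExtension(imageShape.ImageData.ImageType);
                string imageFilePath = $"Image_{++imageCount}{extension}";
                imageShape.ImageData.Save(imageFilePath);
                Console.WriteLine($"Saved Image: {imageFilePath}");
            }
        }

        Console.WriteLine("Content extraction completed.");
    }
}

Explanation: This code extracts every image from the Word document and saves it to the current working directory (the project directory when the program is started with dotnet run). ImageData.Save writes each image in its stored format, so the file extension is derived from the image type rather than fixed to .png.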
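A shape can also reference an image by link without storing its bytes in the document, in which case there is nothing to save. The following sketch, under the same document.docx assumption, skips link-only images and writes the rest to a dedicated folder. The extracted-images folder name and the ImageExample class are hypothetical choices for this illustration.

```csharp
using System;
using System.IO;
using Aspose.Words;
using Aspose.Words.Drawing;

class ImageExample
{
    static void Main()
    {
        Document doc = new Document("document.docx");

        // Hypothetical output folder; created if it does not exist yet.
        string outputDir = Path.Combine(Directory.GetCurrentDirectory(), "extracted-images");
        Directory.CreateDirectory(outputDir);

        int saved = 0;
        foreach (Shape shape in doc.GetChildNodes(NodeType.Shape, true))
        {
            if (!shape.HasImage)
                continue;

            // A link-only image has no bytes stored in the document.
            if (shape.ImageData.IsLinkOnly)
            {
                Console.WriteLine($"Skipped linked image: {shape.ImageData.SourceFullName}");
                continue;
            }

            // Fall back to a neutral extension when the format is not recognized.
            ImageType type = shape.ImageData.ImageType;
            string extension = (type == ImageType.Unknown || type == ImageType.NoImage)
                ? ".bin"
                : FileFormatUtil.ImageTypeToExtension(type);

            string path = Path.Combine(outputDir, $"Image_{++saved}{extension}");
            shape.ImageData.Save(path);
            Console.WriteLine($"Saved: {path}");
        }

        Console.WriteLine($"{saved} image(s) extracted.");
    }
}
```

Writing to a separate folder keeps repeated runs from cluttering the project directory.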

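### 5. Test the Solution

1. Make sure document.docx is in the project directory.
2. Run the program and verify that:
   - the extracted text appears in the console output;
   - the metadata is printed;
   - the extracted images are saved in the project folder.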

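## How to Deploy and Run on Major Platforms

### Windows

1. Install the .NET runtime and deploy the application.
2. Test it by running the application from the command line.

### Linux

1. Install the .NET runtime.
2. Run the app with terminal commands or host it on a server.

### macOS

1. Run the app with Kestrel (when it is hosted as an ASP.NET Core application) or deploy it to a cloud service.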

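## Common Issues and Fixes

1. Images not extracted: make sure the document contains embedded images rather than externally linked ones.
2. Missing metadata: make sure the document actually has metadata properties, such as a title or an author, set.
3. Large file processing: use a memory-efficient approach, for example processing specific sections of the document.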
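The section-by-section approach for large files might look like the sketch below. It assumes document.docx and passes a LoadOptions with TempFolder set so that Aspose.Words may use temporary files while reading. Whether this lowers memory use noticeably depends on the document, and the document is still loaded as a whole. The SectionExample class name is illustrative only.

```csharp
using System;
using System.IO;
using Aspose.Words;
using Aspose.Words.Loading;

class SectionExample
{
    static void Main()
    {
        // Assumption: the system temp folder is writable and may be used
        // for temporary files during loading.
        var loadOptions = new LoadOptions { TempFolder = Path.GetTempPath() };
        Document doc = new Document("document.docx", loadOptions);

        // Handle one section at a time so that only a single section's
        // text is held as a string at any moment.
        for (int i = 0; i < doc.Sections.Count; i++)
        {
            Section section = doc.Sections[i];
            string text = section.Body?.ToString(SaveFormat.Text) ?? string.Empty;
            Console.WriteLine($"Section {i + 1}: {text.Length} characters");
        }
    }
}
```

In a real pipeline the Console.WriteLine call would be replaced by whatever consumes the section text, such as an indexer or a file writer.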

With this guide, you can programmatically extract valuable content from Word documents using Aspose.Words for .NET.