Reading a Word (.docx) file in Java

Question

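I have a Word document generated with docx4j. When I unzip the docx file, the folder contents are

The contents of ./word/document.xml are as follows

The relationships XML contains the following relationship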

<Relationship Target="../chunk.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId3"/>

When I unzip chunk.docx, it has the following file contents

and its ./word/document.xml has the following content

Its relationships XML file has the following content

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Target="styles.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Id="rId1"/>
<Relationship Target="settings.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Id="rId2"/>
<Relationship Target="../chunk.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId3"/>
<Relationship Target="../chunk2.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId4"/>
<Relationship Target="../chunk3.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId5"/>
<Relationship Target="../chunk4.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId6"/>
<Relationship Target="../chunk5.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId7"/>

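Likewise, when I unzip chunk.docx, it has the following file contents

and its ./word/document.xml has the following content

How can I read the contents of this Word document from Java code?

I tried the following: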

File docxFile = new File(filePath);
WordprocessingMLPackage wordprocessingMLPackage = WordprocessingMLPackage.load(docxFile);
MainDocumentPart mainDocumentPart = wordprocessingMLPackage.getMainDocumentPart();
List<Object> textNodes = mainDocumentPart.getJAXBNodesViaXPath(TEXT_NODEX_XPATH, true);

But it returns 0 text nodes. Can anyone help me read this kind of Word docx using Java?

Answer 1

Your docx contains altChunks of type docx.
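One way to confirm this is to look for relationships of type aFChunk in the relationships part quoted in the question. The following is a minimal sketch that uses only the JDK's DOM parser, not docx4j; the class name AltChunkRelationships and the method findAltChunkTargets are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of docx4j): lists altChunk targets in a .rels part.
final class AltChunkRelationships {
    private static final String RELS_NS =
            "http://schemas.openxmlformats.org/package/2006/relationships";
    private static final String AF_CHUNK_TYPE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk";

    /** Returns the Target attribute of every Relationship whose Type is aFChunk. */
    static List<String> findAltChunkTargets(String relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(relsXml.getBytes(StandardCharsets.UTF_8)));

        NodeList rels = doc.getElementsByTagNameNS(RELS_NS, "Relationship");
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if (AF_CHUNK_TYPE.equals(rel.getAttribute("Type"))) {
                targets.add(rel.getAttribute("Target"));
            }
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        String relsXml =
                "<Relationships xmlns=\"" + RELS_NS + "\">"
              + "<Relationship Target=\"styles.xml\" Type=\"http://schemas.openxmlformats.org/"
              + "officeDocument/2006/relationships/styles\" Id=\"rId1\"/>"
              + "<Relationship Target=\"../chunk.docx\" Type=\"" + AF_CHUNK_TYPE + "\" Id=\"rId3\"/>"
              + "</Relationships>";
        System.out.println(findAltChunkTargets(relsXml));
    }
}
```

If the returned list is non-empty, the visible text probably lives in the embedded packages rather than in the main document part, which would explain an XPath query on the main part returning no text nodes.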

It contains those because whoever created it with docx4j did so explicitly, using code such as https://github.com/plutext/docx4j/blob/VERSION_11_4_7/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/AltChunkAddOfTypeDocx.java

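Normally you wouldn't do that.

Typically, if you want to process a docx like this with something like XPath, you first need to convert those altChunks to normal content. Word can do this, and so can Docx4j Enterprise.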
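If only the raw text is needed, a different workaround is to treat each embedded chunk as a docx package in its own right and read it recursively. This is not the conversion described above, and it does not use docx4j: it is a JDK-only sketch that assumes the chunks are stored as .docx entries inside the outer zip, as in the question. The class name NestedDocxText and its methods are hypothetical, and InputStream.readAllBytes requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects w:t text from a docx and from any .docx nested inside it.
final class NestedDocxText {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Reads word/document.xml and recurses into every zip entry whose name ends in .docx. */
    static List<String> extractText(InputStream docx) throws Exception {
        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                byte[] bytes = zip.readAllBytes(); // reads the current entry only
                String name = entry.getName();
                if (name.equals("word/document.xml")) {
                    texts.addAll(textRuns(bytes));
                } else if (name.endsWith(".docx")) {
                    texts.addAll(extractText(new ByteArrayInputStream(bytes)));
                }
            }
        }
        return texts;
    }

    /** Returns the text content of every w:t element in a document.xml part. */
    private static List<String> textRuns(byte[] documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml));
        NodeList nodes = doc.getElementsByTagNameNS(W_NS, "t");
        List<String> runs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            runs.add(nodes.item(i).getTextContent());
        }
        return runs;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the outer docx file.
        for (String run : extractText(new FileInputStream(args[0]))) {
            System.out.println(run);
        }
    }
}
```

The text comes back in zip entry order, not in the order the altChunks appear in the document, and all formatting is lost. To preserve order, one would have to resolve each w:altChunk r:id through the relationships part instead of scanning entries.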

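However, if you control the generating application, the best approach is to revisit it and change it so that it doesn't create altChunks, or at least to understand why it was written that way.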

Answer 2

I ran into a similar problem when parsing strings from a .docx with Apache POI. You can use the Mammoth library instead. Here is the code I used: https://stackoverflow.com/a/73373053/9430422


Answer 1
1, accepted, 2022-06-01 22:40:47

Answer 2
0, 2022-08-16 11:22:49
