Parsing large docx file in Java

I have a 200-page docx file that I need to parse, but the data I need is contained within the first 20 or so pages. Does Apache POI have a way to retrieve just part of the document? It seems like the only way to get the data out of a docx file with Apache POI is using getParagraphs() or getText(), and I don't really want an enormous String or List of paragraphs when I only need the first few pages. Any suggestions?

Since a *.docx is simply a ZIP archive, we could also open it as a FileSystem obtained from FileSystems and then process its content independently of third-party libraries.

This is a very basic example using StAX.

import java.io.*;
import java.nio.file.*;

import javax.xml.stream.*;
import javax.xml.stream.events.*;

import javax.xml.namespace.QName;

public class UnZipAndReadOOXMLFileSystem {

 public static void main (String[] args) throws Exception {

  Path source = Paths.get("source.docx");

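  // the cast selects the (Path, ClassLoader) overload; a bare null is ambiguous on Java 13+
  FileSystem fs = FileSystems.newFileSystem(source, (ClassLoader) null);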

  Path document = fs.getPath("/word/document.xml");

  XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(document));

  StringBuilder content = new StringBuilder();

  String contentSearched = "the content we are searching for";

  boolean inParagraph = false;
  String paragraphText = "";
  while(reader.hasNext()) {
   XMLEvent event = reader.nextEvent();
   if(event.isStartElement()){
    StartElement startElement = (StartElement)event;
    QName startElementName = startElement.getName();  
    if(startElementName.getLocalPart().equalsIgnoreCase("p")) { //start element of paragraph
     inParagraph = true;
     content.append("<p>");
     paragraphText = "";
    }
   } else if (event.isCharacters() && inParagraph) { //characters in elements of this paragraph
    String characters = event.asCharacters().getData();
    paragraphText += characters; // the text of one paragraph can be split across several run elements
   } else if (event.isEndElement() && inParagraph) {
    EndElement endElement = (EndElement)event;
    QName endElementName = endElement.getName();  
    if(endElementName.getLocalPart().equalsIgnoreCase("p")) { //end element of paragraph
     inParagraph = false;
     content.append(paragraphText);
     content.append("</p>\r\n");
     //here you can check the paragraphText and exit the while if you found what you are searching for
     if (paragraphText.contains(contentSearched)) break;
    }
   }
  }

  System.out.println(content);

  fs.close();

 }
}
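
Building on the example above, the loop can also stop after a fixed number of paragraphs instead of searching for a marker string. The following is a minimal sketch, not a definitive implementation: the class name FirstParagraphs, the method read and the parameter maxParagraphs are made up for illustration, and counting paragraphs is an assumption standing in for "the first 20 pages", since page boundaries are normally computed by the word processor at layout time. It only collects text from w:t elements in the main WordprocessingML namespace and does not handle paragraphs nested inside other paragraphs (for example in text boxes).

```java
import java.io.InputStream;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class FirstParagraphs {

    // Namespace of the main WordprocessingML elements (w:p, w:t, ...).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of at most maxParagraphs paragraphs, then stops parsing.
    static List<String> read(Path docx, int maxParagraphs) throws Exception {
        List<String> paragraphs = new ArrayList<>();
        if (maxParagraphs <= 0) {
            return paragraphs;
        }
        try (FileSystem fs = FileSystems.newFileSystem(docx, (ClassLoader) null);
             InputStream in = Files.newInputStream(fs.getPath("/word/document.xml"))) {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
            StringBuilder text = new StringBuilder();
            boolean inText = false; // true while inside a w:t element
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && W_NS.equals(xml.getNamespaceURI())) {
                    if (xml.getLocalName().equals("p")) {
                        text.setLength(0); // a new paragraph starts
                    } else if (xml.getLocalName().equals("t")) {
                        inText = true;
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && inText) {
                    text.append(xml.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && W_NS.equals(xml.getNamespaceURI())) {
                    if (xml.getLocalName().equals("t")) {
                        inText = false;
                    } else if (xml.getLocalName().equals("p")) {
                        paragraphs.add(text.toString());
                        if (paragraphs.size() >= maxParagraphs) {
                            break; // the rest of the document is never parsed
                        }
                    }
                }
            }
            xml.close();
        }
        return paragraphs;
    }
}
```

A call such as FirstParagraphs.read(Paths.get("source.docx"), 100) would return at most 100 paragraph strings; the limit of 100 is an arbitrary value that would need tuning to cover the pages of interest.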

Not possible with POI.

If you want to read in a buffered mode, what you can do is convert your docx file to XML and then read it line by line, extracting the text you need (pretty low level).
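
One way to do the "convert to XML" step is to copy the main document part out of the archive. This is only a sketch under stated assumptions: the class name ExtractDocumentXml and the method extract are hypothetical, and it assumes the text lives in the entry word/document.xml, as in the StAX example. Word usually writes that file with very few line breaks, so a streaming XML reader is likely a better fit than literal line-by-line reading.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ExtractDocumentXml {

    // Copies word/document.xml out of the docx archive into target
    // and returns the number of bytes written.
    static long extract(Path docx, Path target) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new FileNotFoundException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // the entry is decompressed as a stream, not loaded into memory at once
                return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```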

docx files are zipped XML; you can open them with WinRAR and inspect them.
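
The same inspection can be done from Java with the standard java.util.zip classes. The sketch below is illustrative only; the class name ListDocxEntries and the method entryNames are made up, and which entries show up depends on the document.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ListDocxEntries {

    // Returns the names of all entries stored in the docx (ZIP) archive.
    static List<String> entryNames(Path docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```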

Doing this for a 200-page file does not seem worth it unless you have very little memory.
