Parsing large docx file in Java
I have a 200-page docx file that I need to parse, but the data I need is contained within the first 20 or so pages. Does Apache POI have a way to retrieve just part of the document? It seems like the only way to get the data out of a docx file with Apache POI is using getParagraphs() or getText(), and I don't really want an enormous String or List of paragraphs when I only need the first few pages. Any suggestions?
Since a *.docx file is simply a ZIP archive, we could also open it as a FileSystem obtained from FileSystems and then process its content completely independently of third-party libraries.
This is a very basic example using StAX:
import java.io.InputStream;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class UnZipAndReadOOXMLFileSystem {

    public static void main(String[] args) throws Exception {
        Path source = Paths.get("source.docx");
        String contentSearched = "the content we are searching for";
        StringBuilder content = new StringBuilder();

        // The (ClassLoader) cast avoids an ambiguous-overload compile error on Java 13+.
        // try-with-resources closes the stream and the ZIP file system even on an exception.
        try (FileSystem fs = FileSystems.newFileSystem(source, (ClassLoader) null);
             InputStream in = Files.newInputStream(fs.getPath("/word/document.xml"))) {

            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            boolean inParagraph = false;
            StringBuilder paragraphText = new StringBuilder();

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    if (event.asStartElement().getName().getLocalPart().equalsIgnoreCase("p")) { // start element of paragraph
                        inParagraph = true;
                        content.append("<p>");
                        paragraphText.setLength(0);
                    }
                } else if (event.isCharacters() && inParagraph) { // characters in elements of this paragraph
                    // the text of one paragraph can be split across several run elements
                    paragraphText.append(event.asCharacters().getData());
                } else if (event.isEndElement() && inParagraph) {
                    if (event.asEndElement().getName().getLocalPart().equalsIgnoreCase("p")) { // end element of paragraph
                        inParagraph = false;
                        content.append(paragraphText).append("</p>\r\n");
                        // here you can check the paragraph text and exit the loop once you found what you are searching for
                        if (paragraphText.toString().contains(contentSearched)) break;
                    }
                }
            }
        }
        System.out.println(content);
    }
}
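If the goal is "only the beginning of the document" rather than "everything up to a search string", the same StAX idea can stop after a fixed number of paragraphs. The following is a minimal sketch, not part of the original answer: the class name FirstParagraphs, the method firstParagraphs and the maxParagraphs parameter are made up for illustration, and it assumes the transitional WordprocessingML namespace shown in W_NS. It counts paragraphs instead of pages because page boundaries depend on layout, and it does not handle paragraphs nested inside other paragraphs (for example text boxes).

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FirstParagraphs {

    // Assumed namespace of the main WordprocessingML elements (w:p, w:t, ...).
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Reads document.xml as a stream and returns the text of at most maxParagraphs paragraphs. */
    static List<String> firstParagraphs(InputStream documentXml, int maxParagraphs)
            throws XMLStreamException {
        List<String> paragraphs = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(documentXml);
        StringBuilder text = new StringBuilder();
        boolean inText = false; // true while inside a w:t element
        try {
            // Stop pulling events as soon as enough paragraphs were collected.
            while (reader.hasNext() && paragraphs.size() < maxParagraphs) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT && W_NS.equals(reader.getNamespaceURI())) {
                    if (reader.getLocalName().equals("p")) {
                        text.setLength(0);          // a new paragraph starts
                    } else if (reader.getLocalName().equals("t")) {
                        inText = true;              // a text run starts
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && inText) {
                    text.append(reader.getText());  // runs of one paragraph are concatenated
                } else if (event == XMLStreamConstants.END_ELEMENT && W_NS.equals(reader.getNamespaceURI())) {
                    if (reader.getLocalName().equals("t")) {
                        inText = false;
                    } else if (reader.getLocalName().equals("p")) {
                        paragraphs.add(text.toString());
                    }
                }
            }
        } finally {
            reader.close();
        }
        return paragraphs;
    }
}
```

It can be fed with the stream from the ZIP file system used above, for example firstParagraphs(Files.newInputStream(fs.getPath("/word/document.xml")), 100). The rest of document.xml is never parsed once the limit is reached.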
Not possible with POI.
If you want to read in a buffered mode, you can convert your docx file to XML and then read it line by line, extracting the text you need (pretty low level).
docx files are zipped XML; you can open them with WinRAR and inspect them.
Doing this for a 200-page file does not seem worth it unless you have very little memory.
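To inspect the archive from Java instead of WinRAR, the standard java.util.zip classes are enough. This is only a rough sketch: the class name DocxEntries and the method listEntries are invented for illustration, and it simply lists the entry names so you can see where the XML parts live (the main text is normally in word/document.xml).

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxEntries {

    /** Returns the names of all entries in the docx (ZIP) archive, sorted alphabetically. */
    static List<String> listEntries(Path docx) throws IOException {
        // ZipFile reads the central directory only; entry contents are not decompressed here.
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            return zip.stream()
                      .map(ZipEntry::getName)
                      .sorted()
                      .collect(Collectors.toList());
        }
    }
}
```

Calling listEntries(Paths.get("source.docx")) returns the entry names; a single part can then be opened with ZipFile.getInputStream and read as a stream rather than loaded as a whole.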