How to read text in an XSLFGraphicFrame using Apache POI for PowerPoint

Question

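I am writing a Java program that finds occurrences of specific keywords in documents. I want to read many file formats, including all Microsoft Office documents.

I have all of them working except PowerPoint, using Apache POI code snippets found on StackOverflow and other sources. I found that every slide is made up of shapes (XSLFTextShape), but many of them are objects of class XSLFGraphicFrame or XSLFTable, for which I can't simply use the toString() method. How can I extract all the text they contain using Java? Here is a piece of code/pseudocode: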

File f = new File("C:\\Users\\Windows\\Desktop\\Modulo 9.pptx");
PrintStream out = System.out;

// try-with-resources closes both the stream and the slide show
try (FileInputStream is = new FileInputStream(f);
     XMLSlideShow ppt = new XMLSlideShow(is)) {
    for (XSLFSlide slide : ppt.getSlides()) {
        for (XSLFShape shape : slide) {
            if (shape instanceof XSLFTextShape) {
                XSLFTextShape txShape = (XSLFTextShape) shape;
                out.println(txShape.getText());
            } else if (shape instanceof XSLFPictureShape) {
                // do nothing
            } else if (shape instanceof XSLFGraphicFrame) {
                // XSLFTable is a subclass of XSLFGraphicFrame, so tables land here too
                // print all text in it or in its children
            }
        }
    }
}

Answer 1

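If your requirement to "find occurrences of specific keywords in documents" only needs a search across all the text content of a slide show, then simply using SlideShowExtractor may be one approach. It also serves as the entry point to a POITextExtractor for the text content of the document metadata/properties, such as author and title.

Example: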

import java.io.FileInputStream;

import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.extractor.SlideShowExtractor;

import org.apache.poi.extractor.POITextExtractor;

public class SlideShowExtractorExample {

 public static void main(String[] args) throws Exception {

  SlideShow<XSLFShape,XSLFTextParagraph> slideshow 
   = new XMLSlideShow(new FileInputStream("Performance_Out.pptx"));

  SlideShowExtractor<XSLFShape,XSLFTextParagraph> slideShowExtractor 
   = new SlideShowExtractor<XSLFShape,XSLFTextParagraph>(slideshow);
  slideShowExtractor.setCommentsByDefault(true);
  slideShowExtractor.setMasterByDefault(true);
  slideShowExtractor.setNotesByDefault(true);

  String allTextContentInSlideShow = slideShowExtractor.getText();

  System.out.println(allTextContentInSlideShow);

  System.out.println("===========================================================================");

  POITextExtractor textExtractor = slideShowExtractor.getMetadataTextExtractor();
  String metaData = textExtractor.getText();

  System.out.println(metaData);

 }
}
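
Once all the text is available as a single string, the keyword search from the question is plain string processing. The following is only a minimal sketch of that step: the class name KeywordCounter, the method countOccurrences, and the sample text are illustrative assumptions and not part of Apache POI, and the sample string stands in for whatever slideShowExtractor.getText() returns.

```java
import java.util.Locale;

public class KeywordCounter {

    // Counts non-overlapping, case-insensitive occurrences of keyword in text.
    // Hypothetical helper for illustration; not an Apache POI API.
    static int countOccurrences(String text, String keyword) {
        if (text == null || keyword == null || keyword.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length(); // skip past this match
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by slideShowExtractor.getText().
        String extracted = "Module 9\nPerformance overview\nperformance by region";
        System.out.println(countOccurrences(extracted, "performance"));
    }
}
```

This is a plain substring match, so "form" would also be counted inside "performance"; a word-boundary regular expression may be preferable if only whole words should count.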

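Of course, there are some kinds of XSLFGraphicFrame that SlideShowExtractor cannot read because apache poi does not support them so far, for example all kinds of SmartArt graphics. Their text content is stored in /ppt/diagrams/data*.xml document parts that are referenced from the slides. Since apache poi does not support these so far, they can only be read using low-level underlying methods.

For example, to additionally get all text out of all /ppt/diagrams/data parts, which are the texts in SmartArt graphics, we could do this: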

...
  System.out.println("===========================================================================");

  // additionally get all text out of all /ppt/diagrams/data which are texts in SmartArt graphics:
  StringBuilder sb = new StringBuilder();
  for (XSLFSlide slide : ((XMLSlideShow)slideshow).getSlides()) {
   for (org.apache.poi.ooxml.POIXMLDocumentPart part : slide.getRelations()) {
    if (part.getPackagePart().getPartName().getName().startsWith("/ppt/diagrams/data")) {
     org.apache.xmlbeans.XmlObject xmlObject = org.apache.xmlbeans.XmlObject.Factory.parse(part.getPackagePart().getInputStream());
     org.apache.xmlbeans.XmlCursor cursor = xmlObject.newCursor();
     // walk every XML token and collect the text tokens
     while (cursor.hasNextToken()) {
      if (cursor.isText()) {
       sb.append(cursor.getTextValue()).append("\r\n");
      }
      cursor.toNextToken();
     }
     // number of the slide that the texts above were taken from
     sb.append(slide.getSlideNumber()).append("\r\n\r\n");
    }
   }
  }
  String allTextContentInDiagrams = sb.toString();

  System.out.println(allTextContentInDiagrams);
...
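
Because a .pptx file is a ZIP archive, the same diagram data parts can also be reached without Apache POI. The following is only a rough sketch of that alternative, not the method used above. It assumes Java 9 or later (for readAllBytes) and assumes the SmartArt text sits in DrawingML a:t elements. The names DiagramTextReader, readDiagramTexts, and sampleArchive are made up for illustration, and the tiny in-memory archive only mimics a real diagram data part so that the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DiagramTextReader {

    static final String DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Returns the content of every DrawingML <a:t> element found in the
    // ppt/diagrams/data*.xml entries of the given .pptx (ZIP) stream.
    static List<String> readDiagramTexts(InputStream pptx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        List<String> texts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (!name.startsWith("ppt/diagrams/data") || !name.endsWith(".xml")) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the ZIP stream.
                byte[] xml = zip.readAllBytes();
                Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
                NodeList runs = doc.getElementsByTagNameNS(DRAWINGML_NS, "t");
                for (int i = 0; i < runs.getLength(); i++) {
                    texts.add(runs.item(i).getTextContent());
                }
            }
        }
        return texts;
    }

    // Builds a tiny in-memory ZIP that mimics one diagram data part (demo data only).
    static byte[] sampleArchive() throws IOException {
        String xml = "<dgm:dataModel"
            + " xmlns:dgm=\"http://schemas.openxmlformats.org/drawingml/2006/diagram\""
            + " xmlns:a=\"" + DRAWINGML_NS + "\"><dgm:ptLst>"
            + "<dgm:pt><dgm:t><a:p><a:r><a:t>Plan</a:t></a:r></a:p></dgm:t></dgm:pt>"
            + "<dgm:pt><dgm:t><a:p><a:r><a:t>Build</a:t></a:r></a:p></dgm:t></dgm:pt>"
            + "</dgm:ptLst></dgm:dataModel>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("ppt/diagrams/data1.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        List<String> texts = readDiagramTexts(new ByteArrayInputStream(sampleArchive()));
        System.out.println(texts);
    }
}
```

For a real file, a FileInputStream on the .pptx can be passed to readDiagramTexts instead of the sample archive. Unlike the loop over slide.getRelations(), this scans every diagram data part in the package and does not tell which slide a text belongs to.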

Solution 1
1 2018-12-31 05:39:59