Extracting images from a PDF with Apache Tika

Question

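Apache Tika 1.6 can extract inline images from PDF documents. However, I have been struggling to get it to work.

My use case: I want code that extracts the content of any document (not necessarily a PDF) and separates out its images. The result is then passed into an Apache UIMA pipeline.

I have been able to extract images from other document types by using a custom parser (built on AutoParser) that converts the document to HTML and then saves the images separately. With PDFs, however, the tags that would give me access to the files do not even appear in the HTML.

Can anyone suggest how I might achieve this, ideally with some example code showing how to extract inline images from PDFs with Tika 1.6?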

Answer 1

Try the code below; after parsing, the ContentHandler holds your XML content.

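import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public ContentHandler convertPdf(byte[] content, String path, String filename)
        throws IOException, SAXException, TikaException {

    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    ContentHandler handler = new ToXMLContentHandler();
    PDFParser parser = new PDFParser();

    // Inline image extraction is off by default, so enable it explicitly.
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(true);

    parser.setPDFParserConfig(config);

    // Called once per embedded resource; writes each one to disk unparsed.
    EmbeddedDocumentExtractor embeddedDocumentExtractor =
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // path is used as a plain prefix, so it should end with a file separator.
            Path outputFile = new File(path + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            // Overwrite instead of failing when the file is left over from an earlier run.
            Files.copy(stream, outputFile, StandardCopyOption.REPLACE_EXISTING);
        }
    };

    context.set(PDFParser.class, parser);
    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);

    try (InputStream stream = new ByteArrayInputStream(content)) {
        parser.parse(stream, handler, metadata, context);
    }

    return handler;
}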
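
Once the parse has finished, handler.toString() holds the XHTML. To tie the saved files back to their position in the text, one option is to scan that XHTML for image references. The sketch below assumes Tika writes extracted images as <img src="embedded:NAME" .../>; check that against the output of your Tika version. EmbeddedImageRefs and findEmbeddedRefs are hypothetical names, not Tika API, and the code uses only the JDK. It relies on a regex rather than an XML parser, so treat it as a quick check, not robust parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
class EmbeddedImageRefs {

    // Assumption: extracted images appear as <img ... src="embedded:NAME" ...>.
    private static final Pattern IMG_REF =
            Pattern.compile("<img\\b[^>]*?\\bsrc=\"embedded:([^\"]+)\"");

    /** Returns the embedded image names in document order. */
    static List<String> findEmbeddedRefs(String xhtml) {
        List<String> names = new ArrayList<>();
        Matcher m = IMG_REF.matcher(xhtml);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String xhtml = "<p>text</p><img src=\"embedded:image0.png\" alt=\"image0.png\"/>"
                + "<p>more</p><img class=\"embedded\" src=\"embedded:image1.jpg\"/>";
        System.out.println(findEmbeddedRefs(xhtml)); // prints [image0.png, image1.jpg]
    }
}
```

An empty list for a PDF that visibly contains images suggests the inline-image setting never reached the PDF parser.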

Answer 2

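You can extract images with AutoDetectParser, without depending on PDFParser. The same code also works for extracting images from docx, pptx, and similar formats.

Here I have a parseDocument() and a setPdfConfig() function, which work with an AutoDetectParser.

1. Create an AutoDetectParser.
2. Attach an EmbeddedDocumentExtractor to a ParseContext.
3. Attach the AutoDetectParser to the same ParseContext.
4. Attach a PDFParserConfig to the same ParseContext.
5. Pass that ParseContext to AutoDetectParser.parse().

The images are saved in a folder next to the source file, named <sourceFile>_/.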

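import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Enables inline image extraction for PDFs via the shared ParseContext.
private static void setPdfConfig(ParseContext context) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    pdfConfig.setExtractUniqueInlineImagesOnly(true);

    context.set(PDFParserConfig.class, pdfConfig);
}

// Parses the file at path, saves embedded images to <path>_/ and returns the XHTML.
private static String parseDocument(String path) {
    String xhtmlContents = "";

    AutoDetectParser parser = new AutoDetectParser();
    ContentHandler handler = new ToXMLContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    EmbeddedDocumentExtractor embeddedDocumentExtractor =
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }
        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // Output folder sits next to the source file: <sourceFile>_/
            Path outputDir = new File(path + "_").toPath();
            Files.createDirectories(outputDir);

            Path outputPath = new File(outputDir.toString() + "/" + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            // Remove any file left over from an earlier run before copying.
            Files.deleteIfExists(outputPath);
            Files.copy(stream, outputPath);
        }
    };

    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
    context.set(AutoDetectParser.class, parser);

    setPdfConfig(context);

    try (InputStream stream = new FileInputStream(path)) {
        parser.parse(stream, handler, metadata, context);
        xhtmlContents = handler.toString();
    } catch (IOException | SAXException | TikaException e) {
        // On failure, report the error and return the empty string.
        e.printStackTrace();
    }

    return xhtmlContents;
}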
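
Both answers build the output file name directly from metadata.get(Metadata.RESOURCE_NAME_KEY). That value can be missing, and a name containing path separators could end up outside the output folder. Below is a minimal sketch of a guard that could be applied before the copy. It uses only the JDK; SafeImageNames, safeName, target, and the embedded-N.bin fallback are hypothetical choices, not Tika API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika.
class SafeImageNames {

    /** Returns a bare file name, or "embedded-<index>.bin" if none is usable. */
    static String safeName(String resourceName, int index) {
        if (resourceName != null) {
            // Keep only the part after the last '/' or '\'.
            String base = resourceName.replace('\\', '/');
            base = base.substring(base.lastIndexOf('/') + 1).trim();
            if (!base.isEmpty() && !base.equals(".") && !base.equals("..")) {
                return base;
            }
        }
        return "embedded-" + index + ".bin";
    }

    /** Resolves the target file inside outputDir. */
    static Path target(Path outputDir, String resourceName, int index) {
        return outputDir.resolve(safeName(resourceName, index));
    }

    public static void main(String[] args) {
        System.out.println(safeName("image0.png", 0));        // image0.png
        System.out.println(safeName("../../etc/passwd", 1));  // passwd
        System.out.println(safeName(null, 2));                // embedded-2.bin
        System.out.println(target(Paths.get("report.pdf_"), "image0.png", 0));
    }
}
```

Inside parseEmbedded, the index could come from a counter field on the extractor, so that unnamed resources do not overwrite each other.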
