如何在Java中使用Apache Tika從PDF文件獲取頁眉和頁腳

Question

我正在使用apache tika來抓取pdf文件中的內容。抓取的內容（文本）還包含頁眉和頁腳。我的要求是獲取不包含頁眉和頁腳的文本。以下是我抓取內容的示例代碼。 樣例代碼：

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import org.apache.commons.io.FileUtils;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.json.simple.JSONObject;

public class test {

    public static void main(String[] args) throws Exception {

            String file = "C://Sample.pdf";
            File file1 = new File(file);
            InputStream input = new FileInputStream(file1);
            Metadata metadata = new Metadata();
            BodyContentHandler handler = new BodyContentHandler(
                    10 * 1024 * 1024);
            AutoDetectParser parser = new AutoDetectParser();
            parser.parse(input, handler, metadata);
            String path = "C://AUG7th".concat("/").concat(file1.getName())
                    .concat(".txt");
            String content = handler.toString();
            File file2 = new File(path);
            FileWriter fw = new FileWriter(file2.getAbsoluteFile());
            BufferedWriter bw = new BufferedWriter(fw);
            bw.write(content);
            bw.close();

    }

}

如何做到這一點，請建議我。 謝謝

Answer 1

我還沒有找到使用Tika解析pdf標題或頁腳的方法。 您需要另一個api來執行此操作，例如PDFTextSTream 。

編輯： OK .. Tika將（嘗試）從pdf中提取原始文本和元數據。
您需要分析和分析原始文本才能刪除標題和頁腳。 我建議使用PDFTextStream而不是Tika，因為它可以簡化為此目的實現算法的任務。 當您使用PDFTextStream解析pdf時，您可以提取不是簡單字符的TextUnit，但它們也會“攜帶”其他信息。 您還可以選擇文本區域，此外還可以選擇保持每個頁面的視覺布局。

@Gagravarr XHTML輸出pdf

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
**<head>**
<meta name="dcterms:modified" content="2012-11-21T16:08:42Z"/>
<meta name="meta:creation-date" content="2010-06-22T07:00:09Z"/>
<meta name="meta:save-date" content="2012-11-21T16:08:42Z"/>
<meta name="Content-Length" content="702419"/>
<meta name="Last-Modified" content="2012-11-21T16:08:42Z"/>
<meta name="dcterms:created" content="2010-06-22T07:00:09Z"/>
<meta name="date" content="2012-11-21T16:08:42Z"/>
<meta name="modified" content="2012-11-21T16:08:42Z"/>
<meta name="xmpTPg:NPages" content="20"/>
<meta name="Creation-Date" content="2010-06-22T07:00:09Z"/>
<meta name="created" content="Tue Jun 22 09:00:09 CEST 2010"/>
<meta name="producer" content="Atypon Systems, Inc."/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="xmp:CreatorTool" content="PDFplus"/>
<meta name="resourceName" content="Lessons from a High-Impact Observatory The Hubble Space Telescope.pdf"/>
<meta name="Last-Save-Date" content="2012-11-21T16:08:42Z"/>
<meta name="dc:title" content="Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008"/>
<title>Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008</title>
**</head>**
**<body>**<div class="page"><p/>
<p>Lessons from a High-Impact Observatory: The Hubble Space Telescope’s Science Productivity
between 1998 and 2008
Author(s): Dániel Apai, Jill Lagerstrom, Iain Neill Reid, Karen L. Levay, Elizabeth Fraser,
Antonella Nota, and Edwin Henneken
Reviewed work(s):
Source: Publications of the Astronomical Society of the Pacific, Vol. 122, No. 893 (July 2010),
pp. 808-826
Published by: The University of Chicago Press on behalf of the Astronomical Society of the Pacific
Stable URL: http://www.jstor.org/stable/10.1086/654851 .
Accessed: 21/11/2012 11:08
</p>
<p>Your use of the JSTOR archive indicates your acceptance of the Terms &amp; Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp
</p>
<p> .
</p>
<p>JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.
</p>................**</body>**

在頭提卡給了我們，它發現的元數據，並在身體它使我們在划分段落文本（似乎有點笨拙太），也可以給我們注釋鏈接。 所以，我認為它不是很有幫助。

如何在Java中使用Apache Tika從PDF文件獲取頁眉和頁腳

問題描述

1 個解決方案

解決方案1
0 2014-02-18 08:56:59

如何在Java中使用Apache Tika從PDF文件獲取頁眉和頁腳

問題描述

1 個解決方案

解決方案1 0 2014-02-18 08:56:59

解決方案1
0 2014-02-18 08:56:59