
How to get the header and footer from a PDF file using Apache Tika in Java

I am using Apache Tika to crawl the content from a PDF file. The crawled content (text) also contains the headers and footers. My requirement is to get the text without the headers and footers. Below is my sample code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class Test {

    public static void main(String[] args) throws Exception {

        File pdf = new File("C://Sample.pdf");
        Metadata metadata = new Metadata();
        // Allow up to 10 MB of extracted text (default limit is 100 KB)
        BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
        AutoDetectParser parser = new AutoDetectParser();

        try (InputStream input = new FileInputStream(pdf)) {
            parser.parse(input, handler, metadata);
        }

        // Write the extracted text next to the original file name
        String path = "C://AUG7th/" + pdf.getName() + ".txt";
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(path))) {
            bw.write(handler.toString());
        }
    }
}

How can I do this? Please suggest an approach. Thanks.

I haven't found a way to parse the headers or footers of a PDF using Tika. You need another API to do that, such as PDFTextStream.

EDIT: OK, Tika will (try to) extract raw text and metadata from the PDF.
You then need to parse and analyze that raw text in order to delete the headers and footers. I suggested PDFTextStream rather than Tika because it simplifies the task of implementing an algorithm for this purpose. When you parse a PDF with PDFTextStream you can extract TextUnits, which are not simple characters but "carry" other information too. You also have the ability to select a region of text, and in addition it gives you the choice of maintaining the visual layout of each page.
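If you stay with Tika, one common heuristic for the "analyze the raw text" step is: lines that repeat near the top or bottom of most pages are probably running headers or footers. Below is a minimal, self-contained sketch of that idea (the class name `HeaderFooterStripper` and all parameters are made up for illustration). It assumes you have already split the extracted text into per-page strings, for example by splitting Tika's XHTML output on its per-page `<div class="page">` elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HeaderFooterStripper {

    /**
     * Drops lines that occur within the first/last {@code edgeLines} lines
     * of at least {@code minFraction} of the pages -- a rough heuristic
     * for running headers and footers.
     */
    public static String stripRepeatedLines(List<String> pages,
                                            int edgeLines,
                                            double minFraction) {
        // Count how many pages each "edge" line appears on (once per page).
        Map<String, Integer> edgeCounts = new HashMap<>();
        List<String[]> splitPages = new ArrayList<>();
        for (String page : pages) {
            String[] lines = page.split("\\R");
            splitPages.add(lines);
            Set<String> edges = new HashSet<>();
            for (int i = 0; i < lines.length; i++) {
                if (i < edgeLines || i >= lines.length - edgeLines) {
                    edges.add(lines[i].trim());
                }
            }
            for (String e : edges) {
                edgeCounts.merge(e, 1, Integer::sum);
            }
        }

        int threshold = (int) Math.ceil(minFraction * pages.size());
        StringBuilder out = new StringBuilder();
        for (String[] lines : splitPages) {
            for (int i = 0; i < lines.length; i++) {
                boolean atEdge = i < edgeLines || i >= lines.length - edgeLines;
                if (atEdge && edgeCounts.getOrDefault(lines[i].trim(), 0) >= threshold) {
                    continue; // likely header/footer: skip it
                }
                out.append(lines[i]).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "My Journal\nPage body one\n1",
                "My Journal\nPage body two\n2",
                "My Journal\nPage body three\n3");
        // "My Journal" repeats on every page and is removed; the page
        // numbers differ per page, so this heuristic keeps them.
        System.out.println(stripRepeatedLines(pages, 1, 0.6));
    }
}
```

Note the limitation visible in the example: page numbers change on every page, so a pure repeated-line heuristic won't catch them; you would need an extra pattern check (e.g. an edge line that is only digits) for those.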

@Gagravarr Here is the XHTML output of a PDF:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="dcterms:modified" content="2012-11-21T16:08:42Z"/>
<meta name="meta:creation-date" content="2010-06-22T07:00:09Z"/>
<meta name="meta:save-date" content="2012-11-21T16:08:42Z"/>
<meta name="Content-Length" content="702419"/>
<meta name="Last-Modified" content="2012-11-21T16:08:42Z"/>
<meta name="dcterms:created" content="2010-06-22T07:00:09Z"/>
<meta name="date" content="2012-11-21T16:08:42Z"/>
<meta name="modified" content="2012-11-21T16:08:42Z"/>
<meta name="xmpTPg:NPages" content="20"/>
<meta name="Creation-Date" content="2010-06-22T07:00:09Z"/>
<meta name="created" content="Tue Jun 22 09:00:09 CEST 2010"/>
<meta name="producer" content="Atypon Systems, Inc."/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="xmp:CreatorTool" content="PDFplus"/>
<meta name="resourceName" content="Lessons from a High-Impact Observatory The Hubble Space Telescope.pdf"/>
<meta name="Last-Save-Date" content="2012-11-21T16:08:42Z"/>
<meta name="dc:title" content="Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008"/>
<title>Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008</title>
</head>
<body><div class="page"><p/>
<p>Lessons from a High-Impact Observatory: The Hubble Space Telescope’s Science Productivity
between 1998 and 2008
Author(s): Dániel Apai, Jill Lagerstrom, Iain Neill Reid, Karen L. Levay, Elizabeth Fraser,
Antonella Nota, and Edwin Henneken
Reviewed work(s):
Source: Publications of the Astronomical Society of the Pacific, Vol. 122, No. 893 (July 2010),
pp. 808-826
Published by: The University of Chicago Press on behalf of the Astronomical Society of the Pacific
Stable URL: http://www.jstor.org/stable/10.1086/654851 .
Accessed: 21/11/2012 11:08
</p>
<p>Your use of the JSTOR archive indicates your acceptance of the Terms &amp; Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp
</p>
<p> .
</p>
<p>JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.
</p>................**</body>**

In the head, Tika gives us the metadata that it found, and in the body it gives us the text divided into paragraphs (which seems a bit clumsy too), and it can also give us annotation links. So I don't think it's very helpful.
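That said, the XHTML output does make the page boundaries (`<div class="page">`) and paragraphs explicit, so it can be post-processed with a standard XML parser. A minimal sketch using only the JDK's DOM API (the class name `XhtmlPages` is made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XhtmlPages {

    /** Collects the text of each <p> inside every <div class="page">. */
    public static List<List<String>> paragraphsPerPage(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

        List<List<String>> pages = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!"page".equals(div.getAttribute("class"))) {
                continue; // only Tika's per-page divs
            }
            List<String> paras = new ArrayList<>();
            NodeList ps = div.getElementsByTagName("p");
            for (int j = 0; j < ps.getLength(); j++) {
                String text = ps.item(j).getTextContent().trim();
                if (!text.isEmpty()) {
                    paras.add(text);
                }
            }
            pages.add(paras);
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>Header</p><p>Body text</p></div>"
                + "<div class=\"page\"><p>Header</p><p>More text</p></div>"
                + "</body></html>";
        System.out.println(paragraphsPerPage(sample));
    }
}
```

With the pages separated like this, a repeated-line (or repeated-first-paragraph) heuristic becomes much easier to apply than on the flat text.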
