
Is there any plugin in Apache Nutch to index both web HTML and PDFs as raw content?

Original 2018-04-23 07:31:02 7 2 java/ solr/ hbase/ nutch

Is there any plugin in Apache Nutch to index both web HTML and PDFs with their raw content, so that formatting is not lost? Also, can Nutch crawl internal PDF links present in an HTML file?

2 answers

For PDF there is nothing out of the box. Nutch uses Tika and tries to extract plain text. You could write your own plugin (using PDFBox for instance) and try to extract formatting information about the document.

Keep in mind that the raw content of a PDF file will not make a lot of sense on its own. You could try converting your PDF to HTML/XML and then making sense of the structure. A library such as http://pdfx.cs.man.ac.uk/example might work for you. It's impossible to know without some experimentation.

About the "internal links": do you mean links within the same document, or links to other documents/web pages inside the PDF content? If you mean internal links in the PDF, you could probably extract them, depending on the library.
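If the question is about the other reading, links in an HTML page that point to PDF files, the idea can be sketched outside Nutch with plain Java. This is only an illustration under stated assumptions: the class `PdfLinkFinder`, the method `findPdfLinks`, and the `example.com` URLs are hypothetical, and the regex is a simplification (Nutch's own HTML parser extracts outlinks for you, and a real crawler would use a proper HTML parser).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects absolute URLs of PDF links in an HTML string.
class PdfLinkFinder {
    // Simplified href matcher; stops at the closing quote or at a '#' fragment.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    static List<String> findPdfLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            URI resolved;
            try {
                // Resolve relative links against the page URL.
                resolved = base.resolve(m.group(1).trim());
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not valid URIs
            }
            String path = resolved.getPath();
            // Keep only targets whose path ends in .pdf (query strings are ignored).
            if (path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                result.add(resolved.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"docs/report.pdf\">Report</a> <a href='/about.html'>About</a>";
        System.out.println(findPdfLinks(html, "http://example.com/site/index.html"));
    }
}
```

Matching on the `.pdf` suffix is a heuristic; PDFs served from URLs without that suffix would only be recognised by their content type after fetching.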

Keep in mind that PDF is not an easy format to process. The Tika/PDFBox projects have done an amazing job of easing this task, and even with all the time and effort put into them, some edge-case files remain "problematic". Just a small warning 👍.

Make sure your nutch-site.xml contains a property named plugin.includes whose value includes parse-(text|html|pdf).
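As a rough sketch, such a property in `conf/nutch-site.xml` might look like the fragment below. The value is an assumption modelled on common Nutch 1.x setups, where PDF parsing goes through the `parse-tika` plugin (the first answer notes that Nutch uses Tika) rather than a dedicated `parse-pdf` plugin. Plugin names differ between releases, so compare the list with the `plugins` directory of your installation before using it.

```xml
<!-- Sketch only: adjust the plugin list to the plugins shipped with your Nutch version. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If your release does ship separate `parse-text` or `parse-pdf` plugins, the `parse-(...)` group would list those names instead.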

Apache nutch is not crawling any more

How to Create a nutch plugin that returns raw html to the parser

Apache Solr does not index scanned PDFs

How to test Apache Nutch plugin via some use cases

How to save fetched html content to database in apache nutch?

java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException when parsing with nutch

Apache Nutch - Problems with Paths

Apache Nutch Hadoop Integration

How can I extract raw text from PDFs using Apache POI?

Apache Nutch - NoSuchMethodError


The technical posts on this site are published under the CC BY-SA 4.0 license. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.



solution1
0 2018-04-23 10:44:52

solution2
0 2018-06-14 21:26:51
