
Is there a plugin in Apache Nutch to index both HTML pages and PDFs with their raw content?

Is there a plugin in Apache Nutch that indexes both HTML pages and PDFs with their raw content, so that formatting is not lost? Also, can Nutch crawl the PDF links found inside an HTML file?

For PDF, there is nothing out of the box. Nutch uses Tika and tries to extract plain text. You could write your own plugin (using PDFBox, for instance) and try to extract formatting information from the document.

Keep in mind that the raw content of a PDF file will not make much sense on its own. You could try converting the PDF to HTML/XML and then work with that structure. A tool such as http://pdfx.cs.man.ac.uk/example might be useful here. It's impossible to know without doing some experimentation.

About the "internal links": do you mean links within the same document, or links inside the PDF content that point to other documents or web pages? If you mean internal links in the PDF, you could probably extract them, depending on the library.
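If the question is instead about PDF links that appear in an HTML page, the idea is simply to collect the anchors whose target ends in `.pdf` and resolve them against the page URL. The following is a minimal standalone sketch of that idea, not Nutch's own parser; the class name `PdfLinkCollector`, the helper `find_pdf_links`, and the sample URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of <a href> targets whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the URL of the page being parsed.
        absolute = urljoin(self.base_url, href)
        # Check only the path, so query strings such as ?dl=1 do not interfere.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Hypothetical helper: return all PDF link targets found in `html`."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_links


sample = (
    '<p><a href="docs/report.pdf">Report</a> '
    '<a href="/about.html">About</a> '
    '<a href="https://example.org/a/Paper.PDF?dl=1">Paper</a></p>'
)
links = find_pdf_links(sample, "https://example.com/site/index.html")
print(links)
```

In a real crawl, whether such links are actually fetched would also depend on your URL filter settings.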

Keep in mind that PDF is not an easy format to process. The Tika/PDFBox projects have done an amazing job of easing this task, and even with all the time and effort put into them, there are still some edge-case files that are "problematic". Just a small warning 👍.

Make sure your nutch-site.xml contains a property named plugin.includes whose value includes parse-(text|html|pdf).
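As a rough sketch, such a property might look like the fragment below. Only the `parse-(text|html|pdf)` group comes from the answer above; the other plugin ids in the value are illustrative assumptions, and the parse plugins actually available depend on your Nutch version (where PDF parsing goes through Tika, the entry would be `parse-tika` instead).

```xml
<property>
  <name>plugin.includes</name>
  <!-- Only the parse-(...) group is from the answer; the other ids are assumed for illustration. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic</value>
</property>
```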

