Is there any plugin in Apache Nutch to index both web HTML and PDFs with raw content?

Is there any plugin in Apache Nutch to index both web HTML and PDFs with their raw content, so that formatting is not lost? Also, can we crawl internal PDF links present in an HTML file using Nutch?

For PDF there is nothing out of the box. Nutch uses Tika and tries to extract plain text. You could write your own plugin (using PDFBox, for instance) and try to extract formatting information about the document.

Keep in mind that the raw content of a PDF file will not make a lot of sense. You could probably try converting your PDF to HTML/XML and then try to make sense of the structure. Perhaps a library such as http://pdfx.cs.man.ac.uk/example would make sense for you. It's impossible to know without doing some experimentation.

About the "internal links": do you mean links within the same document, or links inside the PDF content that point to other documents/web pages? If you mean internal links in the PDF, you could probably do it, depending on the library.
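If the question is instead about links in an HTML page that point to PDF files, the idea can be illustrated outside Nutch with a small standalone sketch. This is not how Nutch extracts outlinks internally; the class name `PdfLinkFinder` and the method `findPdfLinks` are made up for illustration, and a regex is used only to keep the example dependency-free (a real crawler would use a proper HTML parser).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects links to PDF files from an HTML string.
public class PdfLinkFinder {

    // Captures the quoted href value of <a ...> tags (sketch-level HTML handling only).
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns absolute URLs of all anchors whose path ends in ".pdf" (case-insensitive).
    public static List<String> findPdfLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Resolve relative links against the URL of the page they came from.
                URI resolved = base.resolve(m.group(1).trim());
                String path = resolved.getPath(); // null for opaque URIs such as mailto:
                if (path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                    result.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed href (for example, one containing spaces): skip it.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"docs/manual.pdf\">Manual</a> <a href=\"/about.html\">About</a>";
        for (String link : findPdfLinks(html, "http://example.com/site/index.html")) {
            System.out.println(link);
        }
    }
}
```

Checking the URL path rather than the raw href means a link such as `report.PDF?v=2` is still recognized. Links that serve a PDF without a `.pdf` suffix would be missed by this check; detecting those requires looking at the response content type.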

Keep in mind that PDF is not an easy format to process. The Tika/PDFBox projects have done an amazing job in easing this task, and even with all the time/effort put into it, there are still some edge-case files that are "problematic". Just a small warning 👍.

Make sure your nutch-site.xml contains the property named plugin.includes, and that its value includes parse-(text|html|pdf).
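As a rough sketch, the property would look something like the fragment below in conf/nutch-site.xml. Only the `parse-(text|html|pdf)` part comes from the answer; the other plugin names in the value are an illustrative assumption and should be taken from the `plugin.includes` default in your own installation's nutch-default.xml. Also note that `parse-pdf` comes from older Nutch releases; in more recent 1.x releases PDF parsing is normally handled by `parse-tika`, so check which parse plugins exist in your plugins directory.

```xml
<!-- conf/nutch-site.xml (fragment). Plugin list is illustrative, not a recommended default. -->
<property>
  <name>plugin.includes</name>
  <!-- On newer releases, parse-(html|tika) would typically replace parse-(text|html|pdf). -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic</value>
</property>
```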
