
Fetch particular tags from HTML docs obtained after crawling and parsing using Apache Nutch 1.4

I crawled a website with Nutch 1.4. The crawl succeeded and all the pages were dumped into segments. I merged all the segments into one segment and then used the readseg command to obtain a text version of all the crawled pages. Now I need to find the URL of each page and the metadata stored in that page. I don't know which command to use, or whether I need to do something different.

I have searched Google extensively. Some people said that you have to write a separate plugin for it. Can someone please tell me how to do this?

Thanks a lot :) :)

Run a crawl. After that, enter this in the terminal:

bin/nutch readseg -dump crawl/segments/* output -nocontent -nofetch -nogenerate -noparse -noparsedata

If it runs, you will have a file with header information plus content in it. After that you can easily process the file and extract whatever info you want with string operations.
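As a rough illustration of those string operations, here is a minimal Python sketch. It assumes the dump is plain text in which every record starts with a `Recno:: <n>` line followed by a `URL:: <url>` line; check your own dump file before relying on that layout. The names `split_records` and `MetaTagParser` are made up for this example and are not part of Nutch. Meta tags can only be found if the raw HTML is present in the dump, which probably means running readseg without `-nocontent`.

```python
import re
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            # Lower-case the name so "Keywords" and "keywords" are treated alike.
            self.meta[name.lower()] = content


def split_records(dump_text):
    """Split a readseg text dump into records with a URL and any meta tags.

    Assumption: each record begins with a line of the form "Recno:: <n>"
    and contains a line of the form "URL:: <url>".
    """
    chunks = re.split(r"^Recno:: \d+[ \t]*$", dump_text, flags=re.MULTILINE)
    records = []
    for chunk in chunks[1:]:  # chunks[0] is whatever precedes the first record
        match = re.search(r"^URL:: (\S+)", chunk, flags=re.MULTILINE)
        if not match:
            continue
        parser = MetaTagParser()
        parser.feed(chunk)  # finds nothing if the record holds no raw HTML
        records.append({"url": match.group(1), "meta": parser.meta})
    return records
```

To try it, read the dump file into a string (the path depends on where readseg wrote its output), pass it to `split_records`, and loop over the result to print each `url` with its `meta` dictionary.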

Finally, I was able to do it. Sharing in case someone else needs it. You can use the index-metatags plugin provided here: http://wiki.apache.org/nutch/IndexMetatags

It will solve this problem. Cheers :)
