Getting specific tags from HTML documents obtained after crawling and parsing with Apache Nutch 1.4

Question

I used Nutch 1.4 to crawl a website. The crawl succeeded and all the pages were dumped into segments. I merged the segments into one and then used the readseg command to obtain a text version of all the crawled pages. Now I need to find the URL of each page and the metadata stored in that page. I don't know which command to use, or whether I need to do something different.

I have searched a lot on Google. Some people say you have to write a separate plugin for this. Can someone please tell me how to do it?

Thanks a lot :)

Answer 1

Run a crawl. After that, enter this in the terminal:

bin/nutch readseg -dump crawl/segments/* output -nocontent -nofetch -nogenerate -noparse -noparsedata

If it runs, you will have a file with header information plus contents in it. You can then process that file and extract whatever information you want with string operations.
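As a rough illustration of the "string operations" step, here is a minimal Python sketch. It assumes the dump is plain text in which every record starts with a "Recno:: N" line followed by a "URL:: ..." line, and in which each part of the record is introduced by a header line such as "ParseText::". That layout is an assumption about the readseg output, so check it against your own dump file first. The names parse_dump, section_text and SAMPLE are made up for this example and are not part of Nutch.

```python
from typing import Dict, Iterable, List

# Hypothetical sample in the ASSUMED dump layout; replace it with the lines
# of your real dump file once you have confirmed the format matches.
SAMPLE = """Recno:: 0
URL:: http://example.com/

ParseText::
Hello world

Recno:: 1
URL:: http://example.com/about

ParseText::
About us
"""


def parse_dump(lines: Iterable[str]) -> List[Dict]:
    """Group dump lines into records with a URL and named sections."""
    records: List[Dict] = []
    current = None   # record being filled in
    section = None   # name of the section being read, e.g. "ParseText"
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("Recno:: "):
            # Assumed record boundary: start a new record.
            current = {"url": None, "sections": {}}
            records.append(current)
            section = None
        elif current is None:
            continue  # ignore anything before the first record
        elif line.startswith("URL:: "):
            current["url"] = line[len("URL:: "):].strip()
        elif line.endswith("::") and " " not in line:
            # Assumed section header such as "ParseText::".
            section = line[:-2]
            current["sections"][section] = []
        elif section is not None:
            current["sections"][section].append(line)
    return records


def section_text(record: Dict, name: str) -> str:
    """Return one section of a record as a single stripped string."""
    return "\n".join(record["sections"].get(name, [])).strip()


if __name__ == "__main__":
    for rec in parse_dump(SAMPLE.splitlines()):
        print(rec["url"], "->", section_text(rec, "ParseText"))
```

To run it on a real crawl, pass the lines of the dump file in the output directory to parse_dump instead of SAMPLE. A line of page text that happens to end in "::" would be mistaken for a section header, so treat this as a starting point rather than a robust parser.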

Answer 2

I was finally able to do it. Sharing in case someone else needs it. You can use the index-metatags plugin provided here: http://wiki.apache.org/nutch/IndexMetatags

It will solve this problem. Cheers :)

2 answers

Answer 1
0 2012-04-20 11:19:46

Answer 2
0 Accepted 2012-03-21 13:35:19