
Apache Nutch is not crawling any more

I have a two-machine cluster. Nutch is configured on one machine, and HBase and Hadoop are configured on the other. Hadoop runs in fully distributed mode and HBase in pseudo-distributed mode. I have crawled about 280 GB of data. But now when I start crawling, it prints the following messages and no longer crawls into the previous table:

INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule

and the following error:

ERROR store.HBaseStore - [Ljava.lang.StackTraceElement;@7ae0c96b

Documents are fetched, but they are not saved in HBase. However, if I crawl data into a new table, it works well and crawls properly without any error. I don't think this is a connection problem, since it works for a new table. I suspect it is caused by some configuration property.

Can anyone guide me? I am not an expert in Apache Nutch.

This is not my area of expertise, but it looks like thread exhaustion on the underlying machine.

I was facing a similar problem. The actual cause was the regionserver (the HBase daemon): with the default settings it shuts down when there is too much data in HBase. So try restarting it, as sketched below. For more information, see the regionserver's log files.
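
For reference, this is roughly how you can check, restart, and inspect the regionserver, assuming a standard HBase installation with HBASE_HOME pointing at the install directory (exact paths and log file names depend on your setup):

    # check whether the HRegionServer process is still running
    jps

    # restart the regionserver daemon (stop is harmless if it already died)
    $HBASE_HOME/bin/hbase-daemon.sh stop regionserver
    $HBASE_HOME/bin/hbase-daemon.sh start regionserver

    # inspect the regionserver log for the reason it shut down
    tail -n 200 $HBASE_HOME/logs/hbase-*-regionserver-*.log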
