How do I crawl a particular website using Apache Nutch?

I have followed the tutorial at the URL below and completed it successfully up to the "Step-by-Step: Invertlinks" section:

https://wiki.apache.org/nutch/NutchTutorial#Crawl_your_first_website

But I did not get any data for the pages I crawled.

I am new to this technology.

Please share steps, a demo, a site, or an example if you have done this successfully before, and please do not give rough steps.

First, install Nutch.

Then, in conf/nutch-site.xml, add:

<property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
</property>
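
For reference, the property sits inside the <configuration> element, so a minimal complete conf/nutch-site.xml looks roughly like this (the agent name is just an example):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
    </property>
</configuration>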

In conf/nutch-default.xml, add:

<property>
  <name>http.robot.rules.whitelist</name>
  <value>nihilent.com</value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>
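
Before whitelisting a host this way, it is worth checking what its robots.txt actually blocks; a quick check (assuming curl is available):

curl -s http://nihilent.com/robots.txt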

In conf/regex-urlfilter.txt:

# accept anything else
+.
+^http://([a-z0-9]*\.)*nihilent.com/

Also comment out the following lines:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
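
Note that rules in regex-urlfilter.txt are applied top to bottom and the first match wins, so the catch-all +. above accepts every URL before the nihilent.com rule is ever reached. If the goal is to crawl only that site, a common variant (a sketch, not what this answer uses) is to drop the catch-all and accept just the one domain:

# accept only nihilent.com and its subdomains
+^https?://([a-z0-9-]+\.)*nihilent.com/
# reject everything else
-.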

Then create a seed list and run the commands below.
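
The inject step reads its start URLs from a plain-text seed file, which the NutchTutorial has you create first (the URL here is this question's example site):

mkdir -p urls
echo 'http://nihilent.com/' > urls/seed.txt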

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
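
To crawl beyond the seed pages, the tutorial repeats this generate/fetch/parse/updatedb cycle; a second round follows the same pattern (-topN 1000 is the tutorial's example limit):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2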

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Now check your data in the crawl/crawldb folder and the other output directories.
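
A quick way to confirm the crawl actually stored something is the CrawlDb statistics report, which prints the total URL count and the fetch status breakdown:

bin/nutch readdb crawl/crawldb -stats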

Below are a few commands that will help you use Nutch in various ways:

  • These commands cover direct crawling from the console, reading the crawled data, dumping it, and so on.
  • I am listing all the commands I have used; please adapt the paths to your own requirements.

Nutch commands

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

bin/nutch commoncrawldump -outputDir hdfs://localhost:9000/dfs -segment /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ -jsonArray -reverseKey -SimpleDateFormat -epochFilename

bin/nutch readseg -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/1
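
readseg -dump writes a plain-text file named dump into the output directory, so the fetched and parsed content can be inspected with any pager:

less /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/1/dump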

bin/nutch readseg -get /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments http://1465212304000.html -nofetch -nogenerate -noparse -noparsedata -noparsetext

bin/nutch parsechecker -dumpText http://nihilent.com/

bin/nutch readlinkdb /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/linkdb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn/3

bin/nutch readdb crawl/crawldb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn

bin/nutch readdb crawl/crawldb -dump hdfs://localhost:9000/dfs

Copy data from the local filesystem into HDFS:

hadoop fs -copyFromLocal /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/data/commoncrawl/com hdfs://localhost:9000/dfs
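
To confirm the upload, list the target HDFS directory:

hadoop fs -ls hdfs://localhost:9000/dfs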

I have added this as a separate answer to avoid mixing it into the one above.
