I have followed the URL below and completed everything successfully up to "Step-by-Step: Invertlinks":
https://wiki.apache.org/nutch/NutchTutorial#Crawl_your_first_website
But I didn't get any data from those steps.
I am new to this technology, so please share the steps, a demo, a site, or an example if someone has done this successfully before. Please do not give rough steps.
First, install Nutch. Then, under the configuration section of nutch-site.xml, paste:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
In nutch-default.xml, add:
<property>
<name>http.robot.rules.whitelist</name>
<value>http://nihilent.com/</value>
<description>Comma separated list of hostnames or IP addresses to ignore
robot rules parsing for. Use with care and only if you are explicitly
allowed by the site owner to ignore the site's robots.txt!
</description>
</property>
In regex-urlfilter.txt, add:
# accept anything else
+.
+^http://([a-z0-9]*\.)*nihilent.com/
Also comment out the rule that skips URLs containing certain characters:
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
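Before crawling, you can sanity-check the filter regex for the seed site with grep -E. This check is my own illustration, not part of the tutorial; it only verifies that the pattern accepts the seed URL and rejects an off-site one:

```shell
# The URL-filter pattern for the seed site (same syntax as the regex-urlfilter.txt line,
# without the leading "+" accept marker).
pattern='^http://([a-z0-9]*\.)*nihilent.com/'

# The seed URL should match...
echo "http://nihilent.com/" | grep -Eq "$pattern" && echo "seed URL matches"

# ...while an unrelated host should not.
echo "http://example.com/" | grep -Eq "$pattern" || echo "off-site URL rejected"
```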
Then run the commands below:
bin/nutch inject crawl/crawldb urls
(point inject at the seed directory you created; older versions of the tutorial used a dmoz directory instead of urls)
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
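The generate/fetch/parse/updatedb commands above form one crawl round. As a sketch of my own (not from the tutorial), they can be wrapped in a small loop script; here NUTCH is set to a dry-run echo so the script just prints the commands it would run, and you would drop the "echo" on a real install:

```shell
#!/bin/sh
# Dry run: prints the Nutch commands instead of executing them.
# On a real install, set NUTCH="bin/nutch".
NUTCH="echo bin/nutch"

# Run $1 generate/fetch/parse/updatedb rounds, then invert links once.
crawl() {
  rounds=$1
  i=1
  while [ "$i" -le "$rounds" ]; do
    $NUTCH generate crawl/crawldb crawl/segments
    # Newest segment: timestamped directory names sort lexically.
    s1=$(ls -d crawl/segments/2* 2>/dev/null | tail -1)
    $NUTCH fetch "$s1"
    $NUTCH parse "$s1"
    $NUTCH updatedb crawl/crawldb "$s1"
    i=$((i + 1))
  done
  $NUTCH invertlinks crawl/linkdb -dir crawl/segments
}

crawl 2
```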
Now check your data in the crawl/crawldb folder and the other output directories (for example, bin/nutch readdb crawl/crawldb -stats prints a summary of the crawl database).
Below are a few commands that will help you use Nutch in various ways:
bin/nutch inject crawl/crawldb dmoz
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch commoncrawldump -outputDir hdfs://localhost:9000/dfs -segment /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ -jsonArray -reverseKey -SimpleDateFormat -epochFilename
bin/nutch readseg -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/1
bin/nutch readseg -get /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments http://1465212304000.html -nofetch -nogenerate -noparse -noparsedata -noparsetext
bin/nutch parsechecker -dumpText http://nihilent.com/
bin/nutch readlinkdb /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/linkdb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn/3
bin/nutch readdb crawl/crawldb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn
bin/nutch readdb crawl/crawldb -dump hdfs://localhost:9000/dfs
hadoop fs -copyFromLocal <local-path> <hdfs-path>, for example:
hadoop fs -copyFromLocal /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/data/commoncrawl/com hdfs://localhost:9000/dfs
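As an illustration only (not from the tutorial), the copy and a follow-up listing can be combined in a short dry-run sketch; HADOOP echoes the commands instead of running them, and the paths are the same ones used above:

```shell
# Dry run: prints the Hadoop commands instead of executing them.
# On a real cluster, set HADOOP="hadoop".
HADOOP="echo hadoop"
SRC=/home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/data/commoncrawl/com
DST=hdfs://localhost:9000/dfs

# Copy the local dump into HDFS, then list the target to verify it arrived.
$HADOOP fs -copyFromLocal "$SRC" "$DST"
$HADOOP fs -ls "$DST"
```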