How do I crawl a particular website using Apache Nutch?

I have followed the tutorial at the URL below and completed it successfully up to the "Step-by-Step: Invertlinks" section:

https://wiki.apache.org/nutch/NutchTutorial#Crawl_your_first_website

But I did not get any data for the pages I crawled.

I am new to this technology.

Please share steps, a demo, a site, or an example if you have done this successfully before, and please do not give rough steps.

First, install Nutch.

Then, in conf/nutch-site.xml, add:

<property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
</property>
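
For reference, the property sits inside the <configuration> element, so a minimal complete conf/nutch-site.xml looks roughly like this (the agent name is just an example):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
    </property>
</configuration>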

In conf/nutch-default.xml, add:

<property>
  <name>http.robot.rules.whitelist</name>
  <value>nihilent.com</value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>
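
Before whitelisting a host this way, it is worth checking what its robots.txt actually blocks; a quick check (assuming curl is available):

curl -s http://nihilent.com/robots.txt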

In conf/regex-urlfilter.txt:

# accept anything else
+.
+^http://([a-z0-9]*\.)*nihilent.com/

Also comment out the following lines:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
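
Note that rules in regex-urlfilter.txt are applied top to bottom and the first match wins, so the catch-all +. above accepts every URL before the nihilent.com rule is ever reached. If the goal is to crawl only that site, a common variant (a sketch, not what this answer uses) is to drop the catch-all and accept just the one domain:

# accept only nihilent.com and its subdomains
+^https?://([a-z0-9-]+\.)*nihilent.com/
# reject everything else
-.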

Then create a seed list and run the commands below.
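
The inject step reads its start URLs from a plain-text seed file, which the NutchTutorial has you create first (the URL here is this question's example site):

mkdir -p urls
echo 'http://nihilent.com/' > urls/seed.txt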

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
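
To crawl beyond the seed pages, the tutorial repeats this generate/fetch/parse/updatedb cycle; a second round follows the same pattern (-topN 1000 is the tutorial's example limit):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2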

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Now check your data in the crawl/crawldb folder and the other output directories.
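
A quick way to confirm the crawl actually stored something is the CrawlDb statistics report, which prints the total URL count and the fetch status breakdown:

bin/nutch readdb crawl/crawldb -stats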

Below are a few commands that will help you use Nutch in various ways:

  • These commands cover direct crawling from the console, reading the crawled data, dumping it, and so on.
  • I am listing all the commands I have used; please adapt the paths to your own requirements.

Nutch commands

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

bin/nutch commoncrawldump -outputDir hdfs://localhost:9000/dfs -segment /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ -jsonArray -reverseKey -SimpleDateFormat -epochFilename

bin/nutch readseg -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments/ /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/1
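
readseg -dump writes a plain-text file named dump into the output directory, so the fetched and parsed content can be inspected with any pager:

less /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/1/dump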

bin/nutch readseg -get /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/segments http://1465212304000.html -nofetch -nogenerate -noparse -noparsedata -noparsetext

bin/nutch parsechecker -dumpText http://nihilent.com/

bin/nutch readlinkdb /home/lokesh_Kumar/soft/apache-nutch-1.11/crawl/linkdb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn/3

bin/nutch readdb crawl/crawldb -dump /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/Data/Team-A/fileLinkedIn

bin/nutch readdb crawl/crawldb -dump hdfs://localhost:9000/dfs

Copy data from the local filesystem into HDFS:

hadoop fs -copyFromLocal /home/lokesh_Kumar/soft/apache-nutch-1.11/ndeploy/data/commoncrawl/com hdfs://localhost:9000/dfs
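
To confirm the upload, list the target HDFS directory:

hadoop fs -ls hdfs://localhost:9000/dfs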

I have added this as a separate answer to avoid mixing it into the one above.
