
Crawl Web Data Using a Web Crawler

I would like to use a web crawler to crawl a particular website. The website is a learning management system where many students upload their assignments, project presentations, and so on. My question is: can I use a web crawler to download the files that have been uploaded to the learning management system? After I download them, I would like to create an index on them so as to query the set of documents, so users can use my application as a search engine. Can a crawler do this? I know about webeater (a crawler written in Java).

  1. Download the files in Java, single-threaded (see the sketch after this list).
  2. Parse the files (you can get ideas from the parse plugins of Nutch).
  3. Create an index with Lucene.
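As a rough illustration of step 1, here is a minimal single-threaded downloader using Java's built-in HttpClient (Java 11+). The file URL and output filename are hypothetical placeholders; in a real crawl you would first discover those links from the LMS pages.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class SingleThreadDownloader {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical file URL from the learning management system
        String fileUrl = "https://lms.example.com/uploads/assignment1.pdf";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(fileUrl)).GET().build();

        // Save the response body straight to disk
        HttpResponse<Path> response = client.send(
                request, HttpResponse.BodyHandlers.ofFile(Path.of("assignment1.pdf")));

        System.out.println("Saved to " + response.body() + " (HTTP " + response.statusCode() + ")");
    }
}
```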

If you want to use a real web crawler, use http://www.httrack.com/

It offers many options for copying websites or content on web pages, including Flash. It works on Windows and Mac.

Then you can do steps 2 and 3 as suggested above.
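If you go the HTTrack route, steps 2 and 3 amount to extracting text from the mirrored files and feeding it to Lucene. Below is a minimal indexing sketch against a recent Lucene release; the directory names are placeholders, and it assumes the files already contain plain text (binary formats such as PDF or PPT would need a text-extraction step first, for example with Apache Tika, which is also what Nutch's parse plugins build on).

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class SimpleIndexer {
    public static void main(String[] args) throws IOException {
        Path docsDir = Path.of("downloaded-files"); // where the crawler/HTTrack stored the files
        Path indexDir = Path.of("lucene-index");    // where the Lucene index will live

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), config);
             Stream<Path> files = Files.walk(docsDir)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    Document doc = new Document();
                    // Store the path so search results can link back to the original file
                    doc.add(new StringField("path", file.toString(), Field.Store.YES));
                    // Index the (already extracted) text content of the file
                    doc.add(new TextField("contents", Files.readString(file), Field.Store.NO));
                    writer.addDocument(doc);
                } catch (IOException e) {
                    System.err.println("Skipping " + file + ": " + e.getMessage());
                }
            });
        }
    }
}
```

Once the index is built, querying it (the "search engine" part of your application) is a matter of opening the same index directory with an IndexSearcher and running queries against the "contents" field.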

