
How do I build a web crawler that can extract particular information from any site?

So I'm trying to build a web crawler that I can point at any review site and have it fairly reliably scrape user reviews from the text. That is, rather than building a separate scraper for, say, Amazon and Overstocked, I want a single scraper that can scrape the reviews for a product off both of them, even if I have to sacrifice some accuracy. I've briefly spoken with one of my professors, and he mentioned that I could basically just implement some heuristics and collect the data from that (as a basic example, just take all the text within p tags). At the moment, I'm really just looking for some advice on which direction to head.
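For illustration, the "all the text within p tags" heuristic could look roughly like the sketch below. It uses only the standard library's `html.parser`; the class name `ParagraphText`, the helper `paragraphs`, and the sample markup are made up for this example, and real review pages would need more filtering than this.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (a deliberately naive heuristic)."""

    def __init__(self):
        super().__init__()
        self.inside_p = False   # True while we are inside a <p> element
        self.paragraphs = []    # finished paragraphs, in document order
        self._buffer = []       # text pieces of the paragraph being read

    def _flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()       # copes with <p> tags that were never closed
            self.inside_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self.inside_p = False

    def handle_data(self, data):
        if self.inside_p:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()           # keep a trailing, unclosed paragraph


def paragraphs(html):
    """Return the text of every <p> element in the given HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical sample markup, not taken from any real site.
sample = ("<div><p>Great <b>product</b>!</p><span>Add to cart</span>"
          "<p>Broke after a week.</p></div>")
print(paragraphs(sample))
```

On the sample, this keeps the two review-like paragraphs and skips the "Add to cart" text, but on a real page it would also pick up every other paragraph (shipping notes, footers, and so on).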

(If it matters any, at the moment I'm using mechanize and lxml (Python) to crawl individual sites.)

Thanks!

There isn't really an 'answer' to this question, but for the benefit of anyone coming across this question:

The concept of a 'generic' scraper is - at best - an interesting academic exercise. It is not likely to be possible in any useful way.

Two useful projects to look at are Scrapy, a Python web scraping framework, and the Natural Language Toolkit (http://www.nltk.org/), a large collection of Python modules for processing, er, natural language text.

Back in the day (circa 1993), I wrote a spider to extract targeted content from a variety of sites that used a collection of "rules" defined for each site.

Rules were expressed as regular expressions and were categorized as either "preparation" rules (those that massaged retrieved pages to better identify or isolate extractable data) or "extraction" rules (those that caused useful data to be extracted).

So for example, given the page:

<html>
  <head><title>A Page</title></head>
  <body>
  <!-- Other stuff here -->
  <div class="main">
    <ul>
      <li>Datum 1</li>
      <li>Datum 2</li>
    </ul>
  </div>
  <!-- Other stuff here -->
  <div>
    <ul>
      <li>Extraneous 1</li>
      <li>Extraneous 2</li>
    </ul>
  </div>
  <!-- Other stuff here -->
  </body>
</html>

The rules to extract only the 'Datum' values might be:

  1. strip leading part using '^.*?<div class="main">'
  2. strip trailing part using '</div>.+</html>$'
  3. extract into result using '<li>([^<]+)</li>'

This worked well for most sites until they changed their layout, at which point the rules for that site required adjusting.
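As a rough sketch, the three rules above could be applied in Python with the standard `re` module. The rule table layout (`RULES`), the `apply_rules` helper, and the "strip"/"extract" labels are assumptions made for this example, not the original spider's design. Note that the patterns only work if `.` is allowed to match newlines, hence `re.DOTALL`.

```python
import re

PAGE = """<html>
  <head><title>A Page</title></head>
  <body>
  <!-- Other stuff here -->
  <div class="main">
    <ul>
      <li>Datum 1</li>
      <li>Datum 2</li>
    </ul>
  </div>
  <!-- Other stuff here -->
  <div>
    <ul>
      <li>Extraneous 1</li>
      <li>Extraneous 2</li>
    </ul>
  </div>
  <!-- Other stuff here -->
  </body>
</html>"""

# Hypothetical per-site rule table: "strip" rules are the preparation rules,
# "extract" rules pull out the data. Rules are applied in order.
RULES = [
    ("strip", r'^.*?<div class="main">'),   # rule 1: drop the leading part
    ("strip", r'</div>.+</html>$'),         # rule 2: drop the trailing part
    ("extract", r'<li>([^<]+)</li>'),       # rule 3: capture each datum
]


def apply_rules(page, rules):
    """Run preparation and extraction rules over a page, returning the captures."""
    text = page.strip()
    results = []
    for kind, pattern in rules:
        # DOTALL lets '.' span line breaks, which the strip rules rely on.
        regex = re.compile(pattern, re.DOTALL)
        if kind == "strip":
            text = regex.sub("", text, count=1)
        else:
            results.extend(regex.findall(text))
    return results


print(apply_rules(PAGE, RULES))
```

Keeping the rules as data rather than code means that a layout change on one site only requires editing that site's rule list.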

Today, I'd probably do the same thing using Dave Raggett's HTML Tidy to normalize all retrieved pages into legal XHTML, and XPath/XSLT to massage the page into the correct format.
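A minimal sketch of the XPath half of that idea, assuming the page has already been normalized into well-formed XHTML (the Tidy step is not shown). To keep the example self-contained it uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset, rather than lxml or XSLT; the function name `extract_data` is made up for this example. Real Tidy output usually declares the XHTML namespace, which would need to be included in the path expression.

```python
import xml.etree.ElementTree as ET

# The example page from above, assumed to be well-formed XHTML already.
XHTML = """<html>
  <head><title>A Page</title></head>
  <body>
  <!-- Other stuff here -->
  <div class="main">
    <ul>
      <li>Datum 1</li>
      <li>Datum 2</li>
    </ul>
  </div>
  <!-- Other stuff here -->
  <div>
    <ul>
      <li>Extraneous 1</li>
      <li>Extraneous 2</li>
    </ul>
  </div>
  <!-- Other stuff here -->
  </body>
</html>"""


def extract_data(xhtml):
    """Return the text of each <li> under the <div class="main"> element."""
    root = ET.fromstring(xhtml)
    # One path expression replaces the two "strip" rules and the "extract" rule.
    return [li.text for li in root.findall(".//div[@class='main']/ul/li")]


print(extract_data(XHTML))
```

The path expression is still site-specific, so it needs the same upkeep as the regular expressions whenever a site changes its layout.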

There's an RDF vocabulary for reviews, and also a microformat. If your reviews are in this format, they will be easy to parse.


 