
Recursive crawling with Python and Scrapy

I'm using Scrapy to crawl a site. The site has 15 listings per page and then a next button. I am running into an issue where my Request for the next link is being called before I have finished parsing all of my listings in the pipeline. Here is the code for my spider:

import urlparse

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

from mysite.loaders import MySiteLoader  # wherever your item loader is defined


class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    def start_requests(self):
        return [Request(self.start_url, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

            item = il.load_item()
            listing_url = listing.select('...').extract()

            if listing_url:
                yield Request(urlparse.urljoin(response.url, listing_url[0]),
                              meta={'item': item},
                              callback=self.parse_listing_details)

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()
        if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)


    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.request.meta['item']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=item)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        return il.load_item()

These lines are the problem. As I said before, they are being executed before the spider has finished crawling the current page. On every page of the site, this causes only 3 out of 15 of my listings to be sent to the pipeline.

     if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)

This is my first spider, so this might be a design flaw on my part. Is there a better way to do this?

Scrape instead of spider?

Because your original problem is to navigate a flat, consecutive series of pages rather than a tree of content of unknown size, you could use mechanize (http://wwwsearch.sourceforge.net/mechanize/) and BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/).

Here's an example of instantiating a browser using mechanize. Also, using br.follow_link(text="foo") means that, unlike the XPath in your example, the link will still be followed no matter the structure of the elements in its ancestor path; with the XPath approach, if they update their HTML your script breaks. A looser coupling will save you some maintenance. Here is an example:

import cookielib
import mechanize

br = mechanize.Browser()
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# set all of the headers in one list (assigning to addheaders repeatedly
# would overwrite the previous value)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'),
                 ('Accept-Language', 'en-US'),
                 ('Accept-Encoding', 'gzip, deflate')]
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open("http://amazon.com")
br.follow_link(text="Today's Deals")
print br.response().read()

Also, in the "next 15" href there is probably something indicating pagination, e.g. &index=15. If the total number of items across all pages is available on the first page, then:

from BeautifulSoup import BeautifulSoup  # for BeautifulSoup 4, use: from bs4 import BeautifulSoup

soup = BeautifulSoup(br.response().read())
totalItems = soup.findAll(id="results-count-total")[0].text
startVar =  [x for x in range(int(totalItems)) if x % 15 == 0]

Then just iterate over startVar, build each URL by adding the startVar value to it, br.open() it, and scrape the data (a sketch follows below). That way you don't have to programmatically "find" the "next" link on the page and click it to advance to the next page; you already know all the valid URLs. Minimizing code-driven manipulation of the page to only the data you need will speed up your extraction.
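
For illustration, here is a minimal sketch of that loop, reusing the snippets above; the listing URL and the index query parameter are placeholders for whatever the real site actually uses:

import mechanize
from BeautifulSoup import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.mysite.com/listings")  # placeholder first page of listings

# read the total item count from the first page (the element id is assumed)
soup = BeautifulSoup(br.response().read())
totalItems = int(soup.findAll(id="results-count-total")[0].text)

# visit each page directly instead of clicking "next"
for start in [x for x in range(totalItems) if x % 15 == 0]:
    br.open("http://www.mysite.com/listings?index=%d" % start)  # assumed pagination parameter
    page_soup = BeautifulSoup(br.response().read())
    # ... pull the 15 listings out of page_soup here ...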

There are two ways of doing this sequentially:

  1. by defining a listing_url list under the class.
  2. by defining the listing_url inside parse_listings().

The only difference is verbiage. Also, suppose there are five pages of listing_urls to fetch, so put page = 1 under the class as well.

In the parse_listings method, only make a request once, and put all the data you need to keep track of into the meta. That being said, use parse_listings only to parse the "front page".

Once you have reached the end of the line, return your items. This process is sequential.

# (same imports as in the question's spider)
class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    listing_url = []
    page = 1

    def start_requests(self):
        return [Request(self.start_url, meta={'page': self.page}, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

        items = il.load_item()

        # populate listing_url with the scraped URLs
        self.listing_url.extend(listing.select('...').extract())

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()

        # now that the front page is done, move on to the first listing_url;
        # pass the next_page_url along in the meta data
        return Request(urlparse.urljoin(response.url, self.listing_url.pop(0)),
                       meta={'page': response.meta['page'], 'items': items, 'next_page_url': next_page_url},
                       callback=self.parse_listing_details)

    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        items = response.meta['items']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=items)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        items = il.load_item()

        # check whether there are more listing URLs from this page left to parse
        if self.listing_url:
            return Request(urlparse.urljoin(response.url, self.listing_url.pop(0)),
                           meta={'page': response.meta['page'], 'items': items,
                                 'next_page_url': response.meta['next_page_url']},
                           callback=self.parse_listing_details)
        elif not self.listing_url and response.meta['page'] != 5:
            # this page is exhausted but it is not the last one: loop back for more URLs to crawl
            return Request(urlparse.urljoin(response.url, response.meta['next_page_url'][0]),
                           meta={'page': response.meta['page'] + 1, 'items': items},
                           callback=self.parse_listings)
        else:
            # reached the end of the pages to crawl, return the data
            return items

You can yield requests or items as many times as you need.

def parse_category(self, response):
    # Get links to other categories
    categories = hxs.select('.../@href').extract()

    # First, return CategoryItem
    yield l.load_item()

    for url in categories:
        # Then return a request to parse each category
        yield Request(url, self.parse_category)

I found that here — https://groups.google.com/d/msg/scrapy-users/tHAAgnuIPR4/0ImtdyIoZKYJ

See below for an updated answer, under the EDIT 2 section (updated October 6th, 2017)

Is there any specific reason that you're using yield? Using yield turns the function into a generator, which only produces the Request object when .next() is invoked on it.

Change your yield statements to return statements and things should work as expected.

Here's an example of a generator:

In [1]: def foo(request):
   ...:     yield 1
   ...:     
   ...:     

In [2]: print foo(None)
<generator object foo at 0x10151c960>

In [3]: foo(None).next()
Out[3]: 1

EDIT:

Change your def start_requests(self) function to use the follow parameter.

return [Request(self.start_url, callback=self.parse_listings, follow=True)]

EDIT 2:

As of Scrapy v1.4.0, released on 2017-05-18, it is now recommended to use response.follow instead of creating scrapy.Request objects directly.

From the release notes:

There's a new response.follow method for creating requests; it is now a recommended way to create Requests in Scrapy spiders. This method makes it easier to write correct spiders; response.follow has several advantages over creating scrapy.Request objects directly:

  • it handles relative URLs;
  • it works properly with non-ASCII URLs on non-UTF8 pages;
  • in addition to absolute and relative URLs it supports Selectors; for <a> elements it can also extract their href values.

So, for the OP above, change the code from:

    next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                               'div[@class="next-link"]/a/@href').extract()
    if next_page_url:
        yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                      callback=self.parse_listings)

to:

    next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                               'div[@class="next-link"]/a/@href').extract_first()
    if next_page_url is not None:
        yield response.follow(next_page_url, self.parse_listings)

I just fixed this same problem in my code, using the SQLite3 database that ships with Python 2.7. Each item you are collecting info about gets its own row in a database table during the first pass of the parse function, and each instance of the parse callback adds that item's data to its row. Keep an instance counter so that the last parse callback knows it is the last one and can write the CSV file (or whatever output you need) from the database. The callback can be recursive, being told in meta which parse schema (and of course which item) it was dispatched to work with. Works for me like a charm. You have SQLite3 if you have Python. Here was my post when I first discovered scrapy's limitation in this regard: Is Scrapy's asynchronicity what is hindering my CSV results file from being created straightforwardly?
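
As a rough illustration of that bookkeeping (not the original poster's actual code), a minimal sketch with an assumed table layout and an expected-item counter might look like this:

import csv
import sqlite3

class ListingStore(object):
    """Sketch of the SQLite bookkeeping described above (schema and names are illustrative)."""

    def __init__(self, path="listings.db", expected=0):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS listings "
                          "(url TEXT PRIMARY KEY, title TEXT, description TEXT)")
        self.expected = expected    # how many items the spider plans to collect
        self.completed = 0

    def add_listing(self, url, title):
        # first pass: give every listing its own row
        self.conn.execute("INSERT OR IGNORE INTO listings (url, title) VALUES (?, ?)",
                          (url, title))
        self.conn.commit()

    def add_details(self, url, description):
        # later callbacks fill in the row; the counter tells us when we are done
        self.conn.execute("UPDATE listings SET description = ? WHERE url = ?",
                          (description, url))
        self.conn.commit()
        self.completed += 1
        if self.expected and self.completed == self.expected:
            self.dump_csv("listings.csv")

    def dump_csv(self, path):
        # the last callback writes everything out in one go
        with open(path, "wb") as f:    # "wb" for the Python 2 csv module
            writer = csv.writer(f)
            writer.writerow(["url", "title", "description"])
            writer.writerows(self.conn.execute("SELECT url, title, description FROM listings"))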

http://autopython.blogspot.com/2014/04/recursive-scraping-using-different.html

This example shows how to scrape multiple "next" pages from a website using different techniques.

You might want to look into two things.

  1. The website you are crawling may be blocking the user agent you have defined.
  2. Try adding a DOWNLOAD_DELAY to your spider (see the settings sketch after this list).
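
For example, a minimal settings sketch; both are standard Scrapy settings, but the user agent string and the delay value here are placeholders, not recommendations:

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'  # placeholder UA
DOWNLOAD_DELAY = 2   # seconds to wait between requests to the same website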
