
Recursively Scraping Web Pages With Scrapy

"http://www.example.com/listing.php?num=2&"

Here is the code of my spider, which displays the list of links on a single page:

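from scrapy.log import *
from scrapy.contrib.spiders import CrawlSpider  # Scrapy 0.x import path
from scrapy.selector import HtmlXPathSelector   # Scrapy 0.x selector API
from crawler_bhinneka.settings import *
from crawler_bhinneka.items import *
import pprint
from MySQLdb import escape_string
import urlparse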

def complete_url(string):
    """Return complete url"""
    return "http://www.example.com" + string


class BhinnekaSpider(CrawlSpider):

    name = 'bhinneka_spider'
    start_urls = [
        'http://www.example.com/listing.php?'
    ]
    def parse(self, response):

        hxs = HtmlXPathSelector(response)

        # HXS to find url that goes to detail page
        items = hxs.select('//td[@class="lcbrand"]/a/@href')
        for item in items:
            link = item.extract()
            print("my Url Link : ",complete_url(link))

Now I can get all the links on my first page.
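
The complete_url helper in the spider builds absolute URLs by string concatenation, which only works when every href starts with "/". A minimal sketch of a more forgiving version, assuming Python 3 (in Python 2 the same function is urlparse.urljoin, and the spider already imports urlparse). The detail.php paths are hypothetical examples, not taken from the real site:

```python
from urllib.parse import urljoin

# Start URL from the question; hrefs are resolved relative to it.
BASE_URL = "http://www.example.com/listing.php?"


def complete_url(href, base=BASE_URL):
    """Resolve an href found on the listing page against the page URL."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only.
print(complete_url("/detail.php?id=1"))            # root-relative link
print(complete_url("detail.php?id=1"))             # page-relative link
print(complete_url("http://other.example.org/x"))  # already absolute, left unchanged
```

Inside a spider callback, passing response.url as the base instead of a constant keeps the resolution correct on every page that is crawled.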

I want to make this spider recursive, with a rule that follows the link to the next page. How should I write the rules in the spider so that it also gets the link values from the following pages?

EDIT

@Toan, thank you for the reply. I tried to follow the tutorial you linked, but I only get the item values from one page (the first page).

I looked at the source code of this URL: "http://sfbay.craigslist.org/npo/" and I do not see anything that matches the XPath given in restrict_xpaths (class="nextpage" does not exist in the page source).

Here is the rule from the example you linked:

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",), restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse_items", follow=True),
    )

Scrapy's link extractors are used for extracting links from web pages.

Here's an example: http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.U9Dl8h_FsUQ
