Scrapy: crawl multiple pages, extract data, and save to MySQL

Question

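Hi, can someone please help me out? I seem to be stuck. I am learning how to crawl with Scrapy and save the results into MySQL. I am trying to get Scrapy to crawl all of the website's pages, starting from "start_urls", but it does not seem to crawl all the pages automatically, only the first one; that page does get saved into MySQL through pipelines.py. When I instead provide the URLs through f = open("urls.txt"), it does crawl all the pages and saves the data using pipelines.py.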

Here is my code:

test.py

import scrapy
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from gotp.items import GotPItem
from scrapy.log import *
from gotp.settings import *
from gotp.items import *

class GotP(CrawlSpider):
    name = "gotp"
    allowed_domains = ["www.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]
    rules = [
        Rule(SgmlLinkExtractor(
            allow=('')),
            callback ="parse",
            follow=True
        )
    ]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        prices = hxs.select('//div[@class="sliderforward arrow"]')
        for price in prices:
            item = GotPItem()
            item ["price"] = price.select("text()").extract()
            yield item

Answer 1

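If I understand correctly, you are trying to follow the pagination and extract the results.

In this case, you can avoid using CrawlSpider and use the regular Spider class instead.

The idea is to parse the first page, extract the total result count, calculate how many pages there are, and yield scrapy.Request instances for the same URL, providing the s GET parameter value as the offset of each page.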
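The offset arithmetic can be checked on its own before wiring it into a spider. The following is a minimal sketch, not part of the original answer: the helper name page_urls and the sample total of 250 results are assumptions made for illustration, and only the standard library is used.

```python
from urllib.parse import urlencode


def page_urls(base_url, total_count, results_per_page=100):
    """Return one search URL per result page (hypothetical helper).

    The `s` GET parameter carries the offset of the first result on the page.
    """
    return [
        "%s?%s" % (base_url, urlencode({"s": offset}))
        for offset in range(0, total_count, results_per_page)
    ]


# Assumed example: 250 results at 100 per page gives offsets 0, 100 and 200.
urls = page_urls("http://sfbay.craigslist.org/search/sss", 250)
for url in urls:
    print(url)
```

With 250 results this yields three URLs, one per page; a total of 0 yields no URLs at all, so no extra requests would be made.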

Example implementation:

import scrapy


class GotP(scrapy.Spider):
    name = "gotp"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/sss"]

    results_per_page = 100

    def parse(self, response):
        # Read the total number of results from the first page.
        total_count = int(response.xpath('//span[@class="totalcount"]/text()').extract()[0])
        # Request every result page; `s` is the offset of the first result on that page.
        for page in range(0, total_count, self.results_per_page):
            yield scrapy.Request("http://sfbay.craigslist.org/search/sss?s=%s" % page, callback=self.parse_result, dont_filter=True)

    def parse_result(self, response):
        # Each listing is a <p> element carrying a data-pid attribute.
        results = response.xpath("//p[@data-pid]")
        for result in results:
            try:
                print(result.xpath(".//span[@class='price']/text()").extract()[0])
            except IndexError:
                # Listings without a price span end up here.
                print("Unknown price")

This follows the pagination and prints the prices on the console. Hope this is a good starting point.

Solution 1
0 accepted, 2015-03-31 05:01:30
