
No data scraped after recursive scraping

I'm trying to recursively scrape job titles from https://iowacity.craigslist.org/search/jjj . That is to say, I want the spider to scrape all the job titles on page 1, then follow the "next >" link at the bottom to scrape the next page, and so on. I modeled my spider on Michael Herman's tutorial: http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.ViJ6rPmrTIU

Here is my code:

import scrapy
from craig_rec.items import CraigRecItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class CraigslistSpider(CrawlSpider):
    name = "craig_rec"
    allowed_domains = ["https://craigslist.org"]
    start_urls = ["https://iowacity.craigslist.org/search/jjj"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        for sel in response.xpath("//span[@class = 'pl']"):
            item = CraigRecItem()
            item['title'] = sel.xpath("a/text()").extract()
            items.append(item)
        return items  

I ran the spider, but no data was scraped. Any help? Thanks!

When you set allowed_domains to "https://craigslist.org", the spider stops crawling because requests to the subdomain 'iowacity.craigslist.org' are filtered out as offsite requests.

You must set it as:

allowed_domains = ["craigslist.org"]

According to the docs, allowed_domains is a list of strings containing the domains this spider is allowed to crawl. Each entry is expected in the form domain.com, which lets the spider crawl the domain itself and all of its subdomains.

You can also be more specific and list only certain subdomains, or allow all requests by leaving the attribute empty; a sketch of the options follows.
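For illustration, a minimal sketch of those options (the second subdomain below is just an example, not from the question):

# Allow the domain and all of its subdomains (what this answer recommends)
allowed_domains = ["craigslist.org"]

# Or allow only specific subdomains
allowed_domains = ["iowacity.craigslist.org", "desmoines.craigslist.org"]

# Or leave the attribute out (or empty) so no offsite filtering is applied
# allowed_domains = []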

Michael Herman's tutorial is great, but it targets an older version of Scrapy. This snippet avoids some deprecation warnings and also turns parse_page into a generator:

import scrapy
from craig_rec.items import CraigRecItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CraiglistSpider(CrawlSpider):
    name = "craiglist"
    allowed_domains = ["craigslist.org"]  # domain only, so subdomains pass the offsite filter
    start_urls = (
        'https://iowacity.craigslist.org/search/jjj/',
    )

    rules = (
        # follow the "next >" pagination link and parse every page it reaches
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # yield one item per job title on the page
        for sel in response.xpath("//span[@class = 'pl']"):
            item = CraigRecItem()
            item['title'] = sel.xpath(".//a/text()").extract()
            yield item
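The item class imported above isn't shown in the question; a minimal craig_rec/items.py along these lines (an assumption, since the original file isn't posted) makes the snippet self-contained:

import scrapy


class CraigRecItem(scrapy.Item):
    # single field holding the scraped job title text
    title = scrapy.Field()

The spider can then be run with, for example, scrapy crawl craiglist -o jobs.json.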

This post also has some great tips on scraping Craigslist.
