What's wrong with this Scrapy spider? It only scrapes the last URL
In the parse() method, the spider collects 4 URLs and sends them to parse_dir_contents(), which scrapes some data. But only the 4th URL actually gets scraped, and I don't understand why it doesn't scrape the other 3 URLs.
    import scrapy
    from v_one.items import VOneItem
    import json


    class linkedin(scrapy.Spider):
        name = "linkedin"
        allowed_domains = ["linkedin.com"]
        start_urls = [
            "https://in.linkedin.com/directory/people-s-1-2-4/",
        ]

        def parse(self, response):
            for href in response.xpath('//*[@id="seo-dir"]/div/div/div/ul/li/a/@href'):
                url = response.urljoin(href.extract())
                print "________________" + url
                yield scrapy.Request(url, callback=self.parse_dir_contents)

        def parse_dir_contents(self, response):
            for sel in response.xpath('//*[@id="profile"]'):
                url = response.url
                print "____________" + url
                item = VOneItem()
                item['name'] = sel.xpath('//*[@id="name"]/text()').extract()
                item['headline'] = sel.xpath('//*[@id="topcard"]/div/div/div/p/span/text()').extract()
                item['current'] = sel.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/span/text()').extract()
                item['education'] = sel.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/a/text()').extract()
                item['link'] = url
                yield item
From inspecting the page, I don't think the for loop in parse_dir_contents is needed: each profile page contains a single //*[@id="profile"] element, so there is nothing to iterate over. Rewrite the function like this:
    def parse_dir_contents(self, response):
        item = VOneItem()
        item['name'] = response.xpath('//*[@id="name"]/text()').extract()
        item['headline'] = response.xpath('//*[@id="topcard"]/div/div/div/p/span/text()').extract()
        item['current'] = response.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/span/text()').extract()
        item['education'] = response.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/a/text()').extract()
        item['link'] = response.url
        return item
and check whether that solves your problem.