
What's wrong with this Scrapy spider? It scrapes only the last URL

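In parse(), the spider extracts 4 URLs and sends each one to parse_dir_contents() to scrape some data, but only the 4th URL is scraped. I don't understand why the other 3 URLs are not scraped.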

import scrapy
from v_one.items import VOneItem
import json

class linkedin(scrapy.Spider):
    name = "linkedin"
    allowed_domains = ["linkedin.com"]
    start_urls = [
    "https://in.linkedin.com/directory/people-s-1-2-4/",
    ]

    def parse(self, response):

        for href in response.xpath('//*[@id="seo-dir"]/div/div/div/ul/li/a/@href'):
            url = response.urljoin(href.extract())
            print("________________" + url)
            yield scrapy.Request(url, callback=self.parse_dir_contents)



    def parse_dir_contents(self, response):

        for sel in response.xpath('//*[@id="profile"]'):
            url = response.url
            print("____________" + url)
            item = VOneItem()
            item['name'] = sel.xpath('//*[@id="name"]/text()').extract()
            item['headline'] = sel.xpath('//*[@id="topcard"]/div/div/div/p/span/text()').extract()
            item['current'] = sel.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/span/text()').extract()
            item['education'] = sel.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/a/text()').extract()
            item['link'] = url
            yield item

After inspecting the page, I don't think the for loop in parse_dir_contents is needed. Change the function to look like this:

    def parse_dir_contents(self, response):
        item = VOneItem()
        item['name'] = response.xpath('//*[@id="name"]/text()').extract()
        item['headline'] = response.xpath('//*[@id="topcard"]/div/div/div/p/span/text()').extract()
        item['current'] = response.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/span/text()').extract()
        item['education'] = response.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/a/text()').extract()
        item['link'] = response.url
        return item

Then check whether that solves your problem.
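
One possible reason the loop matters, sketched below: a callback that loops over a container yields nothing at all on a page where that container does not match, while a callback without the loop always builds one item. Whether this is what happened on the asker's pages is not confirmed here. The sketch uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and the sample markup, the URL, and the function names items_with_loop and item_without_loop are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a page that has a name
# but no element with id="profile" (not real LinkedIn markup).
PAGE_WITHOUT_CONTAINER = """<html><body>
  <h1 id="name">Jane Doe</h1>
</body></html>"""


def items_with_loop(root, url):
    """Mimic the original callback: one item per id="profile" container."""
    items = []
    for _container in root.findall(".//*[@id='profile']"):
        # Like the '//...' XPaths in the spider, this searches the whole
        # document, not just the current container.
        names = [el.text for el in root.findall(".//*[@id='name']")]
        items.append({"name": names, "link": url})
    return items  # stays empty when no container matches


def item_without_loop(root, url):
    """Mimic the suggested callback: always build exactly one item."""
    names = [el.text for el in root.findall(".//*[@id='name']")]
    return {"name": names, "link": url}


root = ET.fromstring(PAGE_WITHOUT_CONTAINER)
url = "https://example.com/profile/jane"  # hypothetical URL
looped = items_with_loop(root, url)
single = item_without_loop(root, url)
print(len(looped), single["name"])
```

In the spider itself, logging response.url at the top of the callback, before any loop, would show whether all 4 pages actually reach parse_dir_contents.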


