What's wrong with this Scrapy spider? It only scrapes the last URL
In the parse() method, the spider collects 4 URLs and sends them to parse_dir_contents(), which scrapes some data. But only the 4th URL actually gets scraped, and I don't understand why it doesn't scrape the other 3 URLs.
    import scrapy
    from v_one.items import VOneItem
    import json


    class linkedin(scrapy.Spider):
        name = "linkedin"
        allowed_domains = ["linkedin.com"]
        start_urls = [
            "https://in.linkedin.com/directory/people-s-1-2-4/",
        ]

        def parse(self, response):
            for href in response.xpath('//*[@id="seo-dir"]/div/div/div/ul/li/a/@href'):
                url = response.urljoin(href.extract())
                print "________________" + url
                yield scrapy.Request(url, callback=self.parse_dir_contents)

        def parse_dir_contents(self, response):
            for sel in response.xpath('//*[@id="profile"]'):
                url = response.url
                print "____________" + url
                item = VOneItem()
                item['name'] = sel.xpath('//*[@id="name"]/text()').extract()
                item['headline'] = sel.xpath('//*[@id="topcard"]/div/div/div/p/span/text()').extract()
                item['current'] = sel.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/span/text()').extract()
                item['education'] = sel.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/a/text()').extract()
                item['link'] = url
                yield item
From inspecting the page, I don't think the for loop in parse_dir_contents is needed: each profile page contains a single //*[@id="profile"] element, so there is nothing to iterate over. Rewrite the function like this:
    def parse_dir_contents(self, response):
        item = VOneItem()
        item['name'] = response.xpath('//*[@id="name"]/text()').extract()
        item['headline'] = response.xpath('//*[@id="topcard"]/div/div/div/p/span/text()').extract()
        item['current'] = response.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/span/text()').extract()
        item['education'] = response.xpath('//*[@id="topcard"]/div/div/div/table/tbody/tr/td/ol/li/a/text()').extract()
        item['link'] = response.url
        return item
and check whether that solves your problem.