Filtering items in a Scrapy pipeline
I've scraped the URLs I want from a page. Now I want to filter them for keywords using a pipeline:
class GumtreeCouchesPipeline(object):
    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if any(key in item['url'] for key in keywords):
            return item
Problem is it's returning nothing now.
The spider:
import scrapy
from gumtree_couches.items import adItem
from urllib.parse import urljoin

class GumtreeSpider(scrapy.Spider):
    name = 'GumtreeCouches'
    allowed_domains = ['https://someurl']
    start_urls = ['https://someurl']

    def parse(self, response):
        item = adItem()
        for ad_links in response.xpath('//div[@class="view"][1]//a'):
            relative_url = ad_links.xpath('@href').extract_first()
            item['title'] = ad_links.xpath('text()').extract_first()
            item['url'] = response.urljoin(relative_url)
            yield item
How can I filter all the scraped URLs for keywords using the pipeline? Thanks!
This should fix your problem:
class GumtreeCouchesPipeline(object):
    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if any(key in item['url'] for key in self.keywords):
            return item
Notice that I'm using self.keywords to refer to the keywords class attribute.
If you look at your spider logs, you should find some errors saying something like: NameError: name 'keywords' is not defined.
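The difference can be reproduced outside Scrapy. A minimal sketch (the Demo class and the sample string are made up for illustration):

```python
class Demo:
    keywords = ['leather', 'couches']

    def broken(self):
        # A bare name lookup does not see class attributes,
        # so this raises NameError at runtime.
        return any(key in 'leather-sofa' for key in keywords)

    def working(self):
        # Class attributes must be reached via self (or Demo.keywords).
        return any(key in 'leather-sofa' for key in self.keywords)


try:
    Demo().broken()
except NameError as e:
    print(e)  # name 'keywords' is not defined

print(Demo().working())  # True
```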
Anyway, I'd recommend you implement this pipeline like this:
from scrapy.exceptions import DropItem

class GumtreeCouchesPipeline(object):
    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if not any(key in item['url'] for key in self.keywords):
            raise DropItem('missing keyword in URL')
        return item
This way, you'll have the information about the dropped items in the job stats once the job is finished.
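Also remember that a pipeline only runs if it is enabled in the project settings. A sketch, assuming the project package is called gumtree_couches (adjust the dotted path to your own project):

```python
# settings.py -- enable the pipeline; the integer (0-1000) sets
# the order in which pipelines run, lower numbers first.
ITEM_PIPELINES = {
    'gumtree_couches.pipelines.GumtreeCouchesPipeline': 300,
}
```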
From reading the documentation, I think you have to cater for all paths, e.g.:
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    keywords = ['leather', 'couches']
    if item['url']:
        if any(key in item['url'] for key in keywords):
            return item
        else:
            raise DropItem("Missing specified keywords.")
    else:
        return item
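Either way, the keyword check itself is a plain substring match, which you can try in isolation. A small sketch (the url_matches helper and the example URLs are hypothetical):

```python
keywords = ['leather', 'couches']

def url_matches(url):
    # True if any keyword occurs anywhere in the URL string
    return any(key in url for key in keywords)

print(url_matches('https://example.com/ads/leather-sofa'))  # True
print(url_matches('https://example.com/ads/fabric-chair'))  # False
```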