Filtering item in scrapy pipeline

I've scraped the URLs I want from a page. Now I want to filter them for keywords using a pipeline:

class GumtreeCouchesPipeline(object):

    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if any(key in item['url'] for key in keywords):
            return item

Problem is it's returning nothing now.

The spider:

import scrapy
from gumtree_couches.items import adItem
from urllib.parse import urljoin

class GumtreeSpider(scrapy.Spider):
    name = 'GumtreeCouches'
    allowed_domains = ['https://someurl']
    start_urls = ['https://someurl']

    def parse(self, response):
        item = adItem()
        for ad_links in response.xpath('//div[@class="view"][1]//a'):
            relative_url = ad_links.xpath('@href').extract_first()
            item['title'] = ad_links.xpath('text()').extract_first()
            item['url'] = response.urljoin(relative_url)

            yield item

How can I filter all the scraped URLs for keywords using the pipeline? Thanks!

This should fix your problem:

class GumtreeCouchesPipeline(object):

    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if any(key in item['url'] for key in self.keywords):
            return item

Notice that I'm using self.keywords to refer to the keywords class attribute.

If you look at your spider logs, you should find some errors saying something like: NameError: name 'keywords' is not defined.
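For context, here is a minimal standalone sketch (mine, not from the original answers) of why the unqualified name fails: inside a method body, a class attribute is not part of the local or global scope, so it has to be reached through self (or through the class name).

class Demo(object):

    keywords = ['leather', 'couches']

    def broken(self):
        # raises NameError: name 'keywords' is not defined
        return [k for k in keywords]

    def working(self):
        # class attributes resolve through the instance (or Demo.keywords)
        return [k for k in self.keywords]

Demo().working()  # ['leather', 'couches']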

Anyway, I'd recommend implementing this pipeline like this:

from scrapy.exceptions import DropItem

class GumtreeCouchesPipeline(object):

    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        if not any(key in item['url'] for key in self.keywords):
            raise DropItem('missing keyword in URL')
        return item

This way, you'll have information about the dropped items in the job stats once the job is finished.
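Also remember that a pipeline only runs if it is enabled in the project settings. A minimal sketch, assuming the package layout implied by the spider's imports (a gumtree_couches package with a pipelines module):

# settings.py -- enable the pipeline; the integer (0-1000) controls execution order
ITEM_PIPELINES = {
    'gumtree_couches.pipelines.GumtreeCouchesPipeline': 300,
}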

From reading the documentation, I think you have to cater for all paths, e.g.:

from scrapy.exceptions import DropItem

class GumtreeCouchesPipeline(object):

    def process_item(self, item, spider):
        keywords = ['leather', 'couches']
        if item['url']:
            if any(key in item['url'] for key in keywords):
                return item
            else:
                raise DropItem("Missing specified keywords.")
        else:
            return item
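One caveat worth noting (my observation, not part of the original answer): on a Scrapy Item, item['url'] raises KeyError rather than returning a falsy value when the field was never populated, so a slightly more defensive sketch would use item.get():

from scrapy.exceptions import DropItem

class GumtreeCouchesPipeline(object):

    keywords = ['leather', 'couches']

    def process_item(self, item, spider):
        # item.get() returns None instead of raising KeyError for unset fields
        url = item.get('url')
        if url and not any(key in url for key in self.keywords):
            raise DropItem('missing keyword in URL')
        return item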
