
How can I properly run scrapy spiders from an external python script and get its item output

So I'm making a couple of scrapers and now I'm trying to make a script that runs the corresponding spiders with URLs collected from a DB, but I can't find a way to do this.

I have this in my spider:

class ElCorteIngles(scrapy.Spider):
    name = 'ElCorteIngles'
    url = ''
    DEBUG = False

    def start_requests(self):
        if self.url != '':
            yield scrapy.Request(url=self.url, callback=self.parse)

    def parse(self, response):
        # Get product name
        try:
            self.p_name = response.xpath('//*[@id="product-info"]/h2[1]/a/text()').get()
        except:
            print(f'{CERROR} Problem while getting product name from website - {self.name}')

        # Get product price
        try:
            self.price_no_cent = response.xpath('//*[@id="price-container"]/div/span[2]/text()').get()
            self.cent = response.xpath('//*[@id="price-container"]/div/span[2]/span[1]/text()').get()
            self.currency = response.xpath('//*[@id="price-container"]/div/span[2]/span[2]/text()').get()
            if self.currency == None:
                self.currency = response.xpath('//*[@id="price-container"]/div/span[2]/span[1]/text()').get()
                self.cent = None
        except:
            print(f'{CERROR} Problem while getting product price from website - {self.name}')

        # Join self.price_no_cent with self.cent
        try:
            if self.cent != None:
                self.price = str(self.price_no_cent) + str(self.cent)
                self.price = self.price.replace(',', '.')
            else:
                self.price = self.price_no_cent
        except:
            print(f'{ERROR} Problem while joining price with cents - {self.name}')

        # Return data
        if self.DEBUG == True:
            print([self.p_name, self.price, self.currency])

        data_collected = ShopScrapersItems()
        data_collected['url'] = response.url
        data_collected['p_name'] = self.p_name
        data_collected['price'] = self.price
        data_collected['currency'] = self.currency

        yield data_collected

Normally when I run the spider from the console I do:

scrapy crawl ElCorteIngles -a url='https://www.elcorteingles.pt/electrodomesticos/A26601428-depiladora-braun-senso-smart-5-5500/'

and now I need a way to do the same from an external script and get the output of yield data_collected.

What I currently have in my external script is this:

import scrapy
from scrapy.crawler import CrawlerProcess
import sqlalchemy as db
# Import internal libraries
from Ruby.Ruby.spiders import *

# Variables
engine = db.create_engine('mysql+pymysql://DATABASE_INFO')

class Worker(object):

    def __init__(self):
        self.crawler = CrawlerProcess({})

    def scrape_new_links(self):
        conn = engine.connect()

        # Get all new links from DB and scrape them
        query = 'SELECT * FROM Ruby.New_links'
        result = conn.execute(query)
        for x in result:
            telegram_id = x[1]
            email = x[2]
            phone_number = x[3]
            url = x[4]
            spider = x[5]
            
            # In this case the spider will be ElCorteIngles and the url
            # https://www.elcorteingles.pt/electrodomesticos/A26601428-depiladora-braun-senso-smart-5-5500/

            self.crawler.crawl(spider, url=url)
            self.crawler.start()

Worker().scrape_new_links()

I also don't know if doing url=url in self.crawler.crawl() is the proper way to give the URL to the spider, but let me know what you think. All the data from yield is being returned by a pipeline. I think there is no need for extra info, but if you need any, just let me know!
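(Note: passing url=url to crawl() is the programmatic equivalent of -a url=... on the command line: Scrapy forwards extra keyword arguments to the spider's constructor, and the default scrapy.Spider.__init__ copies them onto the instance, so they show up as self.url. A minimal sketch, assuming the Ruby project layout from the imports above:)

# Sketch: handing the URL to the spider from a script, equivalent to
#   scrapy crawl ElCorteIngles -a url=<product url>
from scrapy.crawler import CrawlerProcess
from Ruby.Ruby.spiders import ElCorteIngles  # assumes the spider is importable from this package

process = CrawlerProcess({})
process.crawl(
    ElCorteIngles,
    url='https://www.elcorteingles.pt/electrodomesticos/A26601428-depiladora-braun-senso-smart-5-5500/',
)
process.start()  # blocks until the crawl finishes; the reactor can only be started once per process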

Scrapy works asynchronously... ignore my imports, but this is from a JSON API I made for Scrapy. You need to make a custom runner with an item_scraped signal. There was originally a klein endpoint, and when the spider finished it would return a JSON list. I think this is what you want, but without the klein endpoint, so I've taken it out. My spider was GshopSpider; I replaced it with your spider's name.

By taking advantage of deferreds we are able to use callbacks and send a signal each time an item is scraped. Using this code we collect each item into a list via the item_scraped signal, and when the spider finishes we have a callback set up to return_spider_output.

# server.py
import json

from scrapy import signals
from scrapy.crawler import CrawlerRunner

from Ruby.Ruby.spiders import ElCorteIngles  # your spider, in place of my original GshopSpider
from scrapy.utils.project import get_project_settings


class MyCrawlerRunner(CrawlerRunner):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        crawler = self.create_crawler(crawler_or_spidercls)

        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        dfd = self._crawl(crawler, *args, **kwargs)

        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    return json.dumps([dict(item) for item in output])


if __name__ == "__main__":
    settings = get_project_settings()
    runner = MyCrawlerRunner(settings)
    deferred = runner.crawl(ElCorteIngles)
    deferred.addCallback(return_spider_output)
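One thing the snippet above no longer shows (because the klein endpoint was taken out) is who runs the Twisted reactor: CrawlerRunner does not start it for you. A minimal sketch of driving it standalone, reusing the names defined above, with the URL from the question and a print callback standing in for whatever you do with the JSON:

# Standalone driver (sketch): run the reactor and stop it when the crawl's deferred fires.
from twisted.internet import reactor
from scrapy.utils.log import configure_logging

configure_logging()
settings = get_project_settings()
runner = MyCrawlerRunner(settings)

deferred = runner.crawl(
    ElCorteIngles,
    url='https://www.elcorteingles.pt/electrodomesticos/A26601428-depiladora-braun-senso-smart-5-5500/',
)
deferred.addCallback(return_spider_output)  # list of items -> JSON string
deferred.addCallback(print)                 # stand-in: do something with the JSON
deferred.addBoth(lambda _: reactor.stop())  # stop the reactor on success or failure
reactor.run()                               # blocks until the crawl finishes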

The easiest way to do this would be something like this:

class ElCorteIngles(scrapy.Spider):
    name = 'ElCorteIngles'
    url = ''
    DEBUG = False

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # Establish your db connection here. This can be any database connection.
        # `engine` is the sqlalchemy engine from your external script;
        # raw_connection() gives a DB-API connection that supports .cursor().
        # Reuse this connection object anywhere else.
        self.conn = engine.raw_connection()

    def start_requests(self):
        with self.conn.cursor() as cursor:
            cursor.execute('''SELECT * FROM Ruby.New_links WHERE url IS NOT NULL AND url != %s''', ('',))
            result = cursor.fetchall()
        for row in result:
            # url is the 5th column of Ruby.New_links (row[4] in your script)
            yield scrapy.Request(url=row[4], dont_filter=True, callback=self.parse)

    def parse(self, response):
        # Your parse code here
        pass

After doing this you can initiate this crawler using something like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from project_name.spiders.filename import ElCorteIngles


process = CrawlerProcess(get_project_settings())
process.crawl(ElCorteIngles)
process.start()

Hope this helps.

I would also recommend having a queue if you are working with a large number of URLs. This will enable multiple spider processes to work on these URLs in parallel. You can initiate the queue in the init method.
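A rough sketch of that idea (an assumption, not tested code from this answer): a multiprocessing.Queue of URLs feeding a few worker processes, where each worker schedules its share of crawls and then starts a single CrawlerProcess, because the Twisted reactor can only be started once per process. The worker count is a hypothetical parameter to tune.

# Queue sketch: several worker processes pulling URLs from a shared queue.
import multiprocessing as mp

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from project_name.spiders.filename import ElCorteIngles  # same import as above


def worker(url_queue):
    process = CrawlerProcess(get_project_settings())
    scheduled = 0
    while True:
        url = url_queue.get()
        if url is None:              # sentinel: no more URLs
            break
        process.crawl(ElCorteIngles, url=url)  # only schedules the crawl
        scheduled += 1
    if scheduled:
        process.start()              # runs all scheduled crawls, blocks until done


if __name__ == '__main__':
    urls = [
        # in practice, read these from the Ruby.New_links table
        'https://www.elcorteingles.pt/electrodomesticos/A26601428-depiladora-braun-senso-smart-5-5500/',
    ]

    url_queue = mp.Queue()
    for url in urls:
        url_queue.put(url)

    n_workers = 4                    # hypothetical: tune to your machine and politeness limits
    workers = [mp.Process(target=worker, args=(url_queue,)) for _ in range(n_workers)]
    for _ in workers:
        url_queue.put(None)          # one sentinel per worker
    for w in workers:
        w.start()
    for w in workers:
        w.join()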
