
How to run Scrapy from within a Python script

I am new to Scrapy and I am looking for a way to run it from a Python script. I found 2 sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where to put my spider code and how to call it from the main function. Please help. Here is the example code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script. 
# 
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# 
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet. 

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times. 
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010

Thank you.

All the other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
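
For completeness, here is a minimal, self-contained sketch of the same pattern with the spider definition filled in. The spider name, start URL, and CSS selectors are illustrative assumptions, not part of the original answer:

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider, used only to make the snippet runnable
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()  # blocks here until the crawl is finished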

We can simply use

from scrapy.crawler import CrawlerProcess
from project.spiders.test_spider import SpiderName

process = CrawlerProcess()
process.crawl(SpiderName, arg1=val1, arg2=val2)
process.start()

and use these arguments inside the spider's __init__ method, making them available to the whole spider (see the sketch below).
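
A hedged illustration of how those keyword arguments reach the spider; the class name, argument names, and URL below are assumptions made for the sketch:

import scrapy

class SpiderName(scrapy.Spider):
    name = 'test_spider'

    def __init__(self, arg1=None, arg2=None, *args, **kwargs):
        # Keyword arguments passed to process.crawl() arrive here
        super(SpiderName, self).__init__(*args, **kwargs)
        self.arg1 = arg1
        self.arg2 = arg2

    def start_requests(self):
        # e.g. build the start URL from a passed-in argument
        yield scrapy.Request('http://example.com/?q=%s' % self.arg1)

    def parse(self, response):
        yield {'arg1': self.arg1, 'arg2': self.arg2}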

Although I haven't tried it, I think the answer can be found in the Scrapy documentation. To quote it directly:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

As far as I can tell, this is a recent development in the library which renders some of the earlier approaches online (such as the one in the question) obsolete.

In Scrapy 0.19.x you should do this:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal was sent

Note these lines:

settings = get_project_settings()
crawler = Crawler(settings)

Without them, your spider will not use your settings and will not save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the documentation example.
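
A quick way to see the difference is to compare what the two settings objects actually contain; this is only a sketch and assumes the script runs inside the project so that scrapy.cfg can be found:

from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings

bare = Settings()
project = get_project_settings()

# The bare Settings object only knows the framework defaults,
# while the project settings include e.g. your ITEM_PIPELINES.
print(bare.getdict('ITEM_PIPELINES'))     # usually {}
print(project.getdict('ITEM_PIPELINES'))  # pipelines from settings.py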

Another way is to simply call the command directly from your script:

from scrapy import cmdline
cmdline.execute("scrapy crawl followall".split())  #followall is the spider's name

Copied this answer from my first answer here: https://stackoverflow.com/a/19060485/1402286
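
cmdline.execute() accepts the same arguments as the scrapy command line, so exporting works too; note that it exits the Python process when the command finishes, so keep it as the last statement. An illustrative variant, not part of the original answer:

from scrapy import cmdline

# Same as running `scrapy crawl followall -o items.json` in a shell
cmdline.execute('scrapy crawl followall -o items.json'.split())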

When multiple crawlers need to be run inside one Python script, the reactor stop needs to be handled with care, because the reactor can only be stopped once and cannot be restarted.
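
One documented way around this is to schedule all spiders on a single CrawlerProcess before starting it, so the reactor starts and stops exactly once; a sketch assuming Scrapy 1.x and hypothetical Spider1/Spider2 classes:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical spider classes; substitute your own imports
from project.spiders.spider1 import Spider1
from project.spiders.spider2 import Spider2

process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)  # schedule both crawls before starting
process.start()         # the reactor starts (and stops) exactly once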

However, while working on my project I found that using

os.system("scrapy crawl yourspider")

is the easiest. This saves me from having to handle all sorts of signals, especially when I have multiple spiders.
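
If you would rather avoid shelling out through a string, the standard-library subprocess module (Python 3.5+) does the same thing; a small sketch, not from the original answer:

import subprocess

# Equivalent to os.system('scrapy crawl yourspider'), but raises
# CalledProcessError if the crawl exits with a non-zero status
subprocess.run(['scrapy', 'crawl', 'yourspider'], check=True)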

If performance is a concern, you can use multiprocessing to run your spiders in parallel, something like:

import os
from multiprocessing import Pool


def _crawl(spider_name=None):
    if spider_name:
        os.system('scrapy crawl %s' % spider_name)
    return None


def run_crawler():
    spider_names = ['spider1', 'spider2', 'spider2']

    # One worker process per spider; each worker shells out to the scrapy CLI
    pool = Pool(processes=len(spider_names))
    pool.map(_crawl, spider_names)

It is an improvement on "Scrapy throws an error when run using crawlerprocess":

https://github.com/scrapy/scrapy/issues/1904#issuecomment-205331087

First create your usual spider so that it runs successfully from the command line. It is very important that it runs and exports data, images, or files.

Once that is done, add the code exactly as pasted in my program: the part above the spider class definition and the part below the __name__ check, which invokes the settings.

It will pick up the necessary settings, which the widely recommended "from scrapy.utils.project import get_project_settings" failed to do for me.

The parts above and below the spider definition must both be present; with only one of them it will not run. The spider has to be run from the folder containing scrapy.cfg, not from any other folder.

The project tree is included below for reference.

#Tree
(project tree image not available)

#spider.py
import sys
sys.path.append(r'D:\ivana\flow') #folder where scrapy.cfg is located

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from flow import settings as my_settings

#----------------Typical Spider Program starts here-----------------------------

          # spider class definition here

#----------------Typical Spider Program ends here-------------------------------

if __name__ == "__main__":

    crawler_settings = Settings()
    crawler_settings.setmodule(my_settings)

    process = CrawlerProcess(settings=crawler_settings)
    process.crawl(FlowSpider) # it is for class FlowSpider(scrapy.Spider):
    process.start(stop_after_crawl=True)

# -*- coding: utf-8 -*-
import sys
from scrapy.cmdline import execute


def gen_argv(s):
    sys.argv = s.split()


if __name__ == '__main__':
    gen_argv('scrapy crawl abc_spider')
    execute()

Put this code in a path from which you can run scrapy crawl abc_spider on the command line. (Tested with Scrapy==0.24.6)

If you want to run a simple crawl, it is easy by just running the command:

scrapy crawl <spider_name>

There is another option to export the results and store them in certain formats, such as JSON, XML, or CSV:

scrapy crawl <spider_name> -o result.csv (or result.json, or result.xml)
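
The same export can be configured when running from a script by passing feed settings to CrawlerProcess. The sketch below assumes Scrapy 2.1 or newer (for the FEEDS setting) and a hypothetical MySpider class:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from project.spiders.my_spider import MySpider  # hypothetical spider

settings = get_project_settings()
# Write all scraped items to result.json; 'csv' and 'xml' work the same way
settings.set('FEEDS', {'result.json': {'format': 'json'}})

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()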

You may want to try

