
Running Multiple Scrapy Spiders (the easy way) Python

Scrapy is pretty cool; however, I found the documentation to be very bare-bones, and some simple questions were tough to answer. After piecing together techniques from various Stack Overflow answers, I have finally come up with an easy and not overly technical way to run multiple Scrapy spiders. I would imagine it's less technical than trying to implement scrapyd, etc.

So here is one spider that works well at doing its one job of scraping some data after a FormRequest:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.http import FormRequest
from swim.items import SwimItem

class MySpider(BaseSpider):
    name = "swimspider"
    start_urls = ["swimming website"]

    def parse(self, response):
        return [FormRequest.from_response(response, formname="AForm",
                    formdata={"lowage": "20", "highage": "25"},
                    callback=self.parse1, dont_click=True)]

    def parse1(self, response):       
        #open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        for row in rows[4:54]:
            item = SwimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["swimtime"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)           
        return items
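
For completeness, SwimItem lives in the project's swim/items.py. The original file isn't shown here, but a minimal sketch with fields matching what parse1() fills in would look something like this:

from scrapy.item import Item, Field

class SwimItem(Item):
    # one field per column scraped in parse1()
    names = Field()
    age = Field()
    swimtime = Field()
    team = Field()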

Instead of deliberately writing out the formdata with the form inputs I wanted, i.e. "20" and "25":

formdata={"lowage": "20", "highage": "25}

I used "self." 我用的是“自我”。 + a variable name: +变量名称:

formdata={"lowage": self.lowage, "highage": self.highage}

This then allows you to call the spider from the command line with whatever arguments you want (see below). Use the Python subprocess call() function to run those command lines one after another, easily. It means I can go to my command line, type "python scrapymanager.py", and have all of my spiders do their thing, each with different arguments passed on its command line, and download their data to the correct place:

#scrapymanager

from random import randint
from time import sleep
from subprocess import call

#free
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='10025' -o free.json -t json"], shell=True)
sleep(randint(15,45))

#breast
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='30025' -o breast.json -t json"], shell=True)
sleep(randint(15,45))

#back
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='20025' -o back.json -t json"], shell=True)
sleep(randint(15,45))

#fly
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='40025' -o fly.json -t json"], shell=True)
sleep(randint(15,45))

So rather than spending hours trying to rig up a complicated single spider that crawls each form in succession (in my case different swim strokes), this is a pretty painless way to run many, many spiders "all at once" (I did include a delay between each Scrapy call with the sleep() function).
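
Note that call() blocks until each crawl has finished, so the four runs above actually happen one after another rather than truly in parallel (which is fine here, and keeps the site from being hammered). If you genuinely wanted them overlapping, a rough sketch using subprocess.Popen with the same commands would start them all without waiting:

from subprocess import Popen

commands = [
    "scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='10025' -o free.json -t json",
    "scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='30025' -o breast.json -t json",
]

# Popen returns immediately, so all the crawls run at the same time
processes = [Popen(cmd, shell=True) for cmd in commands]
for p in processes:
    p.wait()  # then block until every crawl has finished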

Hopefully this helps someone.

Here is the easy way. You need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print ("Running spider %s" % (spider_name))
    process.crawl(spider_name,query="dvh") #query dvh is custom argument used in your scrapy

process.start()

Then run it. That's it!
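
One caveat: the spiders attribute on CrawlerProcess has since been deprecated in favour of spider_loader, so on newer Scrapy releases the same loop would be written like this (a sketch, otherwise identical in behaviour):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(get_project_settings())

# spider_loader replaces the deprecated .spiders attribute
for spider_name in process.spider_loader.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name, query="dvh")  # query is still the custom argument

process.start()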

Yes, there is an excellent companion to Scrapy called scrapyd that does exactly what you are looking for, among many other goodies. You can also launch spiders through it, like this:

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
{"status": "ok", "jobid": "26d1b1a6d6f111e0be5c001e648c57f8"}

You can add your custom parameters as well, using -d param=123.

By the way, spiders are scheduled rather than launched immediately, because scrapyd manages a queue with a (configurable) maximum number of spiders running in parallel.
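
For example, the swim spider from the first answer could be scheduled through scrapyd with its arguments passed the same way (the project name myproject below is just a placeholder):

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=swimspider -d lowage=20 -d highage=25 -d sex=W -d StrkDist=10025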

Your method makes it procedural, which makes it slow, against Scrapy's main principle. To make it asynchronous as always, you can try using CrawlerProcess:

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

from myproject.spiders import spider1, spider2

process = CrawlerProcess(get_project_settings())
# pass the spider classes themselves (Spider1/Spider2 stand in for your actual class names);
# CrawlerProcess instantiates them
process.crawl(spider1.Spider1)
process.crawl(spider2.Spider2)
process.start()

If you want to see the full log of the crawl, set LOG_FILE in your settings.py.

LOG_FILE = "logs/mylog.log"
