
Is Scrapy single-threaded or multi-threaded?

There are a few concurrency settings in Scrapy, like CONCURRENT_REQUESTS. Does that mean the Scrapy crawler is multi-threaded? If I run scrapy crawl my_crawler, will it literally fire multiple simultaneous requests in parallel? I'm asking because I've read that Scrapy is single-threaded.

Scrapy is single-threaded, except for the interactive shell and some tests; see the source.

It's built on top of Twisted, which is single-threaded too, and makes use of its own asynchronous concurrency capabilities, such as twisted.internet.interfaces.IReactorThreads.callFromThread; see the source.

Scrapy does most of its work synchronously. However, the handling of requests is done asynchronously.
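To make that concrete, here is a minimal sketch (the spider name and URLs are made up for illustration): all the requests yielded from start_requests are handed to the engine up front, and the parse callback fires on the single reactor thread whenever a response arrives, not in the order the requests were issued.

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"  # hypothetical spider, for illustration only

        def start_requests(self):
            # All of these requests are scheduled without blocking
            # on any single response.
            for n in range(10):
                yield scrapy.Request(f"https://example.com/page/{n}", callback=self.parse)

        def parse(self, response):
            # Callbacks run one at a time on the reactor thread, in
            # whatever order the responses come back.
            self.logger.info("got %s", response.url)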

I suggest this page if you haven't already seen it:

http://doc.scrapy.org/en/latest/topics/architecture.html

edit: I realize now the question was about threading and not necessarily about whether it's asynchronous. That link would still be a good read though :)

Regarding your question about CONCURRENT_REQUESTS: this setting changes the number of requests that Twisted will defer at once. Once that many requests have been started, it will wait for some of them to finish before starting more.
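As a sketch, the relevant knobs live in settings.py (the values shown for the first two are Scrapy's documented defaults; the DOWNLOAD_DELAY of 0.25 is just an illustrative choice):

    # settings.py: concurrency is capped by settings, not by a thread pool
    CONCURRENT_REQUESTS = 16            # max requests in flight across the crawler (default)
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # max concurrent requests to any one domain (default)
    DOWNLOAD_DELAY = 0.25               # optional politeness delay between requests to a site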

Scrapy is a single-threaded framework; we cannot use multiple threads within a spider at the same time. However, we can create multiple spiders and pipelines at the same time to make the process concurrent. Scrapy does not support multi-threading because it is built on Twisted, an asynchronous networking framework.

Scrapy is a single-threaded framework, but we can run multiple spiders at the same time.

Please read this article:

https://levelup.gitconnected.com/how-to-run-scrapy-spiders-in-your-program-7db56792c1f7#:~:text=We%20use%20the%20CrawlerProcess%20class,custom%20settings%20for%20the%20Spider

We can use subprocess to run spiders.

    # Launch the "quotes" spider in a child process, exporting items to JSON.
    import subprocess
    subprocess.run(["scrapy", "crawl", "quotes", "-o", "quotes_all.json"])

or

Use CrawlerProcess to run multiple spiders in the same process.
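A minimal sketch following the pattern from the Scrapy docs (MySpider1 and MySpider2 are placeholder spider classes, assumed to live in your project):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders import MySpider1, MySpider2  # placeholder import path

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider1)  # schedule both spiders on the same reactor
    process.crawl(MySpider2)
    process.start()           # blocks until both crawls have finished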

If you want to run multiple spiders per process, or want to fetch and use the scraped items directly in your program, you need to use the internal API of Scrapy. The variant below goes one step further and distributes the crawls across worker processes:

    # Run spiders with the internal API of Scrapy, one crawl per worker process.
    from functools import partial
    import multiprocessing

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()

    def crawler_func(spider, url):
        # Each worker process gets its own CrawlerProcess (and its own
        # Twisted reactor, which can only be started once per process).
        crawler_process = CrawlerProcess(settings)
        crawler_process.crawl(spider, url=url)  # url is passed to the spider's __init__
        crawler_process.start()

    def start_spider(spider, urls):
        # Fan the URLs out across a pool of worker processes.
        p = multiprocessing.Pool(100)
        return p.map(partial(crawler_func, spider), urls)
