
Python Scrapy - How to scrape from 2 different websites at the same time?

I need to scrape data for a list of domains given in an Excel file. The problem is that, for each domain, I need data from the original website (for example, https://www.lepetitballon.com ) and data from SimilarTech ( https://www.similartech.com/websites/lepetitballon.com ).

I want to scrape both at the same time so that I can collect the results and format them once at the end; after that I'll just move on to the next domain.

Theoretically, should I just use 2 spiders running asynchronously with Scrapy?

Ideally, keep spiders that scrape differently structured sites separate; that way your code will be much easier to maintain in the long run.

If, for some reason, you MUST parse them in the same spider, you could collect the URLs you want to scrape and invoke a different parser callback method depending on each URL's base URL (its domain). That being said, I personally cannot think of a reason why you would have to do that. Even if the sites had the same structure, you could just reuse your scrapy.Item classes across separate spiders.

Scrapy uses the Twisted networking library for its internal networking, so requests are already handled asynchronously, and Scrapy provides settings to control how many requests run concurrently.

Explained here: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
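As a rough example of those settings, a project's settings.py might contain something like the excerpt below. The values are arbitrary assumptions chosen for illustration, not recommendations; tune them to what the target sites tolerate.

```python
# settings.py (excerpt); the values below are illustrative assumptions.

# Upper bound on requests the downloader performs concurrently, overall.
CONCURRENT_REQUESTS = 32

# Upper bound on concurrent requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.5
```

With settings like these, a single spider that is given both URLs for a domain fetches them concurrently without any extra code.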

Or you could use multiple spiders that are independent of each other, which is already explained in the Scrapy docs; this might be what you are looking for.

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

As for efficiency, you could choose either option: raising concurrency within a single spider (option A) or running multiple spiders in the same process (option B). This really depends on your resources and requirements: option A can be good for limited resources with decent speed, while option B can be ideal for better speed at the cost of higher resource consumption than option A.


 