Crawl urls with same content (ex. /product) when crawling with scrapy

How can I crawl multiple pages with the same condition with scrapy? Example: I want to identify all the product pages in an eCommerce site that don't contain a product photo (or something similar).

import scrapy


class SomewebsiteProductSpider(scrapy.Spider):
    name = "test"
    allowed_domains = ["test.com"]
    # start_urls entries need the full URL, including the scheme
    start_urls = ["http://test.com/product"]

In many examples I've seen, the start URL always corresponds to a single page.

Is it possible? Thanks!

If you want to identify all items of a website, it's good practice to start with one page -- typically the main page of the site -- and start the crawling from there. You want to use the page of the site that lists all the categories you are interested in.

With scrapy you can define which links the spider should follow and which pages it should parse to return information to you.

So it's possible, and scrapy is a great tool for that.
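Here is a minimal sketch of how that can look with scrapy's CrawlSpider: start at the main page, follow category links, parse every product page, and yield the URL of any product that has no photo. The /category/ and /product/ URL patterns, the img.product-photo selector, and the spider name are assumptions for illustration; replace them with the real site's URL structure and markup.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MissingPhotoSpider(CrawlSpider):
    name = "missing_photo"
    allowed_domains = ["test.com"]
    # Start from the main page and let the rules discover the rest of the site.
    start_urls = ["http://test.com/"]

    rules = (
        # Follow category listings without parsing them (assumed /category/ URL pattern).
        Rule(LinkExtractor(allow=r"/category/")),
        # Parse every product page (assumed /product/ URL pattern).
        Rule(LinkExtractor(allow=r"/product/"), callback="parse_product"),
    )

    def parse_product(self, response):
        # Hypothetical selector: report the page only if no product photo is found.
        if not response.css("img.product-photo"):
            yield {"url": response.url}

Running it with scrapy crawl missing_photo -o missing_photos.json collects the offending URLs in a JSON file.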

