
Scrapy crawl and follow links within href

I am very new to Scrapy. I need to follow the hrefs from the homepage of a URL to multiple depths. Inside the href links there are further hrefs, and I need to follow these until I reach the page I want to scrape. The sample HTML of my pages is:

Initial page:

<div class="page-categories">
  <a class="menu" href="/abc.html"></a>
  <a class="menu" href="/def.html"></a>
</div>

Inside abc.html:

<div class="cell category">
  <div class="cell-text category">
    <p class="t">
      <a id="cat-24887" href="fgh.html"></a>
    </p>
  </div>
</div>

I need to scrape the contents of this fgh.html page. Could anyone please suggest where to start? I read about link extractors but could not find a suitable reference to begin with. Thank you.

From what I see, I can say that:

  • URLs to product categories always end with .kat
  • URLs to products contain id_ followed by a set of digits
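The two patterns can be sanity-checked on their own before wiring them into the spider. The URLs below are invented purely for illustration; only the two regular expressions come from the rules themselves:

```python
import re

# The same patterns used in the spider's rules.
CATEGORY_RE = re.compile(r'\.kat$')
PRODUCT_RE = re.compile(r'/id_\d+/')

# Hypothetical URLs, made up for this example.
category_url = 'http://www.codecheck.info/essen/suessigkeiten.kat'
product_url = 'http://www.codecheck.info/produkt/id_1234567/'

print(bool(CATEGORY_RE.search(category_url)))  # True: ends with .kat
print(bool(PRODUCT_RE.search(product_url)))    # True: contains id_ plus digits
print(bool(PRODUCT_RE.search(category_url)))   # False: no product id
```

A quick check like this is an easy way to debug why a `LinkExtractor` is (or isn't) picking up a given link.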

Let's use this information to define our spider rules:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CodeCheckspider(CrawlSpider):
    name = "code_check"

    allowed_domains = ["www.codecheck.info"]
    start_urls = ['http://www.codecheck.info/']

    rules = [
        Rule(LinkExtractor(allow=r'\.kat$'), follow=True),
        Rule(LinkExtractor(allow=r'/id_\d+/'), callback='parse_product'),
    ]

    def parse_product(self, response):
        title = response.xpath('//title/text()').get()
        print(title)

In other words, we are asking the spider to follow every category link and to let us know when it crawls a link containing id_ — which for us means we have found a product. In this case, for the sake of an example, I'm printing the page title to the console. This should give you a good starting point.
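To see what `response.xpath('//title/text()')` does in the callback without running a full crawl, here is a rough stand-in using only the standard library's `html.parser`; the HTML string is invented for illustration:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text content of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data

# Made-up page, standing in for a crawled product page.
html = '<html><head><title>Example product</title></head><body></body></html>'
parser = TitleParser()
parser.feed(html)
print(parser.title)  # Example product
```

In the real spider you would keep the XPath version, since Scrapy hands the callback a `Response` object with `.xpath()` and `.css()` selectors built in.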

Notice: the technical posts on this site follow the CC BY-SA 4.0 license; if you reprint them, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 