
How to crawl multiple-level pages to one item in Scrapy?

All the Scrapy examples I have found talk about how to crawl a single page, or how to crawl multiple-level pages where each of the deepest pages is saved as an independent Item. But my situation is a bit more complex.

For example, the website structure is:

A (List page of books)
--> B (Book summary page)
----> C (Book review pages)
----> D (Book download pages)

And so the definition of the Item looks like:

class BookItem(scrapy.Item):
    name = scrapy.Field()
    type = scrapy.Field()
    introduction = scrapy.Field()
    resources = scrapy.Field() # To be a list of ResourceItem
    reviews = scrapy.Field() # To be a list of ReviewItem

# Download pages
class ResourceItem(scrapy.Item):
    title = scrapy.Field()
    createDate = scrapy.Field()
    author = scrapy.Field()
    link = scrapy.Field()


# Book reviews
class ReviewItem(scrapy.Item):
    title = scrapy.Field()
    createDate = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

How could I complete all fields of BookItem ? I know that I can write four methods, such as parse_A() , parse_B() , parse_C() and parse_D() , and that Scrapy lets them form a workflow by using yield scrapy.Request() at the end of each method.

But what should I return in the deepest methods, i.e. parse_C() and parse_D() ?

  1. If I return a ResourceItem or ReviewItem , it will be saved directly as its own item.
  2. If I return the BookItem from the upper methods, the incomplete item will be saved directly, too.
  3. If I return a Request for parse_D() in parse_C() , it will not work either, because resources may be empty (that is to say, there may be no links to C on the B pages at all). In that case parse_C() is never called, so parse_D() is never called either, and the D fields end up unfilled.

You can pass some data around using the meta parameter (see https://docs.scrapy.org/en/latest/topics/request-response.html).

So you can populate your item in multiple requests/parse functions.

A quick example to show the logic:

def parse_summary(self, response):
    book_item = BookItem()  # scrape the book's own fields here
    reviews_url = ...       # extract the reviews page url
    resources_url = ...     # extract the resources page url
    return scrapy.Request(reviews_url, callback=self.parse_reviews,
                          meta={'item': book_item, 'resources_url': resources_url})

def parse_reviews(self, response):
    book_item = response.meta.get('item')  # get the item draft
    book_item['reviews'] = ...  # extract reviews here (Items need dict-style access, not attributes)
    resources_url = response.meta.get('resources_url')
    return scrapy.Request(resources_url, callback=self.parse_resources,
                          meta={'item': book_item})

def parse_resources(self, response):
    book_item = response.meta.get('item')  # get the item draft
    book_item['resources'] = ...  # extract resources here
    return book_item  # once completed, return the item

Hope you get the idea (I'm not really confident about the code execution; I just wrote it down without testing).
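One gap in the chain above: it assumes both URLs are always present, while point 3 of the question notes that a page may have no such links at all. A minimal guard for that case could look like this (a sketch with hypothetical CSS selectors, untested like the code above):

def parse_summary(self, response):
    book_item = BookItem()  # scrape the book's own fields here
    # hypothetical selectors -- adjust to the real page structure
    reviews_url = response.css('a.reviews::attr(href)').get()
    resources_url = response.css('a.resources::attr(href)').get()
    if reviews_url:
        return scrapy.Request(response.urljoin(reviews_url), callback=self.parse_reviews,
                              meta={'item': book_item, 'resources_url': resources_url})
    if resources_url:
        return scrapy.Request(response.urljoin(resources_url), callback=self.parse_resources,
                              meta={'item': book_item})
    return book_item  # no deeper pages: the item is complete as scraped

parse_reviews() then needs the same check before following resources_url. On newer Scrapy versions, cb_kwargs is also a cleaner alternative to meta for passing data to callbacks.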

I can answer it by myself now.

Just yield None , or omit the return statement, in parse_C() and parse_D() ; that solves the problem.

Some explanation

Scrapy will not close the spider simply because one of the callbacks returns nothing; it only closes once the request queue is empty as well.

So, since parse_B() does not return None or an Item until it has finished yielding all of its requests for sub-pages C and D, the workflow is not interrupted.
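To make this concrete: the shared BookItem travels through meta , parse_C() and parse_D() mutate it and return nothing, and something still has to yield the finished item once the last sub-page has been parsed. A minimal sketch of one way to arrange that, using an outstanding-request counter (the counter and the selectors are illustrative additions, not part of the original answer):

def parse_B(self, response):
    book_item = BookItem(reviews=[], resources=[])
    # ... fill name / type / introduction from the summary page ...

    # hypothetical selectors -- adjust to the real page structure
    review_urls = response.css('a.review::attr(href)').getall()
    download_urls = response.css('a.download::attr(href)').getall()

    # meta is shallow-copied per request, so a mutable cell is shared by all callbacks
    pending = [len(review_urls) + len(download_urls)]
    if pending[0] == 0:
        yield book_item  # no sub-pages at all: emit immediately
        return
    for url in review_urls:
        yield scrapy.Request(response.urljoin(url), callback=self.parse_C,
                             meta={'item': book_item, 'pending': pending})
    for url in download_urls:
        yield scrapy.Request(response.urljoin(url), callback=self.parse_D,
                             meta={'item': book_item, 'pending': pending})

def parse_C(self, response):
    review = ReviewItem(title=response.css('h1::text').get())  # hypothetical selector
    response.meta['item']['reviews'].append(review)
    yield from self._finish_if_done(response)

def parse_D(self, response):
    resource = ResourceItem(link=response.url)
    response.meta['item']['resources'].append(resource)
    yield from self._finish_if_done(response)

def _finish_if_done(self, response):
    pending = response.meta['pending']
    pending[0] -= 1
    if pending[0] == 0:
        yield response.meta['item']  # last sub-page done: emit the completed book

If a sub-request can fail, an errback should decrement the counter too, otherwise the item is never emitted.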
