
scrapy multiple item classes with extract method inside them

Just to state: I am not an experienced programmer, so don't be mad at me … I am exploring Scrapy's possibilities (I have some Python programming skills).

Scraping a website: let's imagine that some information can be extracted from Open Graph (og:) meta tags, like 'title', 'url' and 'description', other information from schema.org markup, like 'author', and finally that 'title', 'url', 'description' and 'date' can be extracted with "normal" XPath from the HTML, but only when they are not available from Open Graph or schema.org.

I created 3 item classes, OpengraphItem(Item), SchemaItem(Item) and MyItem(Item), in separate .py files. Inside each class there would be an extract method to extract the fields, like in this example:

class OpengraphItem(Item):
    title = Field()
    url = Field()
    description = Field()

    def extract(self, hxs):
        self.title = hxs.xpath('/html/head/meta[@property="og:title"]/@content').extract()
        self.url = hxs.xpath('/html/head/meta[@property="og:url"]/@content').extract()
        self.description = hxs.xpath('/html/head/meta[@property="og:description"]/@content').extract()

Then, in the spider code, the extract method would be called like this:

def parse_item(self, response):
    hxs = HtmlXPathSelector(response)

    my_item = MyItem()
    item_opengraph = OpengraphItem()
    item_opengraph.extract(hxs)

    item_schema = SchemaItem()
    item_schema.extract(hxs)

    my_item['date'] = hxs.xpath('/html/body//*/div[@class="reviewDate"]/span/time[@class="dtreviewed"]/@content').extract()

    my_item['title'] = item_opengraph.get('title', None)
    my_item['url'] = item_opengraph.get('url', None)
    my_item['description'] = item_opengraph.get('description', None)

    if my_item['url'] == None:
        my_item['url'] = response.url

    if my_item['title'] == None:
        my_item['title'] = hxs.xpath('/html/head/title/text()').extract()

    if my_item['description'] == None:
        my_item['description'] = hxs.xpath('/html/head/meta[@name="description"]/@content').extract()

    return my_item

Does this make any sense? Is it right to have the extract() method inside the Item class?

I took a look at this other question: scrapy crawler to pass multiple item classes to pipeline - and I don't know whether it is correct to have only one items.py with multiple, different classes inside it.

I also looked at Scrapy item extraction scope issue and scrapy single spider to pass multiple item classes to pipeline - should I have an item pipeline? I am not familiar with those; the Scrapy documentation describes their uses, and I don't think they fit this problem. And what about Item Loaders?

I omitted some parts of the code.

Is it right to have the extract() method inside the Item class?

That's very unusual. I can't say it's "not right", as the code will still work, but usually all the code related to page structure (such as selectors) stays in the Spider.
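
For example, here is a minimal sketch of that approach, keeping the question's XPaths but moving them into a spider and using MyItem purely as a container. The spider class, its name and the import paths are assumptions; the selector API matches the one used in the question.

from scrapy.spider import Spider
from scrapy.selector import HtmlXPathSelector

from myproject.items import MyItem  # hypothetical import path for the question's MyItem


class ReviewSpider(Spider):
    name = 'review_spider'  # assumed name

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = MyItem()
        # All page-structure knowledge (the XPaths) lives in the spider;
        # the item stays a plain container of fields.
        item['title'] = (hxs.xpath('/html/head/meta[@property="og:title"]/@content').extract()
                         or hxs.xpath('/html/head/title/text()').extract())
        item['url'] = (hxs.xpath('/html/head/meta[@property="og:url"]/@content').extract()
                       or [response.url])  # wrapped in a list to match the other fields
        item['description'] = (hxs.xpath('/html/head/meta[@property="og:description"]/@content').extract()
                               or hxs.xpath('/html/head/meta[@name="description"]/@content').extract())
        item['date'] = hxs.xpath('/html/body//*/div[@class="reviewDate"]/span/time[@class="dtreviewed"]/@content').extract()
        return item

Because .extract() returns a list, an empty result is falsy, so each or falls back to the plain-HTML XPath (or to response.url) in exactly the cases the question handles with the == None checks.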

Item Loaders might be useful for what you're trying to do; you should definitely give them a try.
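
Here is a minimal sketch of the "prefer og:, fall back to plain HTML" logic with an Item Loader, assuming a Scrapy version that exposes scrapy.loader.ItemLoader and scrapy.loader.processors.TakeFirst (older releases ship the same classes under scrapy.contrib.loader); the spider class and import path for MyItem are assumptions:

from scrapy import Spider
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

from myproject.items import MyItem  # the question's MyItem; the import path is assumed


class MyItemLoader(ItemLoader):
    # For every field, keep only the first non-empty value that was collected.
    default_output_processor = TakeFirst()


class ReviewLoaderSpider(Spider):  # hypothetical spider, same parse_item role as in the question
    name = 'review_loader_spider'

    def parse_item(self, response):
        loader = MyItemLoader(item=MyItem(), response=response)
        # Values are collected in the order they are added, so the og: value
        # wins when present and the plain-HTML XPath acts as the fallback.
        loader.add_xpath('title', '/html/head/meta[@property="og:title"]/@content')
        loader.add_xpath('title', '/html/head/title/text()')
        loader.add_xpath('description', '/html/head/meta[@property="og:description"]/@content')
        loader.add_xpath('description', '/html/head/meta[@name="description"]/@content')
        loader.add_xpath('url', '/html/head/meta[@property="og:url"]/@content')
        loader.add_value('url', response.url)
        loader.add_xpath('date', '/html/body//*/div[@class="reviewDate"]/span/time[@class="dtreviewed"]/@content')
        return loader.load_item()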

Another thing: attribute assignment to item fields, like

def extract(self, hxs):
    self.title = hxs [...]

will not work. Scrapy Items are dict-like; you should instead assign to, e.g., self['title'].
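
For example, the question's OpengraphItem would need to look like this for the extract() approach to work at all (same class, only the assignments changed to dict-style access):

from scrapy.item import Item, Field


class OpengraphItem(Item):
    title = Field()
    url = Field()
    description = Field()

    def extract(self, hxs):
        # Items support dict-style access; attribute assignment such as
        # self.title = ... is rejected by Scrapy Items.
        self['title'] = hxs.xpath('/html/head/meta[@property="og:title"]/@content').extract()
        self['url'] = hxs.xpath('/html/head/meta[@property="og:url"]/@content').extract()
        self['description'] = hxs.xpath('/html/head/meta[@property="og:description"]/@content').extract()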
