
How to extract URLs from an XML page, load them, and extract information from them using Scrapy's XMLFeedSpider?

I'm using Scrapy's XMLFeedSpider to extract information from an XML page. I'm trying to extract only the links that are inside the "loc" tags on this page, load each of them (while blocking redirects), and then send them to a final parse method that collects the information from those pages. The problem is that I'm not sure whether it is possible to load these pages in start_requests, or whether I need to use parse_node and hand off to another parse method to extract the information I need. But even if I try that, I'm not sure how to extract just the links from the XML page rather than the whole loc tag.

To summarize my idea:

The idea is to load this XML page and extract the links inside its <loc> tags, like these:

https://www.gotdatjuice.com/track-2913133-sk-invitational-ft-sadat-x-lylit-all-one-cdq.html
https://www.gotdatjuice.com/track-2913131-sk-invitational-ft-mop-we-dont-stop-cdq.html

Then, finally, load each of these pages and extract its title and URL.
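For reference, a sitemap page like this one typically looks roughly as follows (a hand-written excerpt, not the actual file; note the default sitemap namespace on <urlset>, which matters for any XPath against it):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.gotdatjuice.com/track-2913133-sk-invitational-ft-sadat-x-lylit-all-one-cdq.html</loc>
  </url>
  <url>
    <loc>https://www.gotdatjuice.com/track-2913131-sk-invitational-ft-mop-we-dont-stop-cdq.html</loc>
  </url>
</urlset>
```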

Any ideas?

Find below my code:

from scrapy.loader import ItemLoader
from scrapy.spiders import XMLFeedSpider
from scrapy.http import Request
from testando.items import CatalogueItem

class TestSpider(XMLFeedSpider):

    name = "test"
    allowed_domains = ["gotdajuice.ie"]
    start_urls = [      
        'https://www.gotdatjuice.com/sitemap.xml'
    ]   

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:loc'
    iterator = 'xml'


    name_path = ".//div[@class='song-name']/h1/text()"


    def start_request(self):
      urls = node.xpath(".//loc/text()").extract()
      for url in urls:
          yield scrapy.Request(
            meta={'dont_redirect': True},
            dont_filter=True,
            url=url, callback=self.parse_node)

    def parse_node(self, response, node):

        l = ItemLoader(item=CatalogueItem(), response=response)
        l.add_xpath('name', self.name_path)
        l.add_value('url', response.url)
        return l.load_item()

I don't understand your requirement not to redirect. Anyway, see the modified spider code below:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import XMLFeedSpider
from scrapy.http import Request

class TestSpider(XMLFeedSpider):
    name = "test"
    allowed_domains = ["gotdatjuice.com"]
    start_urls = [
        'https://www.gotdatjuice.com/sitemap.xml'
    ]

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:loc'
    iterator = 'xml'

    name_path = ".//div[@class='song-name']/h1/text()"

    def parse_node(self, response, node):
        urls = node.xpath("./text()").extract()
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse_item,
                meta={'dont_redirect': True},
                dont_filter=True)

    def parse_item(self, response):
        yield {
            'name': response.xpath(self.name_path).extract_first(),
            'url': response.url,
        }

Modifications:

  1. Imported the scrapy module, as the code later uses scrapy.Request.
  2. Changed allowed_domains from .ie to .com (gotdatjuice.com) to reflect the actual domain you scrape.
  3. Your start_requests contained what actually needs to be in parse_node. Iteration over the loc elements is taken care of by the iterator and itertag settings of XMLFeedSpider, and the matched nodes are passed into parse_node. The code there then yields Requests for the item detail pages, which are parsed in parse_item.
  4. parse_item just yields the item as a plain dict, as I don't have access to your CatalogueItem.
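The namespace handling that the spider's namespaces and itertag = 'n:loc' settings rely on can be seen in isolation with the standard library — a minimal sketch using a hand-written sitemap snippet rather than the real file:

```python
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1.html</loc></url>
  <url><loc>https://example.com/page-2.html</loc></url>
</urlset>"""

# The sitemap elements live in a default namespace, so the XPath must
# qualify 'loc' with a prefix mapped to that namespace URI -- exactly
# what the ('n', '...') entry in the spider's `namespaces` setting does.
ns = {'n': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(SITEMAP)
urls = [loc.text for loc in root.findall('.//n:loc', ns)]
print(urls)  # ['https://example.com/page-1.html', 'https://example.com/page-2.html']
```

An unqualified `.//loc` would find nothing here, which is why `itertag = 'loc'` without the namespace mapping fails on sitemap files.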

You should use xmltodict

import json

import xmltodict
from scrapy.http import Request

def start_requests(self):
    yield Request("https://www.gotdatjuice.com/sitemap.xml",
                  callback=self.parse_sitemap)

def parse_sitemap(self, response):
    # Parse the XML into nested dicts, then round-trip through json
    # to turn them into plain dicts and lists.
    obj = xmltodict.parse(response.body)
    json_data = json.loads(json.dumps(obj))

    urls = json_data['urlset']['url']
    for url in urls:
        loc = url['loc']
        # Hand each page URL off to an item callback (e.g. the
        # parse_item method from the answer above).
        yield Request(loc, callback=self.parse_item)
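xmltodict.parse turns the XML tree into nested dicts, so the indexing above operates on a structure like the following (a hand-built dict standing in for the parsed sitemap):

```python
# Shape of xmltodict.parse(...) for a sitemap, after the json round-trip:
# one 'urlset' key whose 'url' key holds a list of {'loc': ...} dicts.
json_data = {
    'urlset': {
        'url': [
            {'loc': 'https://example.com/page-1.html'},
            {'loc': 'https://example.com/page-2.html'},
        ]
    }
}

locs = [url['loc'] for url in json_data['urlset']['url']]
print(locs)  # ['https://example.com/page-1.html', 'https://example.com/page-2.html']
```

One caveat with xmltodict: when the sitemap contains only a single <url> element, 'url' comes back as a dict rather than a list, so the loop would iterate over its keys instead of the entries.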
