I'm using XMLFeedSpider from Scrapy to extract information from an XML page. I want to extract only the links inside the "loc" tags on this page, load each of them while blocking redirects, and then send the response to a final parse method that collects the information from that page. The problem is that I'm not sure whether it is possible to load these pages in "def start_requests", or whether I need to use parse_node and hand off to another parse method to extract the information I need. Even if I try that, I'm not sure how to extract just the links from the XML page rather than the whole loc tag.
To summarize my idea:
The idea is to load this XML page and extract the links inside the <loc> tags, like these:
https://www.gotdatjuice.com/track-2913133-sk-invitational-ft-sadat-x-lylit-all-one-cdq.html
https://www.gotdatjuice.com/track-2913131-sk-invitational-ft-mop-we-dont-stop-cdq.html
Then finally load each of these pages and extract the title and URL.
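For context, the sitemap follows the standard sitemaps.org urlset format, roughly like this (abbreviated sketch, not the full file):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.gotdatjuice.com/track-2913133-sk-invitational-ft-sadat-x-lylit-all-one-cdq.html</loc>
  </url>
  <url>
    <loc>https://www.gotdatjuice.com/track-2913131-sk-invitational-ft-mop-we-dont-stop-cdq.html</loc>
  </url>
</urlset>
```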
Any ideas?
Find below my code:
from scrapy.loader import ItemLoader
from scrapy.spiders import XMLFeedSpider
from scrapy.http import Request

from testando.items import CatalogueItem


class TestSpider(XMLFeedSpider):
    name = "test"
    allowed_domains = ["gotdajuice.ie"]
    start_urls = [
        'https://www.gotdatjuice.com/sitemap.xml'
    ]
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:loc'
    iterator = 'xml'

    name_path = ".//div[@class='song-name']/h1/text()"

    def start_request(self):
        urls = node.xpath(".//loc/text()").extract()
        for url in urls:
            yield scrapy.Request(
                meta={'dont_redirect': True},
                dont_filter=True,
                url=url, callback=self.parse_node)

    def parse_node(self, response, node):
        l = ItemLoader(item=CatalogueItem(), response=response)
        l.add_xpath('name', self.name_path)
        l.add_value('url', response.url)
        return l.load_item()
I don't understand your requirement to not redirect. Anyway, see the modified spider code below:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import XMLFeedSpider
from scrapy.http import Request


class TestSpider(XMLFeedSpider):
    name = "test"
    allowed_domains = ["gotdatjuice.com"]
    start_urls = [
        'https://www.gotdatjuice.com/sitemap.xml'
    ]
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:loc'
    iterator = 'xml'

    name_path = ".//div[@class='song-name']/h1/text()"

    def parse_node(self, response, node):
        urls = node.xpath("./text()").extract()
        for url in urls:
            yield scrapy.Request(
                meta={'dont_redirect': True},
                dont_filter=True,
                url=url, callback=self.parse_item)

    def parse_item(self, response):
        yield {
            'name': response.xpath(self.name_path).extract_first(),
            'url': response.url,
        }
Modifications:
- Imported the scrapy module, as later in the code you use scrapy.Request.
- Changed allowed_domains (.ie to .com) to reflect the actual domain you scrape.
- Your start_requests contained what actually needs to be in parse_node. Iteration over the loc elements is taken care of by the iterator and itertag settings of XMLFeedSpider, and the results are passed into parse_node. The code there then yields Requests for the item detail pages, which are parsed in parse_item.
- parse_item just yields the item in dict format, as I don't have access to your CatalogueItem.
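The namespace handling is the part that usually trips people up here: because the sitemap declares a default namespace, a plain `.//loc` XPath matches nothing, which is why the spider registers the `n` prefix in `namespaces` and uses `itertag = 'n:loc'`. A minimal stdlib sketch of the same idea, using a made-up two-URL sitemap:

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap snippet in the standard urlset format
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.gotdatjuice.com/track-2913133-sk-invitational-ft-sadat-x-lylit-all-one-cdq.html</loc></url>
  <url><loc>https://www.gotdatjuice.com/track-2913131-sk-invitational-ft-mop-we-dont-stop-cdq.html</loc></url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# Without the namespace mapping, './/loc' finds nothing:
assert root.findall('.//loc') == []

# With the prefix registered (analogous to the spider's `namespaces` setting),
# the same element names match:
ns = {'n': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
locs = [el.text for el in root.findall('.//n:loc', ns)]
print(locs)
```

XMLFeedSpider does this mapping for you once `namespaces` is set, so inside parse_node the relative `./text()` XPath on the node is all that's needed.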
You could use xmltodict instead:
import json

import xmltodict
from scrapy.http import Request

def start_requests(self):
    yield Request("https://www.gotdatjuice.com/sitemap.xml", callback=self.parse_sitemap)

def parse_sitemap(self, response):
    obj = xmltodict.parse(response.body)
    monString = json.dumps(obj)
    json_data = json.loads(monString)
    urls = json_data['urlset']['url']
    for url in urls:
        loc = url['loc']
        # follow each extracted link to whatever item callback you use
        yield Request(loc, callback=self.parse_item)
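For illustration, here is the dict shape the approach above works with, hardcoded so the sketch runs without the xmltodict dependency (the shape is an assumption based on a two-URL sitemap; a real parse result may also carry keys like lastmod). One caveat: when the sitemap contains only a single <url> element, xmltodict returns a dict rather than a list for `json_data['urlset']['url']`, so it is safer to normalize before iterating:

```python
import json

# Hardcoded stand-in for xmltodict.parse(response.body) on a two-URL sitemap
obj = {
    'urlset': {
        '@xmlns': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        'url': [
            {'loc': 'https://www.gotdatjuice.com/track-2913133-sk-invitational-ft-sadat-x-lylit-all-one-cdq.html'},
            {'loc': 'https://www.gotdatjuice.com/track-2913131-sk-invitational-ft-mop-we-dont-stop-cdq.html'},
        ],
    }
}

# The dumps/loads round trip from the answer; it only turns nested
# OrderedDicts into plain dicts, so here it is effectively a no-op:
json_data = json.loads(json.dumps(obj))

# Normalize the single-<url> case (a dict, not a list) before iterating:
urls = json_data['urlset']['url']
if isinstance(urls, dict):
    urls = [urls]

locs = [u['loc'] for u in urls]
print(locs)
```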