How to follow links in Scrapy if there is no href?
I am trying to follow links in Scrapy after I have already parsed one page and extracted information from it. The problem is that the pagination element on the page has no href, so I can't simply follow it. I managed to extend my XPath query with @data-param and finally got something: page=2.
The problem is that I am not sure how to follow this link, because I want to pass listName["listLinkMaker"] to my URL generator or composer.
Should I write another method, say def parse_pagination, to follow the links?
The JSON used in the code is really simple:
[
{"storeName": "Interspar", "storeLinkMaker": "https://popusti.njuskalo.hr/trgovina/Interspar"}
]
Code below:
# -*- coding: utf-8 -*-
import scrapy
import json


class LoclocatorTestSpider(scrapy.Spider):
    name = "loclocator_test"
    start_urls = []

    with open("test_one_url.json", encoding="utf-8") as json_file:
        data = json.load(json_file)
        for store in data:
            storeName = store["storeName"]
            storeLinkUrl = store["storeLinkMaker"]
            start_urls.append(storeLinkUrl)

    def parse(self, response):
        selector = "//div[@class='mainContentWrapInner cf']"
        store_name_selector = ".//h1[@class='title']/text()"
        store_branches_selector = ".//li/a[@class='xiti']/@href"
        for basic_info in response.xpath(selector):
            store_branches = {}
            store_branches["storeName"] = basic_info.xpath(store_name_selector).extract_first()
            # This specific XPath extracts 1st part of link needed to crawl all of store branches
            store_branches["storeBranchesLink"] = basic_info.xpath(store_branches_selector).extract_first() + "?"
            store_branches_url = basic_info.xpath(store_branches_selector).extract_first()
            # The callback must be a method that exists on the spider; parse_branches reads this meta
            yield response.follow(store_branches_url, self.parse_branches, meta={"store_branches": store_branches})

    def parse_branches(self, response):
        store_branches_name_selector = "//li[@class='xiti']"
        store_branches = response.meta["store_branches"]
        for store_branch in response.xpath(store_branches_name_selector):
            store_branches["storeBranchName"] = store_branch.xpath(".//span[@class='title']/text()").extract_first()
            yield store_branches
        # This specific XPath extracts 2nd part of link needed to crawl all of store branches
        # URL should look like: https://popusti.njuskalo.hr/trgovina/Interspar?page=n where n>0
        links = response.selector.xpath("//li[@class='next']/button[@class='nBtn link xiti']/@data-param").extract()
        for link in links:
            absolute_url = #LIST FROM FIRST PARSE (ie. store_branches["storeBranchesLink"]) + link
            yield scrapy.Request(absolute_url, callback=self.parse_branches)
Thank you.
I managed to find a solution myself; I was already relatively close to it.
In the part below:
    # This specific XPath extracts 2nd part of link needed to crawl all of store branches
    # URL should look like: https://popusti.njuskalo.hr/trgovina/Interspar?page=n where n>0
    links = response.selector.xpath("//@data-param").extract()
    store_branches = response.meta["store_branches"]
    for link in links:
        absolute_url = store_branches["storeBranchesLink"] + link
        # Pass store_branches along again; meta is not carried over automatically,
        # so without it response.meta["store_branches"] raises KeyError on the next page
        yield scrapy.Request(absolute_url, callback=self.parse_branches, meta={"store_branches": store_branches})
I believe the key was reading store_branches back from response.meta: with the base link available, the spider was then able to build and request every page (?page=n where n>0). My understanding of the code is fairly rudimentary, so if anyone can add more technical detail, please do answer.
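As a side note, the string concatenation used for absolute_url could also be done with the standard library, which avoids placing the "?" by hand. This is only a minimal sketch and not part of the original spider: build_page_url is a hypothetical helper name, and it assumes the data-param value is a plain query string such as page=2.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url, data_param):
    """Return base_url with the query string from data_param merged in.

    build_page_url is a hypothetical helper; data_param is assumed to be
    a plain query string such as "page=2" taken from the button's
    data-param attribute.
    """
    parts = urlsplit(base_url)
    # Keep any query parameters the base URL already has, then let the
    # values from data_param override or extend them.
    query = dict(parse_qsl(parts.query))
    query.update(parse_qsl(data_param))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Inside parse_branches, something like absolute_url = build_page_url(response.url, link) would be one way to use it. Because the page parameter in the current URL is overridden rather than appended, the base link would not need to be carried along just to build the next URL.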