
Using a recursive spider in Scrapy [Python]

Intro: Hello, I'm using Scrapy to collect data from Yahoo Answers. My goal is to scrape all the questions and answers in one specific section.

What I tried: using Scrapy and Selenium, I first collect a list of all the questions in a section; this list is stored in the Spider class. Then I use a for loop to parse every single page:

    for url in self.start_urls_mod:
        yield scrapy.Request(url, callback=self.parse_page)
        i = i + 1

The parse_page method scrapes the question page: the best answer and all the other answers. This works fine.

The problem comes when I try to move on to the next question using the href of the "Next" link on the right side of the page. I call the same parse_page function again, passing it the URL taken from that link. Sometimes this works, but other times it doesn't. I don't know whether it is correct to call parse_page recursively like this without any base case; the recursion stops anyway.

The program runs without any errors and stops, but in the output I can't find all of the questions reached through the "Next" links; only some of them.

Here is a snippet of my code:

    def parse_page(self, response):
        # Scraping with XPath the things that interest me
        # Go to the next similar question
        next_page = response.xpath('((//a[contains(@class,"Clr-b")])[3])/@href').extract()
        composed_string = "https://answers.yahoo.com" + next_page[0]
        print("NEXT -> " + str(composed_string))
        yield scrapy.Request(urljoin(response.url, composed_string), callback=self.parse_page)

PS: I would use a CrawlSpider, but I can't define any rules that match only this type of question. So please tell me how I can improve my function.

Info: https://answers.yahoo.com/question/index?qid=20151008101821AAuHgCk

First of all, your XPath for selecting the next URL is wrong. You always take the third URL containing "Clr-b", which can be wrong: it may not exist, or it may not be the next page.

For such queries I would use a text search. In your case, something like this:

    next_page = response.xpath('//a[contains(@class,"Clr-b") and text()=" Next "]/@href').extract()

Then you compose your URL as you already do, and you do not need urljoin. It is not needed because you already have the full URL, which you yield as you do. This is probably why your spider stops: you generate a URL with urljoin which is not found, and that is not the URL you print to the console.
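To see why the urljoin call is redundant, note that urljoin returns its second argument unchanged whenever that argument is already an absolute URL. A small sketch with the standard library (the href value here is hypothetical):

```python
from urllib.parse import urljoin

base = "https://answers.yahoo.com/question/index?qid=20151008101821AAuHgCk"
href = "/question/index?qid=XXXX"  # hypothetical relative href from the page

# Composing the URL by hand already yields an absolute URL
composed = "https://answers.yahoo.com" + href

# urljoin leaves an absolute URL untouched, so wrapping the
# composed string in urljoin adds nothing:
assert urljoin(base, composed) == composed

# urljoin on the raw relative href alone would also have worked:
assert urljoin(base, href) == composed
```

So either compose the URL by concatenation or call urljoin on the raw href; doing both is redundant.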

And it is no problem to use the same function as the callback.
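To illustrate the difference between the positional and the text-based XPath, here is a small self-contained sketch using lxml (which Scrapy's selectors are built on); the HTML snippet and qid values are made up for illustration:

```python
from lxml import html

snippet = """
<div>
  <a class="Clr-b" href="/question/index?qid=AAA"> Previous </a>
  <a class="Clr-b" href="/question/index?qid=BBB"> Next </a>
</div>
"""
tree = html.fromstring(snippet)

# Positional selection: fragile -- here a third "Clr-b" link
# does not even exist, so nothing is matched.
pos = tree.xpath('((//a[contains(@class,"Clr-b")])[3])/@href')

# Text-based selection: matches the link by its visible label,
# regardless of its position among the "Clr-b" anchors.
nxt = tree.xpath('//a[contains(@class,"Clr-b") and text()=" Next "]/@href')

print(pos)  # []
print(nxt)  # ['/question/index?qid=BBB']
```

The positional expression silently returns an empty list (or a wrong link) as soon as the page layout changes, while the text-based one keeps pointing at the "Next" link.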
