Scrapy not following links on parse_dir_contents callback

I am having trouble getting my spider to follow links. I've gone over the Scrapy tutorial many times and have searched quite a bit, but am still confused.

For some reason, even though there are hundreds of results spread over about 15-20 pages, my spider always returns 5 - 7 results and says it is done.

I've placed some print statements both right before my parse_dir_contents method is called and right as it begins running. For some reason, it is called 40 times (in two sets of 20), and only runs 5 - 7 times. I have about 20 results per page, and if I print out the URL that it's navigating to each time, it never makes it past page 1.

I'm sure that there are a lot of things I could do better in this code. Any help whatsoever would really be appreciated. I've really been working hard to make this work.

There's a fair amount of "helper" code in here which really clutters things up. My apologies, but I wanted to give you the exact code I'm using to get the best solution.

There are a number of "vip" listings that are repeated on every page. I just wanted to scrape those once and not have them factor into the numPages calculation.

It's really hard to pinpoint the problem because I can't reproduce the error with the code you provided. I don't know exactly what the problem is with your code, but I can give you some tips on improving it:

for regularListingContainer in body.xpath('//div[@class="search-item regular-ad"]'):
        # .re() returns a list of matches, so str() turns it into "['...']" rather than a clean URL
        link = str(regularListingContainer.re('href="(.*)" class="title">'))

You can chain xpath or css selector calls as many times as you like; when scraping, it's faster to stick with the Scrapy selectors. You can do body.xpath().xpath().css() and then just extract() the result to get the string:

for regularListingContainer in body.xpath('//div[@class="search-item regular-ad"]'):
        link = regularListingContainer.xpath('a[contains(@class, "title")]/@href').extract_first()
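As a minimal sketch of that chaining (the inner markup, an a element with class "title" inside each result div, is an assumption based on your xpath, not something I can verify):

for container in body.xpath('//div[@class="search-item regular-ad"]'):
        # chain .css() onto the selector; extract_first() returns a plain string (or None)
        link = container.css('a.title::attr(href)').extract_first()
        title = container.css('a.title::text').extract_first(default='').strip()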

When handling links, most of the time it's better to use response.urljoin(); let Scrapy do the heavy lifting and deal with relative or absolute paths for you.

link = regularListingContainer.xpath('a[contains(@class, "title")]/@href').extract_first()
yield Request(response.urljoin(link), callback=self.parse_dir_contents)
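For instance, response.urljoin() resolves the href against the page that was just crawled, so relative and absolute links both work (the URLs below are made up for illustration):

# assuming response.url == 'http://example.com/listings?page=1' (hypothetical)
response.urljoin('/ad/12345')                  # -> 'http://example.com/ad/12345'
response.urljoin('http://example.com/ad/99')   # absolute URLs pass through unchanged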

Scrapy handles requests asynchronously: every time you yield a Request it gets scheduled, and the callbacks run concurrently with each other. This means that you have no control over the order in which the code gets executed. My best bet would be that your global variables are not changing the way you think they are.

To solve this you can use the request's meta dict to pass information between callbacks, for example:

link = regularListingContainer.xpath('a[contains(@class, "title")]/@href').extract_first()
request = Request(response.urljoin(link), callback=self.anotherFunc)
request.meta['string'] = "I'm going on a journey"
yield request

def anotherFunc(self, response):
    foo = response.meta['string']
    print(foo)

This will output

I'm going on a journey
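(You can also pass the dict directly in the constructor, e.g. Request(url, callback=self.anotherFunc, meta={'string': ...}), which does the same thing.)

Putting the pieces together, here is a rough sketch of how the listing pages could be followed without relying on global counters; the start URL and the rel="next" pagination selector are assumptions about your site, not taken from your code:

from scrapy import Spider, Request

class ListingSpider(Spider):
    name = 'listings'
    start_urls = ['http://example.com/listings?page=1']  # hypothetical

    def parse(self, response):
        # follow every regular ad on the page
        for container in response.xpath('//div[@class="search-item regular-ad"]'):
            link = container.xpath('.//a[contains(@class, "title")]/@href').extract_first()
            if link:
                yield Request(response.urljoin(link), callback=self.parse_dir_contents)

        # follow the "next page" link instead of counting pages in a global
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield Request(response.urljoin(next_page), callback=self.parse)

    def parse_dir_contents(self, response):
        # parse the individual listing here
        pass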

Hope this helps; feel free to ask further.
