
Scrapy not following links on parse_dir_contents callback

I am having trouble getting my spider to follow links. I've gone over the Scrapy tutorial many times and have searched quite a bit, but am still confused.

For some reason, even though there are hundreds of results spread over about 15-20 pages, my spider always returns 5-7 results and says it is done.

I've placed print statements both right before my parse_dir_contents method is called and right as it begins running. For some reason, it is called 40 times (in two sets of 20) but only runs 5-7 times. I have about 20 results per page, and if I print out the URL it navigates to each time, it never makes it past page 1.

I'm sure that there are a lot of things I could do better in this code. Any help whatsoever would really be appreciated. I've really been working hard to make this work.

There's a fair amount of "helper" code in here which really clutters things up. My apologies, but I wanted to give you the exact code I'm using to get the best solution.

There are a number of "vip" listings that are duplicated on every page, so I wanted to scrape those only once and not have them factor into the numPages calculation.

It's really hard to pinpoint the problem because I can't reproduce the error with the code you provided. I don't know exactly what is wrong with your code, but I can give you some tips for improving it:

for regularListingContainer in body.xpath('//div[@class="search-item regular-ad"]'):
    link = str(regularListingContainer.re('href="(.*)" class="title">'))

You can chain xpath and css selector calls. When scraping, it's faster to stick with Scrapy's own selectors: you can do body.xpath().xpath().css(), and to get the string you just call extract() (or extract_first()):

for regularListingContainer in body.xpath('//div[@class="search-item regular-ad"]'):
    link = regularListingContainer.xpath('a[contains(@class, "title")]/@href').extract_first()

When handling links, most of the time it's better to use response.urljoin(); let Scrapy do the heavy lifting of resolving relative and absolute paths:

link = regularListingContainer.xpath('a[contains(@class, "title")]/@href').extract_first()
yield Request(response.urljoin(link), callback=self.parse_dir_contents)
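Scrapy's response.urljoin(link) resolves the extracted href against the current page's URL. The standard library's urljoin does the same resolution; a small illustration (the example.com URLs are hypothetical):

```python
from urllib.parse import urljoin

# A hypothetical page URL that the spider is currently parsing.
page_url = "http://example.com/search/page-1.html"

# An absolute path is resolved against the site root:
print(urljoin(page_url, "/listings/item-123.html"))
# -> http://example.com/listings/item-123.html

# A relative path is resolved against the current directory:
print(urljoin(page_url, "item-456.html"))
# -> http://example.com/search/item-456.html

# An already-absolute URL is left untouched:
print(urljoin(page_url, "http://other.example.org/x.html"))
# -> http://other.example.org/x.html
```

This is why passing a bare relative href straight to Request() silently produces requests that go nowhere, while urljoin-ed links always resolve correctly.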

Scrapy is asynchronous (it runs on Twisted's event loop): every time you yield a request, it is queued and the callbacks run concurrently with each other. This means you have no control over which callback executes first. My best bet would be that your global variables are not changing the way you think they are.
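The "no control over the flow" point can be sketched with the standard library's asyncio (an analogy only, since Scrapy itself runs on Twisted): tasks complete in an order set by timing, not by the order they were scheduled, which is why per-request state is safer than module-level globals.

```python
import asyncio

async def fetch(page, delay, results):
    # Simulate a download whose duration depends on the server,
    # not on the order in which the request was scheduled.
    await asyncio.sleep(delay)
    results.append(page)

async def main():
    results = []
    # Scheduled in order 1, 2, 3 -- but they finish by delay.
    await asyncio.gather(
        fetch(1, 0.03, results),
        fetch(2, 0.01, results),
        fetch(3, 0.02, results),
    )
    return results

print(asyncio.run(main()))  # -> [2, 3, 1], not [1, 2, 3]
```

A global counter incremented inside the callbacks would see these completions interleaved unpredictably, which is exactly the kind of bug the meta approach below avoids.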

To solve this you can use the request's meta dict to pass information between callbacks, for example:

link = regularListingContainer.xpath('a[contains(@class, "title")]/@href').extract_first()
request = Request(response.urljoin(link), callback=self.anotherFunc)
request.meta['string'] = "I'm going on a journey"
yield request

def anotherFunc(self, response):
    foo = response.meta['string']
    print(foo)

This will output

I'm going on a journey

Hope this helps; feel free to ask further questions.
