
Scrapy Python spider unable to find links using LinkExtractor or by manual Request()

I am trying to write a Scrapy spider that crawls through all the results pages on the domain: https://www.ghcjobs.apply2jobs.com... . The code should do three things:

(1) Crawl through all the pages 1-1000. These pages are identical, save for being differentiated by the final portion of the URL: &CurrentPage=#.

(2) Follow each link inside the results table containing job postings where the link's class = SearchResult. These are the only links within the table, so I am not in any trouble here.

(3) Store the information shown on the job description page in key:value JSON format. (This part works, in a rudimentary fashion)

I have worked with Scrapy and CrawlSpider before, using the 'rules = [Rule(LinkExtractor(allow=...' approach to recursively parse a page for all links matching a given regex pattern. I am currently stumped on step (1), crawling through the thousand result pages.

Below is my spider code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors import LinkExtractor
from genesisSpider.items import GenesisJob

class genesis_crawl_spider(CrawlSpider):
    name = "genesis"
    #allowed_domains = ['http://www.ghcjobs.apply2jobs.com']
    start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1']

    #allow &CurrentPage= up to 1000, currently ~ 512
    rules = [Rule(LinkExtractor(allow=("^https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm\?fuseaction=mExternal.returnToResults&CurrentPage=[1-1000]$")), 'parse_inner_page')]

    def parse_inner_page(self, response):
        self.log('===========Entrered Inner Page============')
        self.log(response.url)
        item = GenesisJob()
        item['url'] = response.url

        yield item

Here is the output of the spider, with a bit of the execution code on top cut off:

2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs
.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPa
ge=1> (referer: None) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs
.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToRes
ults> (referer: https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=
mExternal.returnToResults&CurrentPage=1) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: ===========Entrered Inner Page========
====
2014-09-02 16:02:48-0400 [genesis] DEBUG: https://www.ghcjobs.apply2jobs.com/Pro
fExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults
2014-09-02 16:02:48-0400 [genesis] DEBUG: Scraped from <200 https://www.ghcjobs.
apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResu
lts>
        {'url': 'https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPag
e=1&fuseaction=mExternal.returnToResults'}
2014-09-02 16:02:48-0400 [genesis] INFO: Closing spider (finished)
2014-09-02 16:02:48-0400 [genesis] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 930,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 92680,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 611000),
         'item_scraped_count': 1,
         'log_count/DEBUG': 7,
         'log_count/INFO': 7,
         'request_depth_max': 1,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 67000)}
2014-09-02 16:02:48-0400 [genesis] INFO: Spider closed (finished)

Currently, I am stuck on objective (1) of this project. As you can see, my spider only crawls through the start_url page. My regex should be targeting the page navigation buttons correctly as I have tested the regex. My callback function, parse_inner_page, is working, as is shown by the debugging comment I inserted, but only on the first page. Am I using 'Rule' incorrectly? I was thinking that maybe the page being HTTPS was somehow to blame...
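(As an aside, and not what turned out to be the fix: in a regex, [1-1000] is a character class rather than a numeric range, so it only matches a single '1' or '0' at that position. A pattern that actually covers pages 1 through 1000 might look like the sketch below; it reuses the callback name from the code above, and the escaping and anchoring are illustrative.)

# Hypothetical sketch: match CurrentPage=1 through CurrentPage=1000.
# ([1-9]\d{0,2}|1000) covers 1-999 plus 1000.
rules = [Rule(
    LinkExtractor(allow=(r"fuseaction=mExternal\.returnToResults&CurrentPage=([1-9]\d{0,2}|1000)$",)),
    callback='parse_inner_page',
    follow=True,
)]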

Just as a way to tinker toward a solution, I tried a manual request for the second page of results; this didn't work either. Here is the code for that attempt:

Request("https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2",  callback = 'parse_inner_page')

Can anyone offer any guidance? Is there maybe a better way to do this? I have been researching this on SO / Scrapy documentation since Friday. Thank you so much.

UPDATE: I have resolved the issue. The problem was with the start url I was using.

start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1'] 

This URL leads to a post-form-submission page, the result of clicking the "search" button on the search page. That page runs JavaScript on the client side to submit a form to the server, which returns the full job board, pages 1-512. However, there is another hard-coded URL that apparently calls the server without needing any client-side JavaScript. So now my start URL is

start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.searchJobs']

And everything is back on track! In the future, check whether there is a JavaScript-independent URL for calling the server resource you need.
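For completeness, the corrected setup amounts to something like this sketch (same class and callback as before; the pagination pattern here is illustrative, not the one I originally wrote):

start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.searchJobs']

rules = [Rule(LinkExtractor(allow=(r"fuseaction=mExternal\.returnToResults&CurrentPage=\d+",)),
              callback='parse_inner_page', follow=True)]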

Are you sure Scrapy sees the web page the same way you do? Nowadays, more and more sites are built with JavaScript and Ajax, and that dynamic content may need a fully functional browser to be populated. Neither Nutch nor Scrapy handles this out of the box.

First of all, you need to make sure the web content you are interested in can actually be retrieved by Scrapy. There are a few ways to check; I usually use urllib2 and beautifulsoup4 for a quick test. Your start page failed my test:

$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url = "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1"

>>> html = urllib2.urlopen(url).read()
>>> soup = BeautifulSoup(html)
>>> table = soup.find('div', {'id':'VESearchResults'})
>>> table.text
u'\n\n\n\r\n\t\t\tJob Title\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tArea of Interest\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tLocation\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tState\xa0\r\n\t\t\t\r\n\t\t\n\r\n\t\t\tCity\xa0\r\n\t\t\t\r\n\t\t\n\n\n\r\n\t\t\t\t\tNo results matching your criteria.\r\n\t\t\t\t\n\n\n'
>>> 

As you can see, the table says "No results matching your criteria." I think you might need to figure out why the content is not populated: cookies? POST instead of GET? User-Agent? etc.
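For example, a quick way to rule out the User-Agent is to repeat the same fetch with a browser-like header (a sketch in the same urllib2 style as above; the header value is just an illustration):

import urllib2
from bs4 import BeautifulSoup

url = "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1"
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html)
# Check whether the results table is populated this time.
print soup.find('div', {'id': 'VESearchResults'}).text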

Also, you can use the scrapy parse command to help you debug. For example, I use this command quite often:

scrapy parse http://example.com --rules

A few other Scrapy commands, and maybe Selenium, might be helpful down the road.

Here I am running scrapy shell in IPython to inspect your start URL. The first record I can see in my browser contains "Englewood", and it does not exist in the HTML that Scrapy grabbed.


Update:

What you are doing is really trivial scraping work, and you don't really need Scrapy; it is a bit of overkill. Here are my suggestions:

  1. Take a look at Selenium (I am assuming you write Python), and eventually run it headless when you deploy it on a server; see the sketch after this list.
  2. You can also implement this with PhantomJS, a much lighter JavaScript executor, to get the work done. Here is another Stack Overflow question that might be helpful.
  3. There are several other resources you can dig into.
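A minimal sketch of the first suggestion, assuming Selenium with a PhantomJS driver (the element id is the same VESearchResults div used in the test above; treat the details as illustrative):

from selenium import webdriver

# Let a real browser engine execute the page's JavaScript, then read the
# rendered results div that a plain HTTP client may not see populated.
driver = webdriver.PhantomJS()
driver.get("https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1")
results = driver.find_element_by_id("VESearchResults")
print results.text  # whatever the rendered table actually contains
driver.quit()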
