For a couple of hours I have been struggling with the following: I am trying to scrape https://www.upwork.com/jobs/_~0180b9eef40aafe057/ (and similar postings).
My XPath expression works in the Scrapy shell and in an XPath verifier, but not in my code.
When I output the response into a text file using:
with open('response.html', 'w+') as f:
    f.write(response.body)
and then test the xpath on the html code by using http://videlibri.sourceforge.net/cgi-bin/xidelcgi it is working fine.
This works in the Shell:
for item in response.xpath("//p[strong = 'About the Client']/following-sibling::p"):
    print " ".join(map(unicode.strip, item.xpath(".//text()").extract()))
    print 'Succes!'
But when I use it in my Scrapy spider, it returns nothing.
I've tried a lot of different solutions, but nothing seems to work.
EDIT added complete code:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ypscrape.items import item1
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.loader import ItemLoader
import arrow
import logging
import re


class MySpider(CrawlSpider):
    # Login credentials for the account, so more details are available
    # RSS token for the RSS feed which pulls the new links
    rsstoken = 'REDACTED'
    user = 'REDACTED'
    password = 'REDACTED'

    name = 'dataup'
    allowed_domains = ['upwork.com']
    login_page = 'http://www.upwork.com/login'
    rssurl = 'https://www.upwork.com/ab/feed/jobs/rss?api_params=1&q=&securityToken=' + rsstoken

    # can probably be removed
    rules = (
        Rule(LinkExtractor(allow=r'upwork\.com\/jobs\/.*?_%'), callback='parse_item', follow=False),
    )

    # Called when the spider is started; initiates the login request
    def start_requests(self):
        self.log("start request started")
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # Use the RSS feed to gather the newest links
    def get_urls_from_rss(self, response):
        urllist = []
        content = response
        self.log("Get rss from url")
        # print str(content.body)
        gathered_urls = re.findall('(https\:\/\/.*?\/jobs\/.*?)source=rss', str(content.body))

        # Request the URLs and send them to parse_item
        for url in gathered_urls:
            if url not in urllist:
                # Check if URL has not been visited before ADD THIS
                urllist.append(url)
                yield scrapy.Request(url, callback=self.parse_item)

    def login(self, response):
        """Generate a login request."""
        self.log("login request started")
        return FormRequest.from_response(
            response, formname='login',
            formdata={'login[username]': self.user, 'login[password]': self.password},
            callback=self.check_login_response, method="POST")

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        self.log("check request started")
        if "<title>My Job Feed</title>" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            yield scrapy.Request(self.rssurl, callback=self.get_urls_from_rss)
        else:
            self.log("Bad times :( Logging in failed")
            # Something went wrong, we couldn't log in, so nothing happens.
            # return self.initialized()

    def parse_item(self, response):
        self.logger.info('Crawling item page! %s', response.url)
        # Collect the data from the page
        with open('response.html', 'w+') as f:
            f.write(response.body)

        for item in response.xpath("//p[strong = 'About the Client']/following-sibling::p"):
            print " ".join(map(unicode.strip, item.xpath(".//text()").extract()))
            print 'Bingo'

        l = ItemLoader(item1(), response)
        l.add_value('timestamp', arrow.utcnow().format('YYYY-MM-DD HH:mm'))
        l.add_xpath('category1', '//*[@id="layout"]/div[2]/div[3]/div[1]/a/text()')
        return l.load_item()
        # Scrape data from page
# Scrape data from page
EDIT 2: I think I found the solution. Replacing the XPath with
//p[strong]/strong
seems to solve the problem. What was the problem? I think it is whitespace related: the comparison strong = 'About the Client' fails because in the response the element's text is something like '  About the Client  ', with surrounding whitespace, so the strict string equality never matches. Thank you for the help.
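To confirm the whitespace theory outside of Scrapy, here is a minimal sketch using lxml directly (the HTML snippet is invented for illustration): the strict string comparison fails once the strong element's text carries surrounding whitespace, while wrapping it in XPath's normalize-space() still matches.

```python
# Minimal sketch (lxml instead of Scrapy, purely to isolate the XPath
# behaviour). The snippet below is a made-up stand-in for the real page.
from lxml import html

snippet = b"""
<div>
  <p><strong>
    About the Client
  </strong></p>
  <p>Croatia</p>
</div>
"""

tree = html.fromstring(snippet)

# Strict equality fails: the string-value of <strong> is
# "\n    About the Client\n  ", not "About the Client".
strict = tree.xpath("//p[strong = 'About the Client']/following-sibling::p")
print(len(strict))  # 0

# normalize-space() strips leading/trailing whitespace and collapses
# internal runs of whitespace before comparing, so this matches.
relaxed = tree.xpath(
    "//p[normalize-space(strong) = 'About the Client']/following-sibling::p"
)
print(relaxed[0].text)  # Croatia
```

The same normalize-space() predicate works unchanged in `response.xpath(...)` inside a spider, which avoids giving up the 'About the Client' anchor altogether.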
I've quickly put it into a spider:
import scrapy


class UpworkSpider(scrapy.Spider):
    name = "upwork"
    allowed_domains = ["upwork.com"]
    start_urls = [
        "https://www.upwork.com/jobs/_~0180b9eef40aafe057/",
    ]

    def parse(self, response):
        for item in response.xpath("//p[strong = 'About the Client']/following-sibling::p"):
            print " ".join(map(unicode.strip, item.xpath(".//text()").extract()))
            print 'Succes!'
Then, I run it as:
$ scrapy runspider spider.py
And I get:
Croatia Kastel Sucurac
04:56 PM
Succes!
4
Jobs Posted 0% Hire Rate,
4 Open Jobs
Succes!
in the output.