
Scrapy XPath working in Shell but not in code

For a couple of hours I have been struggling with the following. I am trying to scrape https://www.upwork.com/jobs/_~0180b9eef40aafe057/ (and similar postings).

My XPath expression works in the shell and in an XPath verifier, but not in my code.

When I output the response to a text file using:

    # Dump the raw response body so the XPath can be tested offline
    with open('response.html', 'w+') as f:
        f.write(response.body)

and then test the XPath against that HTML using http://videlibri.sourceforge.net/cgi-bin/xidelcgi, it works fine.
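As a side note, another way to compare what the spider actually receives with what the shell sees is to open an interactive shell from inside the callback with Scrapy's inspect_response helper. A minimal sketch (the rest of the callback is abbreviated):

    from scrapy.shell import inspect_response

    def parse_item(self, response):
        # Drops into an interactive Scrapy shell bound to this exact
        # response, so the XPath can be tried against what the spider
        # really downloaded (cookies, login state and all).
        inspect_response(response, self)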

This works in the shell:

for item in response.xpath("//p[strong = 'About the Client']/following-sibling::p"):
    print " ".join(map(unicode.strip, item.xpath(".//text()").extract()))
    print 'Succes!'

But when I use it in my Scrapy spider, it returns nothing.

I've tried a lot of different solutions, but nothing seems to work.

EDIT: added the complete code:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ypscrape.items import item1
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.loader import ItemLoader
import arrow

import logging
import re


class MySpider(CrawlSpider):

    # Login credentials for account, so more details are available
    #Rss Token for the RSS feed which pulls the new links
    rsstoken = 'REDACTED'
    user = 'REDACTED'
    password = 'REDACTED'

    name = 'dataup'
    allowed_domains = ['upwork.com']
    login_page = 'http://www.upwork.com/login'
    rssurl = 'https://www.upwork.com/ab/feed/jobs/rss?api_params=1&q=&securityToken='+ rsstoken

    # can probably be removed
    rules = (
        Rule(LinkExtractor(allow=r'upwork\.com\/jobs\/.*?_%'), callback='parse_item', follow=False),
    )

    #Called when Spider is started, initiates the login request
    def start_requests(self):
        self.log("start request started")
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # Use the RSS feed to gather the newest links
    def get_urls_from_rss(self, response):
        urllist = []
        content = response
        self.log("Get rss from url")
        #print str(content.body)
        gathered_urls = re.findall('(https\:\/\/.*?\/jobs\/.*?)source=rss', str(content.body))

        # Request the URLS and send them to parse_item
        for url in gathered_urls:
            if url not in urllist:
                #Check if URL has not been visited before ADD THIS
                urllist.append(url)
                yield scrapy.Request(url, callback=self.parse_item)

    def login(self, response):
        """Generate a login request."""

        self.log("login request started")
        return FormRequest.from_response(
            response, formname='login',
            formdata={'login[username]': self.user,
                      'login[password]': self.password},
            callback=self.check_login_response, method="POST")

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        self.log("check request started")

        if "<title>My Job Feed</title>" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            yield scrapy.Request(self.rssurl, callback = self.get_urls_from_rss )
        else:
            self.log("Bad times :( Logging in failed")
            # Something went wrong, we couldn't log in, so nothing happens.
        #return self.initialized()

    def parse_item(self, response):
        self.logger.info('Crawling item page! %s', response.url)

        # Collect the data from the page

        with open('response.html','w+') as f:
            f.write(response.body)

        for item in response.xpath("//p[strong = 'About the Client']/following-sibling::p"):
            print " ".join(map(unicode.strip, item.xpath(".//text()").extract()))
            print 'Bingo'

        l = ItemLoader(item1(), response)

        l.add_value('timestamp', arrow.utcnow().format('YYYY-MM-DD HH:mm'))
        l.add_xpath('category1', '//*[@id="layout"]/div[2]/div[3]/div[1]/a/text()')

        return l.load_item()
        # Scrape data from page

EDIT 2: I think I found the solution. Replacing the XPath with

//p[strong]/strong

seems to solve the problem. What was the problem? I think it has to do with the encoding: the expression cannot find 'About the Client' because, in the response the spider receives, the text is something like ' About the Client ', padded with whitespace or something else encoding-related. Thank you for the help.
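If that diagnosis is right, a whitespace-tolerant match should also work without giving up the anchor text. This is an untested sketch that uses XPath's normalize-space(), which strips leading/trailing whitespace and collapses internal runs before comparing:

    # Matches even if the <strong> text is padded with whitespace
    for item in response.xpath("//p[strong[normalize-space(.) = 'About the Client']]/following-sibling::p"):
        print " ".join(map(unicode.strip, item.xpath(".//text()").extract()))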

I've quickly put it into a spider:

import scrapy


class UpworkSpider(scrapy.Spider):
    name = "upwork"
    allowed_domains = ["upwork.com"]
    start_urls = [
        "https://www.upwork.com/jobs/_~0180b9eef40aafe057/",
    ]

    def parse(self, response):
        for item in response.xpath("//p[strong = 'About the Client']/following-sibling::p"):
            print " ".join(map(unicode.strip, item.xpath(".//text()").extract()))
            print 'Succes!'

Then, I run it as:

$ scrapy runspider spider.py

And I get:

Croatia  Kastel Sucurac
            04:56 PM 
Succes!
 4
        Jobs Posted   0% Hire Rate,
        4 Open Jobs 
Succes!

in the output.
