
Error: raise ValueError("No <form> element found in %s" % response) occurs when trying to log in with Scrapy

Problem Description:

I want to crawl some information from my college's BBS. Here is the address: http://bbs.byr.cn. Below is the code of my spider:

from lxml import etree  # used only for debugging with etree.tostring()
import scrapy
try:
    from scrapy.spiders import Spider
except ImportError:
    from scrapy.spiders import BaseSpider as Spider
from scrapy.http import Request


class ITJobInfoSpider(scrapy.Spider):
    name = "ITJobInfoSpider"
    start_urls = ["http://bbs.byr.cn/#!login"]

    def parse(self, response):
        # Locate the login form on the page, fill it in and submit it.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'method': 'post', 'id': 'username', 'passwd': 'password'},
            formxpath='//form[@action="/login"]',
            callback=self.after_login
        )

    def after_login(self, response):
        # response.text is the decoded body (a str), so it can be concatenated.
        print("######response body: " + response.text + "\n")
        if "authentication failed" in response.text:
            print("#######Login failed#########\n")
        return

However, with this code I often get an error: raise ValueError("No <form> element found in %s" % response)

My Investigation:

I found that this error is raised when Scrapy parses the HTML of http://bbs.byr.cn. Scrapy parses the page with lxml; below is the relevant code:

root = LxmlDocument(response, lxml.html.HTMLParser)
forms = root.xpath('//form')
if not forms:
    raise ValueError("No <form> element found in %s" % response)

So I inspected the parsed tree with print etree.tostring(root) and found that the HTML tag </form> had been turned into &lt;/form&gt;. No wonder forms = root.xpath('//form') returns an empty list.

But I don't know why this happens. Could it be the encoding of the HTML? (The page is encoded in GBK, not UTF-8.) Thanks in advance to anyone who can help. BTW, if anyone wants to write code against the website, I can provide a test account; please leave an email address in the comments.

Thanks a lot, guys!!
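One possible reading of the escaped output described above, offered as an assumption rather than something verified against the real page: if the login form markup sits inside a text-only element such as a <script> template, an HTML parser treats it as plain text, not as elements, so an XPath like //form finds nothing and serializing the tree as XML escapes the angle brackets. The sketch below reproduces the "no form found" half of that behaviour with the standard library's html.parser as a stand-in for lxml. The two page strings and the names FormCounter and count_forms are hypothetical.

```python
from html.parser import HTMLParser


class FormCounter(HTMLParser):
    """Counts <form> start tags that the parser treats as real elements."""

    def __init__(self):
        super().__init__()
        self.forms = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.forms += 1


def count_forms(html):
    """Return how many <form> elements the parser sees in the given HTML."""
    parser = FormCounter()
    parser.feed(html)
    parser.close()
    return parser.forms


# Hypothetical pages for illustration, NOT the real markup of bbs.byr.cn.
plain_page = '<body><form action="/login"><input name="id"></form></body>'
template_page = (
    '<body><script type="text/template">'
    '<form action="/login"><input name="id"></form>'
    '</script></body>'
)

print(count_forms(plain_page))     # 1: the form is a real element
print(count_forms(template_page))  # 0: the form is just text inside <script>
```

If the real page behaves like template_page, the form only becomes an element after JavaScript runs in a browser, which Scrapy does not do by itself.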

There seems to be some JavaScript redirection happening.

In this case using Splash would be overkill, though. Simply append /index to the start URL: http://bbs.byr.cn → http://bbs.byr.cn/index
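As a side note on the start URL used in the question (http://bbs.byr.cn/#!login): the part after # is a URL fragment, which is interpreted by the browser and is not part of the HTTP request. A minimal sketch with the standard library shows what is actually requested; the variable names are only illustrative.

```python
from urllib.parse import urldefrag, urlsplit

# Start URL from the question's spider.
question_url = "http://bbs.byr.cn/#!login"

parts = urlsplit(question_url)
print(parts.path)      # /
print(parts.fragment)  # !login

# Strip the fragment to get the URL the server actually sees.
requested_url, fragment = urldefrag(question_url)
print(requested_url)   # http://bbs.byr.cn/
```

So a crawler pointed at the #!login URL receives the site's root page, and any "login" view selected by the fragment exists only after client-side JavaScript has run. Requesting a real path such as /index avoids that.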

This would be the complete working spider:

from scrapy import Spider
from scrapy.http import FormRequest

class ByrSpider(Spider):
    name = 'byr'
    start_urls = ['http://bbs.byr.cn/index']

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formdata={'method':'post','id': 'username', 'passwd':'password'},
            formxpath='//form[@action="/login"]',
            callback=self.after_login)

    def after_login(self, response):
        self.logger.debug(response.text)
        if 'authentication failed' in response.text:
            self.logger.debug('Login failed')
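On the encoding question raised in the post: GBK is ASCII-compatible, so markup characters such as < and > are encoded as the same bytes in GBK and UTF-8. A wrong encoding guess can garble the Chinese text, but it should not turn </form> into &lt;/form&gt; by itself. A small sketch, using an illustrative markup string and sample text that are not taken from the site:

```python
# Illustrative markup, not copied from bbs.byr.cn.
markup = '<form action="/login"></form>'

# ASCII-only markup is byte-for-byte identical in both encodings.
same_bytes = markup.encode("gbk") == markup.encode("utf-8")
print(same_bytes)  # True

# Non-ASCII text is where the two encodings differ.
text = "登录"
gbk_bytes = text.encode("gbk")
decoded_wrong = gbk_bytes.decode("utf-8", errors="replace")
decoded_right = gbk_bytes.decode("gbk")
print(decoded_wrong == text)  # False: the text is garbled, tags are unaffected
print(decoded_right == text)  # True
```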

