Login automation and crawling with Scrapy (Python)

Question

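I have been trying to write a script to retrieve my accepted solutions on SPOJ.

I am stuck on automating the login process. I find Scrapy hard to understand. After going through a lot of documentation and code examples, I have a vague idea of what happens behind the scenes, and this is where I am now:

(I have commented the code where needed.)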

import os
import os.path
import scrapy
import urllib.request
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from bs4 import BeautifulSoup

class LoginSpider(scrapy.Spider):
    name = 'spoj'
    start_urls = ['http://www.spoj.com/login']
    outputFile = open('output.txt' , 'w')

    def parse(self, response):
        username = input('Enter username\n')
        password = input('Enter password\n')
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': username, 'password': password},
            callback=self.after_login
        )

    def after_login(self, response):

        # Even if I type in the correct username and password, it always leads to
        # an authentication failure, and the following if statement evaluates to true.

        if str.encode('Authentication failed!') in response.body:
            self.logger.error("Login failed")
            print ('Incorrect credentials')
            return    

        print('lol') # of course this isn't printed
        return scrapy.Request(url = "http://www.spoj.com/myaccount/" , callback = self.retrieve_codes ) 

    # needless to say, the following function is never called
    def retrieve_codes(self, response):

        print('Hello testing!') 
        url = 'http://www.spoj.com/files/src/16731976/'
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html , 'html.parser')
        self.outputFile.write(str(soup.prettify()))

In the documentation example, I changed

if "authentication failed" in response.body:

to

if str.encode('Authentication failed!') in response.body:

for two reasons:

1. I was getting the error "a bytes-like object is required, not 'str'".
2. When wrong credentials are entered on SPOJ, the page shows "Authentication failed!" and not "authentication failed". We need to be exact here.
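To illustrate both points, here is a minimal sketch in plain Python. The `body` value is a made-up stand-in for `response.body`, which Scrapy exposes as bytes; it is not SPOJ's actual page. It shows why a `str` needle fails against bytes and why the match is case-sensitive.

```python
# Made-up stand-in for response.body (bytes), not the real SPOJ page.
body = b"<html><p>Authentication failed!</p></html>"

error = None
try:
    # A str needle cannot be searched for inside a bytes haystack.
    "Authentication failed!" in body
except TypeError as exc:
    error = str(exc)

# Option 1: compare bytes with bytes.
found_bytes = b"Authentication failed!" in body

# Option 2: decode first, then compare str with str.
found_text = "Authentication failed!" in body.decode("utf-8")

# The match is case-sensitive, so the lowercase variant is not found.
wrong_case = b"authentication failed" in body

print(error)
print(found_bytes, found_text, wrong_case)
```

In a Scrapy callback, `response.text` gives the decoded body of a text response, so the second option can be written without calling `decode` by hand.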

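Please tell me where I am going wrong. I have not found any good resources online that discuss form authentication in detail. After seeing this code in the documentation, my initial questions were:

1. Is this the only way to do it?
2. Does this approach work for every website? As I understand it, the complexity of this process varies from site to site.
3. Where can I find a more descriptive explanation of what happens behind the scenes?

I also tried robobrowser, but in vain. I was hoping for documentation as good as Beautiful Soup's.

Thanks!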

Answer 1

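You are using the wrong formdata field names. You need to adapt the example code from the Scrapy documentation to the specific website. Currently, you use username and password as the formdata fields.

If you open your browser's developer tools while logging in, you can see that the fields sent in the POST request are named login_user and password.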
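If you would rather check the field names from a script than from the developer tools, something like the following rough sketch could work. It uses only the standard library. `SAMPLE_LOGIN_FORM` is made-up markup for illustration (the real SPOJ page may look different), and `FormFieldCollector` and `form_field_names` are hypothetical helper names.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collects the name attribute of every named <input> tag."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            name = dict(attrs).get("name")
            if name:  # inputs without a name are not submitted
                self.fields.append(name)


# Illustrative markup only; an assumption, not copied from SPOJ.
SAMPLE_LOGIN_FORM = """
<form action="/login" method="post">
  <input type="text" name="login_user">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""


def form_field_names(html):
    """Return the names of the input fields found in the given HTML."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields


print(form_field_names(SAMPLE_LOGIN_FORM))
```

Feeding it the HTML of the real login page would list the names that the formdata keys have to match.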

So this should be an easy fix :-)

Solution 1
2 Accepted 2017-05-12 13:03:45