
BeautifulSoup login - How to get the CSRF field with a specific attribute and value

I am using the following script to log in to LinkedIn and then scrape the HTML with Beautiful Soup.

The login authenticates with no issue (I see my account info), but when I try to load the page I get a "fs.config({"failureRedirect})" error.

import cookielib
import os
import urllib
import urllib2
import re
import string
import sys
from bs4 import BeautifulSoup

username = "MY USERNAME"
password = "PASSWORD"

ofile = open('Text_Dump.txt', "wb")

cookie_filename = "parser.cookies.txt"

class LinkedInParser(object):

    def __init__(self, login, password):
        """ Start up... """
        self.login = login
        self.password = password

        # Simulate browser with cookies enabled
        self.cj = cookielib.MozillaCookieJar(cookie_filename)
        if os.access(cookie_filename, os.F_OK):
            self.cj.load()
        self.opener = urllib2.build_opener(
            urllib2.HTTPRedirectHandler(),
            urllib2.HTTPHandler(debuglevel=0),
            urllib2.HTTPSHandler(debuglevel=0),
            urllib2.HTTPCookieProcessor(self.cj)
        )
        self.opener.addheaders = [
            ('User-agent', ('Mozilla/4.0 (compatible; MSIE 6.0; '
                           'Windows NT 5.2; .NET CLR 1.1.4322)'))
        ]

        # Login
        title = self.loginPage()

        sys.stderr.write("Login"+ str(self.login) + "\n")

        #title = self.loadTitle()
        ofile.write(title)

    def loadPage(self, url, data=None):
        """
        Utility function to load HTML from URLs for us with hack to continue despite 404
        """
        # We'll print the url in case of infinite loop
        # print "Loading URL: %s" % url
        try:
            if data is not None:
                response = self.opener.open(url, data)
            else:
                response = self.opener.open(url)
            return ''.join(response.readlines())
        except:
            # If URL doesn't load for ANY reason, try again...
            # Quick and dirty solution for 404 returns because of network problems
            # However, this could infinite loop if there's an actual problem
            return self.loadPage(url, data)

    def loginPage(self):
        """
        Handle login. This should populate our cookie jar.
        """
        html = self.loadPage("https://www.linkedin.com/")
        soup = BeautifulSoup(html)
        csrf = soup.find(id="csrfToken-postModuleForm")['value']

        login_data = urllib.urlencode({
            'session_key': self.login,
            'session_password': self.password,
            'loginCsrfParam': csrf,
        })

        html = self.loadPage("https://www.linkedin.com/uas/login-submit", login_data)

        return

    def loadTitle(self):
        html = self.loadPage("https://www.linkedin.com/")
        soup = BeautifulSoup(html)
        return soup.get_text().encode('utf-8').strip()

parser = LinkedInParser(username, password)
ofile.close()

The login script came from: Logging in to LinkedIn with python requests sessions

Any thoughts?

Your syntax is wrong.

First, the CSRF token is in an input field, not a div tag. Inspect the element in your browser and you will see.

Second, to find a tag with a specific attribute and value, use .find('type_of_tag', {'tag_attribute': 'value'}).

Third, to read the value of a specific attribute on the tag you found, use bracket syntax or .get().

Here is the code to use in place of yours:

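html = self.loadPage("https://www.linkedin.com/")
soup = BeautifulSoup(html)
# The CSRF token is in an <input> tag; match it by its name attribute
csrf = soup.find('input', {"name" : "csrfToken"})
# Read the token from the tag's value attribute
csrf_token = csrf['value']
print csrf_token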
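
To try the three points above without hitting the network, here is a minimal sketch in Python 3 (the question's script is Python 2). The `SAMPLE_HTML` markup, the token value, the credentials, and the `extract_csrf` helper are made up for illustration; the real login page may use different field names. It uses `.get()` and a `None` check so that a missing field does not raise an exception.

```python
from urllib.parse import urlencode

from bs4 import BeautifulSoup

# Hypothetical stand-in for the login page; the real markup may differ.
SAMPLE_HTML = """
<form id="login" action="/uas/login-submit" method="post">
  <input type="text" name="session_key">
  <input type="password" name="session_password">
  <input type="hidden" name="csrfToken" value="ajax:1234567890">
</form>
"""


def extract_csrf(html, field_name="csrfToken"):
    """Return the value of the <input> named field_name, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Tag type first, then a dict of attribute/value pairs to match.
    tag = soup.find("input", {"name": field_name})
    if tag is None:
        return None
    # .get() returns None instead of raising if the attribute is missing.
    return tag.get("value")


token = extract_csrf(SAMPLE_HTML)

# Build the POST body the same way the question's script does
# (field names copied from the question; the credentials are placeholders).
login_data = urlencode({
    "session_key": "user@example.com",
    "session_password": "secret",
    "loginCsrfParam": token,
})
print(token)
```

If `extract_csrf` returns `None`, the field name or tag type probably does not match what the page actually serves, so it is worth checking the page source before posting the form.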
