
Login to a website with Google using BeautifulSoup and Python 2.7

I'm writing a Python web scraper for Quora, but I need to log in with Google first. I've searched the web, but nothing I found addresses my problem. Here is my code:

# -*- coding: utf-8 -*-
import os
import urllib    # needed below for urllib.urlencode
import urllib2
import cookielib
import requests
from bs4 import BeautifulSoup

# Store the cookies and create an opener that will hold them
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Add our headers
opener.addheaders = [('User-agent', 'RedditTesting')]

# Install our opener (note that this changes the global opener to the one
# we just made, but you can also just call opener.open() if you want)
urllib2.install_opener(opener)

# The action/ target from the form
authentication_url = 'https://quora.com'

# Input parameters we are going to send
payload = {
  'op': 'login-main',
  'user': '<username>',
  'passwd': '<password>'
}

# Use urllib to encode the payload
data = urllib.urlencode(payload)

# Build our Request object (supplying 'data' makes it a POST)
req = urllib2.Request(authentication_url, data)

# Make the request and read the response
resp = urllib2.urlopen(req)
contents = resp.read()




# specify the url
quote_page = "https://www.quora.com/"

# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')

# take out the <div> of name and get its value
name_box = soup.find('div', attrs={"class": "ContentWrapper"})

name = name_box.text.strip()  # strip() removes leading and trailing whitespace
print name

# download every image on the page
for link in soup.find_all('img'):
    image = link.get("src")
    image_name = os.path.split(image)[1]
    print(image_name)

    r2 = requests.get(image)
    with open(image_name, "wb") as f:
        f.write(r2.content)
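For readers on Python 3, where `urllib2`, `urllib.urlencode`, and `cookielib` no longer exist under those names, a minimal sketch of the payload-encoding step looks like this (same placeholder credentials as above):

```python
# Python 3 sketch: urllib2 -> urllib.request, cookielib -> http.cookiejar,
# and urllib.urlencode -> urllib.parse.urlencode.
from urllib.parse import urlencode

payload = {
    'op': 'login-main',
    'user': '<username>',
    'passwd': '<password>',
}
# A POST body must be bytes in Python 3, hence the .encode()
data = urlencode(payload).encode('utf-8')
print(data)
```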

Since I don't have an actual username for the site, I used my own Gmail account. To log in, I used some code from other questions, but it doesn't work.

Any indentation errors are due to my formatting here.
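One caveat with the image loop in the question: `os.path.split` applied to a full URL keeps any query string as part of the "filename". A small Python 3 sketch (the URL below is a made-up example) shows how parsing the URL path first with `urlparse` avoids that:

```python
import os
from urllib.parse import urlparse

url = "https://example.com/images/logo.png?size=large"

# Splitting the raw URL keeps the query string in the tail
naive = os.path.split(url)[1]               # 'logo.png?size=large'

# Splitting only the parsed path gives a clean filename
clean = os.path.split(urlparse(url).path)[1]  # 'logo.png'

print(naive, clean)
```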

To log in and scrape, use a Session: make a POST request with your credentials as the payload, then scrape.

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    # POST the credentials; the session keeps the login cookies
    p = s.post("https://quora.com", data={
        "email": '*******',
        "password": "*************"
    })
    print(p.text)

    # subsequent requests on the same session reuse those cookies
    base_page = s.get('https://quora.com')
    soup = BeautifulSoup(base_page.content, 'html.parser')
    print(soup.title)
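To sanity-check the parsing steps without hitting the live site, the same `find`/`find_all` calls can be run against an inline HTML snippet. This is a minimal sketch; the markup below is invented for illustration, not Quora's real structure:

```python
from bs4 import BeautifulSoup

# A tiny, self-contained stand-in for a fetched page
html = """
<html><head><title>Example</title></head>
<body>
  <div class="ContentWrapper">  Hello, Quora  </div>
  <img src="https://example.com/images/logo.png">
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)                                  # Example

name_box = soup.find('div', attrs={"class": "ContentWrapper"})
print(name_box.text.strip())                            # Hello, Quora

srcs = [img.get("src") for img in soup.find_all('img')]
print(srcs)
```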


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM