How to web-scrape a password-protected website
I have a website I need to scrape some data from (the site is https://www.merriam-webster.com/, and I want to scrape my saved words).
The site is password-protected, and I also think there is some JavaScript involved that I don't understand (I believe some elements are loaded by the browser, since they don't show up when I fetch the HTML).
I currently have a working solution using Selenium, but it has to open Firefox, and I would really like a solution that runs in the background as a console-only program.
How would I achieve this using the Python requests library and a minimum of additional third-party libraries, if possible?
Here is the code for my Selenium solution:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time
import json
# Create new driver
browser = webdriver.Firefox()
browser.get('https://www.merriam-webster.com/login')
# Find fields for email and password
username = browser.find_element_by_id("ul-email")
password = browser.find_element_by_id('ul-password')
# Find button to login
send = browser.find_element_by_id('ul-login')
# Send username and password
username.send_keys("username")
password.send_keys("password")
# Wait for accept cookies button to appear and click it
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "accept-cookies-button"))).click()
# Click the login button
send.click()
# Find button to go to saved words
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "ul-favorites"))).click()
words = {}
# Now logged in
# Loop over pages of saved words
for i in range(2):
    print("Now on page " + str(i+1))
    # Find the next page button
    nextpage = browser.find_element_by_class_name("ul-page-next")
    # Wait for the next page button to be clickable
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "ul-page-next")))
    # Find all the words on the page
    for word in browser.find_elements_by_class_name('item-headword'):
        # Add the href to the dictionary
        words[word.get_attribute("innerHTML")] = word.get_attribute("href")
    # Navigate to the next page
    nextpage.click()
browser.close()
# Write the words to a JSON file
with open("output.json", "w", encoding="utf-8") as file:
    file.write(json.dumps(words, indent=4))
If you want to use the requests module, you need to use a session.
To initialize a session, you do:
import requests

session_requests = requests.session()
Then you need a payload containing the username and password:
payload = {
    "username": <USERNAME>,
    "password": <PASSWORD>
}
然后登錄你做:
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)
Your session should now be logged in, so to visit any other password-protected page using the same session:
result = session_requests.get(
    url,
    headers=dict(referer=url)
)
You can then use result.content to view the content of that page.
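Putting the fragments above together, a minimal console-only sketch could look like this. Note that the form field names and the saved-words URL below are assumptions; inspect the site's actual login form and its network traffic to confirm them:

```python
import requests

LOGIN_URL = "https://www.merriam-webster.com/login"
SAVED_WORDS_URL = "https://www.merriam-webster.com/saved-words"  # assumed URL

def fetch_saved_words_page(username, password):
    """Log in with a requests.Session and return the saved-words page HTML.
    The form field names in the payload are assumptions; check the name
    attributes of the real login form's <input> elements."""
    payload = {"username": username, "password": password}
    with requests.Session() as session:
        # The session object stores the cookies set by the login response
        result = session.post(LOGIN_URL, data=payload,
                              headers={"referer": LOGIN_URL})
        result.raise_for_status()
        # Any later request on the same session sends those cookies back
        page = session.get(SAVED_WORDS_URL, headers={"referer": LOGIN_URL})
        page.raise_for_status()
        return page.text
```

Bear in mind the caveat from the question: if the saved words are injected by JavaScript after the page loads, they will not be present in the returned HTML. In that case, open the browser's network tab, find the JSON endpoint the page calls to fetch the words, and request that endpoint directly with the same session.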
EDIT: If your site uses a CSRF token, you will need to include it in the payload. To get the CSRF token, replace the payload section with:
from lxml import html
tree = html.fromstring(result.text)
#you may need to manually inspect the tree to find how your CSRF token is specified.
authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]
payload = {
    "username": <USERNAME>,
    "password": <PASSWORD>,
    "csrfmiddlewaretoken": authenticity_token
}
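One more pitfall with this approach: a failed login often still returns HTTP 200 with the login page re-rendered, so it is worth verifying the login actually succeeded before scraping. A minimal check (the marker string is an assumption; pick any text that only appears for logged-in users, such as a log-out link):

```python
def logged_in(response, marker="Log Out"):
    """Return True when the marker text, visible only to logged-in
    users, appears in a 200 response body."""
    return response.status_code == 200 and marker in response.text
```

You would call this as `logged_in(result)` right after the login POST and abort with a clear error message if it returns False.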