How to web-scrape a password-protected website

I have a website from which I need to scrape some data (the website is https://www.merriam-webster.com/ and I want to scrape the saved words).

This website is password protected, and I think there is also some JavaScript involved that I don't understand (certain elements seem to be loaded by the browser, since they don't show up when I wget the HTML).

I currently have a solution using Selenium. It works, but it requires Firefox to be opened, and I would really like a solution that can run in the background as a console-only program.

How would I achieve this, if possible using Python's requests library and as few additional third-party libraries as possible?

Here is the code for my Selenium solution:

import json

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# Create new driver
browser = webdriver.Firefox()
browser.get('https://www.merriam-webster.com/login')

# Find fields for email and password
# (the find_element_by_* helpers were removed in Selenium 4; use By locators)
username = browser.find_element(By.ID, "ul-email")
password = browser.find_element(By.ID, "ul-password")
# Find button to login
send = browser.find_element(By.ID, "ul-login")
# Send username and password
username.send_keys("username")
password.send_keys("password")

# Wait for accept cookies button to appear and click it
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "accept-cookies-button"))).click()
# Click the login button
send.click()

# Find button to go to saved words
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "ul-favorites"))).click()


words = {}
# Now logged in
# Loop over pages of saved words
for i in range(2):
    print("Now on page " + str(i+1))
    # Wait for the next page button to be clickable and keep a reference to it
    nextpage = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, "ul-page-next")))

    # Find all the words on the page
    for word in browser.find_elements(By.CLASS_NAME, "item-headword"):
        # Add the href to the dictionary
        words[word.get_attribute("innerHTML")] = word.get_attribute("href")
    # Navigate to the next page
    nextpage.click()

browser.close()

# Write the words to a JSON file
with open("output.json", "w", encoding="utf-8") as file:
    json.dump(words, file, indent=4)

If you want to use the requests module, you need to use a session so that the login cookies are kept between requests.

To initialise a session:

import requests

session_requests = requests.Session()

Then you need a payload with the username and password:

# The keys must match the "name" attributes of the login form's input fields.
payload = {
    "username": "<USERNAME>",
    "password": "<PASSWORD>",
}

Then, to log in:

# login_url is the URL that the login form submits to.
result = session_requests.post(
    login_url,
    data=payload,
    headers={"Referer": login_url},
)
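A 200 response to the POST does not by itself prove that the login worked, since many sites answer a failed login with a normal page that shows the form again. The sketch below is one way to check, under assumptions: the helper name `log_in` is hypothetical, and "the session received at least one cookie" is only a heuristic for success. The `session` argument is expected to behave like a `requests.Session`.

```python
def log_in(session, login_url, payload):
    """POST the credentials through `session` and return the response.

    `session` is expected to behave like requests.Session (hypothetical helper).
    """
    response = session.post(
        login_url,
        data=payload,
        headers={"Referer": login_url},
    )
    # Raise on 4xx/5xx answers instead of silently continuing.
    response.raise_for_status()
    # Heuristic only: a successful login normally sets a session cookie.
    if len(session.cookies) == 0:
        raise RuntimeError(
            "login did not set any cookies; check the URL and form field names"
        )
    return response
```

It would be called as log_in(session_requests, login_url, payload) in place of the bare post call.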

Now your session should be logged in, so to fetch any other password-protected page, use the same session:

result = session_requests.get(
    url,
    headers={"Referer": url},
)

You can then read the page from result.content (raw bytes) or result.text (decoded string).
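Once the page text is available, the saved words still have to be pulled out of it. The sketch below uses only the standard library and assumes the returned HTML contains the same `item-headword` links that the Selenium script in the question looks for. That may not hold if the list is filled in by JavaScript after the page loads. The names `HeadwordParser` and `extract_saved_words` are made up for illustration.

```python
from html.parser import HTMLParser


class HeadwordParser(HTMLParser):
    """Collect {link text: href} for every <a class="item-headword">."""

    def __init__(self):
        super().__init__()
        self.words = {}
        self._inside = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "item-headword" in classes:
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching link.
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.words["".join(self._text).strip()] = self._href
            self._inside = False


def extract_saved_words(page_html):
    parser = HeadwordParser()
    parser.feed(page_html)
    parser.close()
    return parser.words
```

Calling extract_saved_words(result.text) returns a dictionary shaped like the one the Selenium script writes to output.json. The href values are returned exactly as they appear in the HTML, so relative links would still need to be joined with the site's base URL.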

EDIT: If your site uses a CSRF token, you will need to include it in the payload. To get the CSRF token from the login page, replace the "payload" section with:

from lxml import html

# Fetch the login page first; the token is embedded in its form.
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
# You may need to inspect the page manually to find how your CSRF token is specified.
authenticity_token = tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")[0]

payload = {
    "username": "<USERNAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": authenticity_token,
}
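Since the question asks for as few third-party libraries as possible, the same token lookup can be sketched with the standard library's `html.parser` instead of lxml. The names `CsrfTokenParser` and `find_csrf_token` are hypothetical, and the default field name `csrfmiddlewaretoken` is simply the one used above; the actual site may name its hidden field differently or not use one at all.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Remember the value of the first <input> with the given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (
            self.token is None
            and tag == "input"
            and attrs.get("name") == self.field_name
        ):
            self.token = attrs.get("value")


def find_csrf_token(page_html, field_name="csrfmiddlewaretoken"):
    """Return the token string, or None if no matching input exists."""
    parser = CsrfTokenParser(field_name)
    parser.feed(page_html)
    parser.close()
    return parser.token
```

With this, authenticity_token = find_csrf_token(result.text) would replace the lxml lines, and a None result signals that the page has no such field.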

